Comment by scosman
13 days ago
At 109b params you’ll need a ton of memory. We’ll have to wait for evals of the quants to know how much.
Sure but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090 or 4090). Also you can download quantizations.
I have Apple Silicon and it's the worst when it comes to prompt processing time. So unless you want to have small contexts, it's not fast enough to let you do any real work with it.
Apple should've invested more in bandwidth, but it's Apple and it has lost its visionary. Imagine having 512GB on an M3 Ultra and not being able to run even a 70B model at a decent context window at usable speed.
Prompt processing is heavily compute-bound, so it depends mostly on raw compute. Bandwidth mostly affects token generation speed.
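A rough back-of-envelope sketch of that split, with placeholder hardware numbers I picked (not measurements) and the 17B-active / 4-bit figures from elsewhere in the thread:

    # Back-of-envelope: prefill cost scales with prompt length on the compute
    # side, while decode re-reads the active weights for every generated token.
    # The hardware numbers below are assumptions, not measurements.
    active_params = 17e9      # active params per token (MoE figure from the thread)
    bytes_per_param = 0.5     # 4-bit quant
    flops_per_param = 2       # roughly one multiply-add per weight per token

    compute_flops = 30e12     # assumed sustained matmul throughput
    bandwidth = 800e9         # assumed memory bandwidth, bytes/s

    prompt_tokens = 8000
    # Prefill: the weights stream through roughly once for the whole batch,
    # but every prompt token multiplies against them, so arithmetic dominates.
    prefill_compute_s = prompt_tokens * active_params * flops_per_param / compute_flops
    prefill_memory_s = active_params * bytes_per_param / bandwidth

    # Decode: one token at a time, so the weights are re-read every token
    # and bandwidth dominates.
    decode_compute_ms = active_params * flops_per_param / compute_flops * 1e3
    decode_memory_ms = active_params * bytes_per_param / bandwidth * 1e3

    print(f"prefill: {prefill_compute_s:.0f}s compute vs {prefill_memory_s:.2f}s memory traffic")
    print(f"decode:  {decode_compute_ms:.1f}ms compute vs {decode_memory_ms:.1f}ms memory per token")

With those assumptions an 8K-token prompt costs ~9s of arithmetic versus ~0.01s of weight traffic, while each generated token costs ~1ms of arithmetic versus ~11ms of weight traffic, which is exactly the compute-bound / bandwidth-bound split described above.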
At 17B active params MoE should be much faster than monolithic 70B, right?
Imagine
At 4-bit quant (which requires ~64GB) the price of a Mac ($4.2K) is almost exactly the same as 2x 5090s (provided we ever see them in stock). But the 2x 5090s have 6x the memory bandwidth and probably close to 50x the matmul compute at int4.
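Quick arithmetic behind that ~64GB figure (my own numbers; the ~4.5 bits/param is an assumption for a typical 4-bit quant format, which carries per-block scales and so lands a bit above 4 bits):

    # Rough check of the ~64GB figure for a 109B-parameter model at 4-bit.
    params = 109e9
    bits_per_param = 4.5      # assumed effective width for a 4-bit quant format
    weights_gb = params * bits_per_param / 8 / 1e9
    print(f"weights alone: ~{weights_gb:.0f} GB")
    # ~61 GB; KV cache and runtime buffers push the practical floor to ~64GB.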
$2.8k-3.6k for a 64GB-128GB Mac Studio (M3 Max).
Maybe I'm missing something, but I don't think I've ever seen quants lower memory requirements. I assumed that was because they still have to be unpacked for inference. (Please do correct me if I'm wrong; I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU.)
Quantizing definitely lowers memory requirements. It's a pretty direct effect: you're straight up using fewer bits per parameter across the board, so the representation of the weights in memory is smaller, at the cost of precision.
Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.
Quantization by definition lowers memory requirements: instead of using f16 for the weights, you use q8, q6, q4, or q2, which makes the weights smaller by 2x, ~2.7x, 4x, or 8x respectively.
That doesn't necessarily translate into the full memory reduction because of intermediate activation tensors and the KV cache, but those can also be quantized.
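For illustration, those reduction factors applied to the 109B model discussed above (weights only; KV cache and activations come on top, as noted):

    # Weight-only memory for a 109B-parameter model at different bit widths.
    params = 109e9
    for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
        gb = params * bits / 8 / 1e9
        print(f"{name:>3}: {gb:6.1f} GB  ({16 / bits:.1f}x smaller than f16)")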
Nvidia GPUs can natively operate in FP8, FP6, FP4, etc so naturally they have reduced memory requirements when running quantized.
As for CPUs, Intel can only go down to FP16, so you’ll be doing some “unpacking”. But hopefully that is “on the fly” and not when you load the model into memory?
I just loaded two models of different quants into LM Studio:
qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory
qwen 2.5 coder 1.5b @ q8: 1.83 GB memory
I always assumed this to be the case (also because of the smaller download sizes) but never really thought about it.
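Out of curiosity, the effective bits per parameter implied by those reported sizes (my arithmetic, not LM Studio's; it lands above the nominal quant width because the resident figure also includes the KV cache, embeddings and runtime buffers):

    # Effective bits per parameter implied by the reported memory use.
    params = 1.5e9
    for label, gb in [("q4_k_m", 1.21), ("q8", 1.83)]:
        bits = gb * 1e9 * 8 / params
        print(f"{label}: ~{bits:.1f} bits/param resident in memory")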
No need to unpack for inference. Since things like CUDA kernels are fully programmable, you can write them to work with 4-bit integers directly, no problem at all.
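A toy sketch of that idea in Python rather than an actual CUDA kernel: the weights stay packed two nibbles per byte in memory and are only sign-extended inside the inner loop, never expanded into a full f16 copy of the model.

    def pack_int4(ws):
        """Pack signed 4-bit weights (-8..7), two per byte."""
        out = bytearray()
        for lo, hi in zip(ws[0::2], ws[1::2]):
            out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
        return bytes(out)

    def dot_int4(packed, x, scale):
        """Dot product that unpacks nibbles on the fly, like a quantized matmul kernel."""
        acc = 0.0
        for i, byte in enumerate(packed):
            for j, nib in enumerate((byte & 0x0F, byte >> 4)):
                w = nib - 16 if nib >= 8 else nib   # sign-extend the 4-bit value
                acc += w * x[2 * i + j]
        return acc * scale                          # per-block scale restores magnitude

    weights = [3, -2, 7, -8, 1, 0, -5, 4]
    x = [float(i) for i in range(8)]
    packed = pack_int4(weights)
    print(dot_int4(packed, x, scale=0.1))           # matches 0.1 * dot(weights, x) = -1.0

A real kernel does the same unpacking in registers, so the packed weights are the only copy that ever sits in memory.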