Comment by rhdunn
1 day ago
It's all relative. For local use I'd classify it by hardware (VRAM size) using FP8 or Q6 quantization:
1. tiny <2-3B -- easily runnable on lower-spec hardware
2. small 4-8B -- runnable on 8GB GPUs
3. medium 9-12B -- runnable on 12GB GPUs
4. large 13-24B -- runnable on 16GB (for the lower end models) and 24GB GPUs
5. very large 25-32B -- runnable on 32GB GPUs
6. huge >32B -- not easily runnable on consumer GPUs without compromising performance (offloading layers to the CPU/RAM), quality (heavy quantization, esp. at <= Q4), or price (investing in multi-GPU setups and/or server-grade hardware).
You could possibly split huge down further, as 70B models (e.g. Llama 3) are easier to get working than >120B models, and 1T models are completely intractable.
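For anyone who wants to sanity-check these tiers, here is a rough back-of-the-envelope sketch of the arithmetic behind them: weight memory is roughly parameter count times bits per weight, divided by 8. The function names, the ~6.5 bits per weight figure for Q6, and the 1.5 GB allowance for KV cache and runtime overhead are my own assumptions for illustration, not measured values; real usage depends on context length and the runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    # 1B parameters at 8 bits per weight is about 1 GB.
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """Rough check: weights plus an assumed fixed overhead must fit in VRAM.

    overhead_gb is a placeholder for KV cache, activations and runtime
    buffers; it is an assumption and grows with context length in practice.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    FP8_BITS = 8.0
    Q6_BITS = 6.5  # assumed approximate bits per weight for a Q6 quant

    for size_b, vram in [(8, 8), (12, 12), (24, 24), (32, 32), (70, 32)]:
        print(f"{size_b}B on {vram}GB: "
              f"FP8={fits(size_b, FP8_BITS, vram)}, Q6={fits(size_b, Q6_BITS, vram)}")
```

Under these assumptions the top of each tier only just fits at Q6 and does not fit at FP8 once overhead is counted, which is why the boundaries are fuzzy.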
As a Mac user:
1. tiny <2-3B -- could run in a browser even, mac neo
2. small 4-8B -- last of browser options, MacBook Air base
3. medium 9-24B -- 32GB machine, air or pro notebook or mini
4. large 25-48B -- 64GB, pro notebook or mini
5. x-large 49-100B -- 128GB MacBook Pro or Studio
6. huge >100B -- 256/512GB Mac Studio
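If you want the Mac tiers above as a lookup, a minimal sketch could look like this. The list leaves gaps between ranges (e.g. between 3B and 4B), so I am assuming each tier's upper bound is inclusive and that anything in a gap rounds up to the next tier; `MAC_TIERS` and `mac_tier` are hypothetical names.

```python
from bisect import bisect_left

# (inclusive upper bound in billions of parameters, tier name),
# taken from the list above; gap handling is an assumption.
MAC_TIERS = [
    (3, "tiny"),
    (8, "small"),
    (24, "medium"),
    (48, "large"),
    (100, "x-large"),
]


def mac_tier(params_billion: float) -> str:
    """Return the tier name for a model with the given parameter count."""
    bounds = [bound for bound, _ in MAC_TIERS]
    # bisect_left finds the first tier whose upper bound is >= the size.
    i = bisect_left(bounds, params_billion)
    return MAC_TIERS[i][1] if i < len(MAC_TIERS) else "huge"


if __name__ == "__main__":
    for size in (2, 8, 9, 70, 120):
        print(f"{size}B -> {mac_tier(size)}")
```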
> tiny <2-3B -- could run in a browser even, mac neo
Or a phone. I’m running Gemma 4 E2B in one of my apps on my 14 Pro (which may or may not be killing my display through overheating; it might just be a coincidence).