Comment by rhdunn

1 day ago

It's all relative. For local use I'd classify it by hardware (VRAM size) using FP8 or Q6 quantization:

1. tiny <2-3B -- easily runnable on lower-spec hardware

2. small 4-8B -- runnable on 8GB GPUs

3. medium 9-12B -- runnable on 12GB GPUs

4. large 13-24B -- runnable on 16GB (for the lower end models) and 24GB GPUs

5. very large 25-32B -- runnable on 32GB GPUs

6. huge >32B -- not easily runnable on consumer GPUs without compromising performance (offloading layers to the CPU/RAM), quality (heavy quantization, esp. at <= Q4), or price (investing in multi-GPU setups and/or server-grade hardware).

You could possibly split huge further, as 70B models (e.g. Llama 3) are easier to get working than >120B models, and 1T models are completely intractable.
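For anyone who wants to sanity-check these tiers, here is a rough sketch of the arithmetic behind them: weight memory is roughly parameter count times bits per weight. The function names, the 1GB overhead allowance (KV cache, activations, runtime), and the figure of about 6.5 bits per weight for Q6 are my own assumptions for illustration, not measured values.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float,
         bits_per_weight: float = 8.0, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given VRAM.

    overhead_gb is a hypothetical allowance; real usage depends on
    context length, batch size, and the inference runtime.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Top of each tier against the matching GPU size, at FP8 and at an assumed ~6.5-bit Q6.
    for params, vram in [(8, 8), (12, 12), (24, 24), (32, 32), (70, 32)]:
        for label, bits in [("FP8", 8.0), ("Q6", 6.5)]:
            need = weight_memory_gb(params, bits)
            print(f"{params}B {label}: ~{need:.1f}GB weights, "
                  f"fits in {vram}GB: {fits(params, vram, bits)}")
```

With these assumed numbers, an 8B model at Q6 needs about 7.5GB including overhead, so it squeezes onto an 8GB card, while the same model at FP8 needs about 9GB and does not. The top of each tier is a tight fit, which is why the bounds are only approximate.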

sroussey 1 day ago

As a Mac user:

1. tiny <2-3B -- could run in a browser even, mac neo

2. small 4-8B -- last of browser options, MacBook Air base

3. medium 9-24B -- 32GB machine, air or pro notebook or mini

4. large 25-48B -- 64GB, pro notebook or mini

5. x-large 49-100B -- 128GB MacBook Pro or Studio

6. Huge > 100B -- 256/512GB Mac Studio
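If you want to apply the tiers above programmatically, a minimal lookup sketch follows. The function name `mac_tier` is hypothetical, and reading "<2-3B" as "up to 3B" is my assumption; sizes that fall in the gaps between tiers (e.g. 3.5B) are pushed into the next tier up.

```python
from bisect import bisect_left

# Upper bounds in billions of parameters, taken from the list above.
# Treating "tiny" as "up to 3B" is an assumption.
_UPPER_BOUNDS = [3, 8, 24, 48, 100]
_TIERS = ["tiny", "small", "medium", "large", "x-large", "huge"]


def mac_tier(params_billion: float) -> str:
    """Return the tier label for a model of the given size (in billions of parameters)."""
    # bisect_left finds the first bound >= params_billion, so a size equal
    # to a bound stays in the lower tier; anything above 100 maps to "huge".
    return _TIERS[bisect_left(_UPPER_BOUNDS, params_billion)]


if __name__ == "__main__":
    for size in (2, 8, 12, 32, 70, 120):
        print(f"{size}B -> {mac_tier(size)}")
```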

ElFitz 1 day ago

> tiny <2-3B -- could run in a browser even, mac neo

Or a phone. I’m running Gemma 4 E2B in one of my apps on my 14 Pro (which may or may not be killing my display through overheating; it might just be a coincidence).