Comment by duffyjp

13 hours ago

Nothing. This summer I set up a dual 16GB GPU / 64GB RAM system and nothing I could run was even remotely close. Big models that didn't fit in 32 GB of VRAM had marginally better results, but they were at least an order of magnitude slower than the hosted models you'd pay for, and still much worse in quality.

I gave one of the GPUs to my kid to play games on.

Yup, even with 2x 24 GB GPUs, it's impossible to get anywhere close to the big models in quality or speed, even if the hardware is a fraction of the cost.

  • I'm running unsloth/GLM-4.7-Flash-GGUF:UD-Q8_K_XL via llama.cpp on 2x 24 GB 4090s, which fits perfectly with 198k context at 120 tokens/s – the model itself is really good. (A rough loading sketch follows after this thread.)

    • I can confirm: running glm-4.7-flash-7e-qx54g-hi-mlx here, a 22 GB model at Q5, on an M4 Max MacBook Pro at 59 tokens/s. (An MLX sketch follows below as well.)
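
Not the commenter's exact invocation, but a minimal Python sketch of loading a large GGUF quant split across two GPUs with the llama-cpp-python bindings; the local file name, the ~198k context value, and the 50/50 tensor split are assumptions based on the numbers quoted above.

    # Sketch: split a large GGUF quant across two CUDA GPUs with llama-cpp-python.
    # Assumes a CUDA build of llama.cpp and a locally downloaded
    # unsloth/GLM-4.7-Flash-GGUF UD-Q8_K_XL file; the path and the 50/50 split
    # are illustrative, not the commenter's exact settings.
    from llama_cpp import Llama

    llm = Llama(
        model_path="GLM-4.7-Flash-UD-Q8_K_XL.gguf",  # hypothetical local filename
        n_ctx=198 * 1024,         # roughly the 198k context reported above
        n_gpu_layers=-1,          # offload every layer to the GPUs
        tensor_split=[0.5, 0.5],  # spread the weights evenly over the two 24 GB cards
    )

    out = llm("Explain KV-cache memory use in one paragraph.", max_tokens=128)
    print(out["choices"][0]["text"])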
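
Likewise for the Mac reply, a hedged sketch of how an MLX-quantized model is typically run with the mlx-lm package; the repo id below is illustrative, not necessarily the exact quant used in the comment above.

    # Sketch: run an MLX-quantized model on Apple Silicon with mlx-lm
    # (pip install mlx-lm). The repo id is a placeholder; point it at the
    # quant you actually have downloaded.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/GLM-4.7-Flash-q5")  # hypothetical repo id
    text = generate(
        model,
        tokenizer,
        prompt="Summarize the trade-offs of running quantized models locally.",
        max_tokens=128,
        verbose=True,  # prints tokens/s alongside the generated text
    )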