
Comment by Tostino

11 hours ago

Yup, even with 2x 24 GB GPUs, it's impossible to get anywhere close to the big models' quality and speed for a fraction of the cost.

I'm running unsloth/GLM-4.7-Flash-GGUF:UD-Q8_K_XL via llama.cpp on 2x 24 GB 4090s; it fits perfectly with 198k context at 120 tokens/s, and the model itself is really good.
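
A llama-server launch along those lines would look roughly like the sketch below; the flag values (context size, full offload, the -hf quant tag) are guesses pieced together from the numbers above, not the actual command used:

```
# Illustrative llama-server invocation approximating the setup described above.
# -hf pulls the GGUF from Hugging Face by repo:quant tag (values here are assumptions).
# -c 198000 matches the ~198k context reported; -ngl 99 offloads all layers, and
# llama.cpp's default layer split spreads them across both GPUs.
llama-server \
  -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q8_K_XL \
  -c 198000 \
  -ngl 99 \
  --host 127.0.0.1 --port 8080
```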

  • I can confirm: running glm-4.7-flash-7e-qx54g-hi-mlx here, a 22 GB model at Q5 on an M4 Max MacBook Pro, at 59 tokens/s.
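
On the MLX side, the rough equivalent is the mlx-lm CLI; the model id below is copied verbatim from the reply and would normally be a full Hugging Face repo path or local directory, so treat this as a sketch rather than a working command:

```
# Illustrative mlx-lm invocation for the Mac setup above; the model id is taken
# as-is from the comment and the prompt/token values are placeholders.
mlx_lm.generate \
  --model glm-4.7-flash-7e-qx54g-hi-mlx \
  --prompt "Hello" \
  --max-tokens 256
```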