Comment by walrus01

17 hours ago

I personally find any model smaller than something like Qwen 3.6 35B-A3B (8-bit quantization, about 49GB memory usage when loaded into llama.cpp) to be too "stupid" for reliable use.

I would much rather not run the model on my local laptop hardware and offload that to some system sitting under my desk in my home office, accessible via VPN, than take the risk of using an unreliable and flaky tool for the convenience of having it on the same hardware on my lap.

I pay very little attention to 8 billion or whatever (or even much smaller) models these days and I don't feel like I'm missing much.

13 comments

walrus01

satvikpendem 16 hours ago

Qwen 3.6 27B dense is much better than the 35B MoE model for coding, not sure if you've tried that yet.

sheeshkebab 3 hours ago

27b is slow as molasses vs 35b on local stuff I have (m5 max). Mtp doesn’t make any difference either.
walrus01 16 hours ago
yes, I have, I use both. 27B slower in tok/s due to density, obviously, 35B-A3B for speed on simpler tasks.
- intothemild 9 hours ago
  
  You should enable MTP now that its available.
  LLamaCPP has had some massive updates in the last week or so.
  
  3 replies →

theanonymousone 12 hours ago

Have you seen the 8bit quantisation matter a lot? The "consensus" in r/LocalLlama is that up to 4 bits the loss is tolerable.

walrus01 12 hours ago
Absolutely. Difference in Q6 vs Q8 is not as immediately noticeable, but if I test by starting from a blank slate context and giving it the same complicated task with Q4 vs a Q8 GGUF file loaded, the difference is apparent. The Q4 will struggle or do 'stupid' things with even simple bash or python. Q4 might not be as noticeable for conversational purely text one on one interaction with an LLM, but when you dig deeper into something that's more esoteric in a training dataset than a chat conversation, absolutely a big gap there.
I think some of the folks in the local llm social media communities are using them for things like company-hosted customer service chat bots, or purely english text writing stuff where Q4 will probably not cause a problem. For more discrete technical work I stick pretty much exclusively to Q8.
- theanonymousone 9 hours ago
  
  Thanks a lot. How about Q8 vs FP16/BF16? Have you checked them too?
  
  1 reply →
alfiedotwtf 1 hour ago

It’s not a general rule, and depends highly on the model and the quantisation used. Don’t guess, Unsloth sometimes publish graphs in their tutorials showing the error rate vs file size… sometimes Q4 is great, other times I go for Q6

thot_experiment 16 hours ago

q6 is fine for that qwen with ctx @ q8, and the dense models of that size are solid at q4 with q8 ctx