Comment by mappu
20 hours ago
I'm running Qwen3.6-35B-A3B on a very ordinary desktop PC (32GB DDR5, 8GB Radeon 6600XT) and getting a useful 15-20 tok/sec out of it. The MoE architecture and auto offloading from system RAM to VRAM are just fantastic. Unsloth Q4_K_XL.
The Qwen3.6-27B is unbearably slow, though, as it doesn't fit in VRAM. I think the MoE is very easy to run.
It is also extremely nice that you can just `apt install llama.cpp libggml0-backend-vulkan` now too.
I wonder what the parent poster means by "useful" and what he actually tried? Feels like he was just comparing some benchmarks.
Yesterday I downloaded Gemma4-26B with Ollama on a quite rusty desktop with a 1070 8GB, 32GB of RAM, and a Core i5-9400.
I dropped in a photo of my water meter and told it to read the value and serial number. It was far from instant, but it was easily under 3 minutes and the result was correct.
Earlier, around February, I tried the same photo with Gemma3 on the same hardware and the results were bad.
> I dropped in a photo of my water meter and told it to read the value and serial number. It was far from instant, but it was easily under 3 minutes and the result was correct.
"Useful" as in "has a use that isn't just for show". It takes me two seconds to read a photo of a water meter. Having an LLM read it for me in 3 minutes isn't useful. Similarly, small models are capable of tool use (e.g. web searches), but their synthesis leaves much to be desired. For example, I've asked some small models to find examples of products with specific characteristics, and they came back with only one or two because they incorrectly ruled out other possibilities by reasoning themselves out of them.
> Feels like he was just comparing some benchmarks.
On what do you base this assertion?