Comment by onlyrealcuzzo

19 hours ago

I just tested this on a bug fixing benchmark I'm working on.

It did not perform as well as I expected. Qwen2.5-Coder-3B (2 years old) outperformed it by a wide margin, fixing ~50% of bugs, whereas this model fixed only ~12%.

Granted, it's not a coder-specific model, but given its benchmark performance relative to Gemma models, that it's two years newer, and that it's an MoE with 8B total params, I expected it to be more competitive.

I personally find any model smaller than something like Qwen 3.6 35B-A3B (8-bit quantization, about 49GB memory usage when loaded into llama.cpp) to be too "stupid" for reliable use.
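For a rough sense of where a figure like that comes from, here is a back-of-the-envelope sketch. It assumes llama.cpp's Q8_0 and Q4_0 block layouts (32 weights per block plus one 16-bit scale, so about 8.5 and 4.5 bits per weight) and counts weights only. KV cache, context buffers, and runtime overhead come on top, which is presumably where the rest of the ~49GB goes. The function name and the flat 35B parameter count are illustrative, not taken from any tool.

```python
# Rough weight-only memory estimate for a quantized model.
# Assumption: llama.cpp-style block formats with 32 weights per block
# and one fp16 scale per block (Q8_0: 34 bytes/block, Q4_0: 18 bytes/block).
BLOCK_BYTES = {"Q8_0": 34, "Q4_0": 18}
WEIGHTS_PER_BLOCK = 32


def weight_memory_gb(n_params: float, quant: str) -> float:
    """Return approximate weight storage in GB (10^9 bytes)."""
    bytes_per_weight = BLOCK_BYTES[quant] / WEIGHTS_PER_BLOCK
    return n_params * bytes_per_weight / 1e9


if __name__ == "__main__":
    for quant in ("Q8_0", "Q4_0"):
        print(f"{quant}: ~{weight_memory_gb(35e9, quant):.1f} GB of weights")
```

Under these assumptions, 35B parameters come to roughly 37 GB of weights at Q8_0 and about 20 GB at Q4_0, before any runtime overhead.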

I would much rather skip running the model on my laptop and offload it to a system sitting under my desk in my home office, accessible via VPN, than risk using an unreliable, flaky tool for the convenience of having it on the hardware in my lap.

I pay very little attention to models of 8 billion parameters or so (or much smaller) these days, and I don't feel like I'm missing much.

  • Have you seen the 8-bit quantisation matter a lot? The "consensus" in r/LocalLLaMA is that down to 4 bits the loss is tolerable.

    • It’s not a general rule, and it depends highly on the model and the quantisation used. Don’t guess: Unsloth sometimes publishes graphs in their tutorials showing the error rate vs. file size. Sometimes Q4 is great; other times I go for Q6.

    • Absolutely. The difference between Q6 and Q8 is not as immediately noticeable, but if I test by starting from a blank-slate context and giving it the same complicated task with a Q4 vs. a Q8 GGUF file loaded, the difference is apparent. The Q4 will struggle or do 'stupid' things with even simple bash or Python. Q4 might not be as noticeable for purely text, one-on-one conversation with an LLM, but when you dig into something that's more esoteric in a training dataset than a chat conversation, there's absolutely a big gap.

      I think some of the folks in the local LLM social media communities are using them for things like company-hosted customer service chatbots, or purely English text writing where Q4 will probably not cause a problem. For more discrete technical work I stick pretty much exclusively to Q8.
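The side-by-side test described above (same task, fresh context, different quantisation) could be sketched roughly as follows. This is only an illustration: `compare_quants`, the `Generate` callable, and the pass check are hypothetical names, and the stub generators stand in for real model calls, so the printed numbers say nothing about actual Q4 or Q8 behaviour.

```python
from typing import Callable, Dict, List

# Hypothetical signature: takes a prompt, returns the model's text output.
# In practice this would wrap whatever runtime serves each GGUF file.
Generate = Callable[[str], str]


def compare_quants(
    models: Dict[str, Generate],
    tasks: List[str],
    passes: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run every task once per model, each from a fresh prompt, and return pass rates."""
    if not tasks:
        raise ValueError("need at least one task")
    rates = {}
    for name, generate in models.items():
        # Each call sees only the task text, i.e. a blank-slate context.
        solved = sum(passes(task, generate(task)) for task in tasks)
        rates[name] = solved / len(tasks)
    return rates


if __name__ == "__main__":
    # Stub generators standing in for real Q4 and Q8 model calls.
    stub_models = {
        "Q4": lambda prompt: "echo hello",
        "Q8": lambda prompt: "#!/bin/bash\necho hello",
    }
    tasks = ["Write a bash script that prints hello"]
    # Toy check: did the output start with a shebang line?
    check = lambda task, output: output.startswith("#!")
    print(compare_quants(stub_models, tasks, check))
```

Keeping the generator and the pass check as plain callables means the same loop works whichever runtime or scoring method is plugged in.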


That's not all that surprising, IMO. From what I understand, LiquidAI is focusing pretty narrowly on building models that operate as the "agentic core" of a larger system.

If I were going to use this model, I'd be looking to use it more as the primary chat interface of a larger system, having it orchestrate and delegate tasks to other places via tool calls. It's not quite as exciting on the surface as a local "do it all" model, but it does enable some pretty neat use cases, IMO.

I'm imagining a local agent that is super low latency, works entirely offline, and is capable of queuing up complex tasks for larger/smarter cloud agents, which execute them asynchronously.
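A minimal sketch of that shape, assuming nothing about LiquidAI's actual tooling: a local handler answers simple requests right away and pushes heavier ones onto a queue that a background worker drains. `needs_cloud`, `handle`, and `cloud_worker` are hypothetical names, the word-count rule is a placeholder for the model's own tool-call decision, and the sleep stands in for a remote agent.

```python
import asyncio
from typing import List, Tuple


# Hypothetical rule standing in for the local model's routing decision
# (in a real system this would be a tool call emitted by the model).
def needs_cloud(request: str) -> bool:
    return len(request.split()) > 8


async def cloud_worker(queue: asyncio.Queue, results: List[str]) -> None:
    """Drain queued tasks; the sleep stands in for a slow remote agent."""
    while True:
        task = await queue.get()
        await asyncio.sleep(0.01)
        results.append(f"cloud finished: {task}")
        queue.task_done()


async def handle(request: str, queue: asyncio.Queue) -> str:
    """Reply immediately, delegating heavy work to the queue."""
    if needs_cloud(request):
        await queue.put(request)
        return "queued for a larger agent"
    return f"local answer to: {request}"


async def demo() -> Tuple[List[str], List[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    results: List[str] = []
    worker = asyncio.create_task(cloud_worker(queue, results))
    replies = [
        await handle("what time is it", queue),
        await handle(
            "refactor the billing module and add tests for every public function",
            queue,
        ),
    ]
    await queue.join()  # wait until the delegated work has been processed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return replies, results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the queue is that the local reply never waits on the slow path: the user gets an immediate acknowledgement and the delegated result arrives later.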

I tried it with OpenCode and it is borderline incapable of using tool calls, so that might be why it is doing so badly on your test.

I will test it when it's accessible via OpenRouter, but the previous LFM2 model (lfm-2-24b-a2b) didn't do well on my tests: it got only 1/20 questions/tasks right, way below Gemma 31B or Qwen 35b-a3b (those get around 10/20 right).

  • I tested it against Gemma 4 31B and, as expected, it doesn't compare favorably on world knowledge.

    But even against E4B it's shaky, which is surprising given how many tokens they trained on. I guess it was trained on a lot of synthetic data.