Comment by kouteiheika
6 days ago
Llama 4 isn't that bad, but it was overhyped, and people generally "hold it wrong".
I recently needed an LLM to batch-process some queries for me. I ran an evaluation of 20+ models from OpenRouter to find the best one. Guess which ones got 100% accuracy? GPT-5-mini, Grok-4.1-fast and... Llama 4 Scout. For comparison, DeepSeek v3.2 got 90%, and the community darling GLM-4.5-Air got 50%. Even the newest GLM-4.7 only got 70%.
Of course, this is a single anecdotal data point, but it suggests that Llama 4 is probably underrated.
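For anyone curious what a run like this can look like mechanically, here's a minimal sketch against OpenRouter's OpenAI-compatible endpoint. The model IDs, test cases, and substring scoring below are illustrative placeholders, not my actual setup:

```python
# Minimal sketch of a batched model comparison via OpenRouter's
# OpenAI-compatible API. Requires the `openai` package and an
# OPENROUTER_API_KEY environment variable. Model IDs and test
# cases are placeholders -- check the OpenRouter catalog for
# the exact identifiers.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

MODELS = [
    "openai/gpt-5-mini",
    "meta-llama/llama-4-scout",
    "deepseek/deepseek-v3.2",  # placeholder ID
]

# (prompt, expected substring) pairs stand in for the real queries.
CASES = [
    ("What is 17 * 23? Reply with the number only.", "391"),
    ("What is the capital of Australia? One word.", "Canberra"),
]

for model in MODELS:
    correct = 0
    for prompt, expected in CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content or ""
        correct += expected.lower() in answer.lower()
    print(f"{model}: {correct}/{len(CASES)} correct")
```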
Oh, this is very interesting. I'll have to test it out on coding too. Very good point about testing: had I only followed benchmarks, I'd have missed a few gems completely (long-context models and 4B vision models that are unbelievably capable for their size). I'd encourage everyone to test models on the actual problems they're working on.
The Llama 4 models were instruct models at a time when everyone was hyped about and expecting reasoning models. As instruct models, I agree they seemed fine, and I think Meta mostly dropped the ball by taking the negative community feedback as a signal that they should just give up. They've had plenty of time to train and release a Llama-4.5 by now, which could include reasoning variants and even stronger instruct models, and I think the community would have come around. Instead, it sounds like they're focusing on closed-source models that seem destined for obscurity, whereas Llama was at least widely known.
On the flip side, it also shows how damaging echo chambers can be, where relatively few people even gave the models a chance, just repeating the negativity they heard from other people and downvoting anyone who voiced a different experience.
I think this was exacerbated by the fact that Llama models had previously come in small, dense sizes like 8B that people could run on modest hardware, whereas even Llama 4 Scout was a large model that a lot of people in the community weren't prepared to run. Large models seem more socially accepted now than they were when Llama 4 launched.
Large MoE models are more socially accepted because medium/large MoE models can still be quite small with respect to their active expert size (which is what sets the amount of VRAM you actually need), whereas a large dense model is still challenging to run.
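To put rough numbers on that, here's a back-of-the-envelope sketch using Llama 4 Scout's published sizes (~109B total parameters, 17B active per token) and assuming 4-bit quantized weights:

```python
# Back-of-the-envelope weight-memory math for MoE vs. dense,
# using Llama 4 Scout's published sizes (109B total, 17B active)
# and assuming 4-bit quantized weights.
def weight_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate weight storage in GB at the given bit width."""
    return params_billion * 1e9 * bits / 8 / 1e9

total_b, active_b = 109, 17

print(f"all weights:            ~{weight_gb(total_b):.1f} GB")   # ~54.5 GB
print(f"active weights / token: ~{weight_gb(active_b):.1f} GB")  # ~8.5 GB

# A dense 109B model would need all ~54.5 GB resident in fast memory,
# while the MoE only reads ~8.5 GB of weights per token, so the rest
# of the experts can sit in slower CPU RAM at a usable speed.
```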
I meant large MoE models are more socially accepted now. They were not when Llama 4 launched, and I believe that worked against the Llama 4 models.
The Llama 4 models are MoE models, in case you were unaware; your comment seems to imply they were dense models.