Comment by syntaxing
17 hours ago
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model, depending on how you offload between VRAM and CPU. That’s insane.
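For anyone wondering what "depending on how you offload" means in practice: with llama.cpp-style runners you choose how many transformer layers live in VRAM, and the rest run on the CPU from system RAM. A minimal sketch using llama-cpp-python (the model filename and layer count are illustrative, not real values; tune them to your hardware):

    from llama_cpp import Llama

    # Keep 20 transformer layers in VRAM; the remaining layers run
    # on the CPU out of system RAM. Raise n_gpu_layers until VRAM fills.
    llm = Llama(
        model_path="qwen3-moe-q4_k_m.gguf",  # hypothetical local GGUF file
        n_gpu_layers=20,
        n_ctx=8192,
    )

    out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

Because only a few billion parameters are active per token in an MoE, the CPU-resident portion hurts decode speed far less than it would for a dense model of the same total size.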
Qwen isn't directing the forward progress of LLMs. SOTA LLMs have been MoE since GPT-4. The OG 4.
Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.
I would venture that reading it as "Qwen made MoEs, full stop / made them first / made them better than anyone else" is reductive. It's just that the expert counts and the ratios here are quite novel (70B total... inferencing only 3B!?!). I sometimes kick around the same take, but thought I'd stand up for this one. And I know what I'm talking about: I maintain a client that wraps llama.cpp across ~20 models on inference APIs.
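To put numbers on the "70B... inferencing only 3B" point, here's a back-of-envelope comparison of per-token decode compute, using the figures quoted in the thread purely as illustration:

    # Per-token decode compute scales with ACTIVE params (~2 FLOPs/param),
    # not total params. Numbers are the ones quoted above, for illustration.
    moe_total  = 70e9  # total parameters in the MoE
    moe_active = 3e9   # parameters actually routed through per token
    dense      = 14e9  # a mid-size dense model for comparison

    flops_moe   = 2 * moe_active
    flops_dense = 2 * dense

    print(f"MoE:   {flops_moe:.1e} FLOPs/token")
    print(f"dense: {flops_dense:.1e} FLOPs/token")
    print(f"dense needs ~{flops_dense / flops_moe:.1f}x the compute per token")

Weight reads per token scale roughly the same way, which is why an MoE with ~3B active parameters can decode faster than a 14B dense model even when much of it sits in slower system RAM.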
In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.
Strong disagree.
Llama 4's release in 2025 is (deservedly) panned, but Llama 3.1 405B does not deserve that slander.
https://artificialanalysis.ai/#frontier-language-model-intel...
Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405B was the very first open-source (open-weights) model to come close to the closed-source cutting edge. It was very, very close in performance to GPT-4o and Claude 3.5 Sonnet.
In essence, it was DeepSeek R1 before DeepSeek R1.
He is definitely talking about Llama 4.
It's not that clear. Yes, it underperforms in recent benchmarks and use cases (e.g. agentic stuff), but it is still one of the strongest open models in terms of "knowledge". Dense does have that advantage over MoE, even if it's extremely expensive to run inference on.
Check out this great exercise - https://open.substack.com/pub/outsidetext/p/how-does-a-blind...
Ok wow, that is incredibly interesting, what a test. I would've honestly expected just random noise (like if you gave this same task to a human, lol), but you can even see related models draw similar results. Maybe it is an indicator of overall knowledge, or of how consistent the world model is. It also might not correlate at all with non-geographical knowledge.