Comment by syntaxing
17 hours ago
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model, depending on how you offload between VRAM and CPU. That’s insane.
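For anyone wondering what "depending on how you offload" means in practice: with llama.cpp-style runners you choose how many transformer layers live in VRAM, and the rest run on the CPU from system RAM. A minimal sketch using llama-cpp-python (the model filename and layer count are illustrative, not real values; tune them to your hardware):

    from llama_cpp import Llama

    # Keep 20 transformer layers in VRAM; the remaining layers run
    # on the CPU out of system RAM. Raise n_gpu_layers until VRAM fills.
    llm = Llama(
        model_path="qwen3-moe-q4_k_m.gguf",  # hypothetical local GGUF file
        n_gpu_layers=20,
        n_ctx=8192,
    )

    out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

Because only a few billion parameters are active per token in an MoE, the CPU-resident portion hurts decode speed far less than it would for a dense model of the same total size.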
Qwen isn't directing the forward progress of LLMs. SOTA LLMs have been MoE since GPT-4. The OG 4.
Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.
I would venture that reading it as "Qwen made MoEs, full stop / made them first / made them better than anyone else" is reductive. It's just that the expert counts and the ratios here are quite novel (70B total... inferencing only 3B!?!). I sometimes kick around the same take, but thought I'd stand up for this one. And I know what I'm talking about: I maintain a client that wraps llama.cpp across ~20 models on inference APIs.
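To put numbers on the "70B... inferencing only 3B" point, here's a back-of-envelope comparison of per-token decode compute, using the figures quoted in the thread purely as illustration:

    # Per-token decode compute scales with ACTIVE params (~2 FLOPs/param),
    # not total params. Numbers are the ones quoted above, for illustration.
    moe_total  = 70e9  # total parameters in the MoE
    moe_active = 3e9   # parameters actually routed through per token
    dense      = 14e9  # a mid-size dense model for comparison

    flops_moe   = 2 * moe_active
    flops_dense = 2 * dense

    print(f"MoE:   {flops_moe:.1e} FLOPs/token")
    print(f"dense: {flops_dense:.1e} FLOPs/token")
    print(f"dense needs ~{flops_dense / flops_moe:.1f}x the compute per token")

Weight reads per token scale roughly the same way, which is why an MoE with ~3B active parameters can decode faster than a 14B dense model even when much of it sits in slower system RAM.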
In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.
Strong disagree.
Llama 4's release in 2025 is (deservedly) panned, but Llama 3.1 405B does not deserve that slander.
https://artificialanalysis.ai/#frontier-language-model-intel...
Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405B was the very first open-source (open-weights) model to come close to the closed-source cutting edge. It was very, very close in performance to GPT-4o and Claude 3.5 Sonnet.
In essence, it was DeepSeek R1 before DeepSeek R1.
He is definitely talking about Llama 4.
It's not that clear. Yes, it underperforms in recent benchmarks and use cases (e.g. agentic stuff), but it is still one of the strongest open models in terms of "knowledge". Dense does have that advantage over MoE, even if it's extremely expensive to run inference on.
Check out this great exercise - https://open.substack.com/pub/outsidetext/p/how-does-a-blind...
Ok wow, that is incredibly interesting, what a test. I would've honestly expected just random noise (like if you gave this same task to a human, lol), but you can even see related models draw similar results. Maybe it is an indicator of overall knowledge, or of how consistent the world model is. It also might not correlate at all with non-geographical knowledge.