Comment by flawn
14 days ago
A 10M context window at such a low cost, WHILE having one of the top LMArena scores, is really impressive.
The choice to use 128 experts is also unprecedented as far as I know, right? But it seems to have worked out pretty well.
I suppose the question is, are they also training a 288B x 128 expert (16T) model?
Llama 4 Colossus when?
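For a rough sense of where a ~16T figure could come from: in a MoE transformer only the expert FFNs are replicated, while attention and embedding weights are shared, so total size is shared + num_experts × per-expert size rather than active × experts. A back-of-envelope sketch, with a made-up shared/expert split (not Meta's config):

```python
# Back-of-envelope MoE sizing; the shared/expert split below is hypothetical.
def moe_active_b(shared_b, expert_b, top_k=1):
    """Params used per token (billions): shared trunk + top_k routed experts."""
    return shared_b + top_k * expert_b

def moe_total_b(shared_b, expert_b, num_experts):
    """Total checkpoint size (billions): shared trunk + one copy per expert."""
    return shared_b + num_experts * expert_b

# Hypothetical split that adds up to 288B active with top_k = 1:
shared_b, expert_b = 160.0, 128.0
print(moe_active_b(shared_b, expert_b))          # 288.0
print(moe_total_b(shared_b, expert_b, 128))      # 16544.0 -> ~16.5T
# For reference, Maverick is 17B active / 400B total with 128 experts,
# i.e. far below the naive 17B x 128 ≈ 2.2T.
```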
What does it mean to have 128 experts? I feel like it's more like 128 slightly dumb intelligences that average out to something expert-like.
Like, if you consulted 128 actual experts, you'd get something way better than any LLM output.
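That's not quite how a mixture-of-experts layer works, though: the experts aren't separate models that each answer and get averaged. A learned router picks a small subset of expert FFNs (often just one or two) per token, so the 128 experts are alternative weight sets specializing on different tokens, not a committee. A minimal toy sketch (my own code, not Meta's):

```python
# Toy MoE layer with top-k routing; shapes and sizes are illustrative only.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=128, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, num_experts)
        weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                     # only top_k experts run per token
            for e in idx[:, k].unique():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)                   # torch.Size([10, 64])
```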
Let's see how that 10M context holds up. The 128k pretrain is a good indicator that it's not a scam, but we've yet to see any numbers on this "iRoPE" architecture. At 17B active parameters, and with 800G fabrics hitting the market, I think it could work; I'm sure next year it'll be considered idiotic to keep the K/V cache in actual memory.
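On the K/V point, some rough arithmetic with a hypothetical 17B-active config (the layer and head counts below are guesses, not Scout's published numbers) shows why a full-resolution cache at 10M tokens is painful to hold in GPU memory:

```python
# KV cache size for one sequence in bf16/fp16; config values are hypothetical.
def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """Bytes for K and V across all layers, converted to GiB."""
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per / 2**30

# Hypothetical 17B-active config: 48 layers, 8 KV heads (GQA), head_dim 128
print(f"{kv_cache_gib(128_000, 48, 8, 128):.1f} GiB at 128k")      # ~23.4 GiB
print(f"{kv_cache_gib(10_000_000, 48, 8, 128):,.0f} GiB at 10M")   # ~1,831 GiB (~1.8 TiB)
```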