Comment by tucnak
14 days ago
Let's see how that 10M context holds up. A 128k pretraining context is a good indicator that it's not a scam, but we have yet to see any numbers on this "iRoPE" architecture. At 17B active parameters, and with 800G fabrics hitting the market, I think it could work. I'm sure that by next year it will be considered idiotic to keep K/V in actual memory.
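
For a rough sense of scale, here is a back-of-envelope sketch of how big a K/V cache gets at 10M tokens and how long one pass over it would take on an 800 Gbit/s link. The layer count, KV head count, head dimension, and bf16 storage are hypothetical values picked for illustration, not published figures for any model. The sketch also assumes every layer caches every token and ignores any architecture-specific savings or protocol overhead.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for `tokens` positions."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def stream_seconds(num_bytes, link_gbit_per_s):
    """Seconds to move `num_bytes` once over a link at its raw line rate."""
    return num_bytes / (link_gbit_per_s * 1e9 / 8)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

cache = kv_cache_bytes(10_000_000, LAYERS, KV_HEADS, HEAD_DIM)
print(f"KV cache at 10M tokens: {cache / 1e12:.2f} TB")
print(f"One full pass over 800 Gbit/s: {stream_seconds(cache, 800):.1f} s")
```

Under these assumed numbers the cache is on the order of terabytes, so where it lives and how fast it can be fetched would matter a great deal. Different shapes or quantized K/V would shift the result considerably.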