Comment by miven
15 days ago
According to [0], it's partly due to a key architectural change: interleaving layers that use standard RoPE positional encodings with layers using what's called NoPE [1], which doesn't encode positions at all and lets the model figure them out on its own. This only works because the LLMs are autoregressive: the model can recognize an input token as the very first one by the fact that there are no earlier tokens to attend to, and can derive the positions of subsequent tokens recursively from that base case. A rough sketch of the interleaving is below.
[0] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
[1] https://arxiv.org/abs/2305.19466
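To make the interleaving idea concrete, here is a minimal NumPy sketch, not Meta's actual code: the layer count, head size, and the "every 4th layer is NoPE" ratio are illustrative assumptions rather than confirmed Llama 4 hyperparameters. RoPE layers rotate queries and keys by position-dependent angles; NoPE layers skip that rotation entirely, so only the causal mask carries any ordering information.

```python
# Sketch of interleaved RoPE / NoPE attention layers (illustrative, not Meta's code).
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate channel pairs by position-dependent angles (standard RoPE)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)           # per-channel frequencies
    angles = positions[:, None] * freqs[None, :]        # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def causal_attention(q, k, v):
    """Single-head attention with a causal mask (no attending to future tokens)."""
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((seq, seq), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_layer(x, use_rope):
    """One toy attention layer; q/k/v projections omitted for brevity."""
    positions = np.arange(x.shape[0], dtype=np.float64)
    q, k = (rope(x, positions), rope(x, positions)) if use_rope else (x, x)
    return x + causal_attention(q, k, x)                 # residual connection

seq_len, d_model, n_layers = 8, 16, 8                    # toy sizes
x = np.random.randn(seq_len, d_model)
for layer in range(n_layers):
    use_rope = (layer + 1) % 4 != 0                      # hypothetical: every 4th layer is NoPE
    x = attention_layer(x, use_rope)
```

In the NoPE layers the first token is still distinguishable: the causal mask leaves it nothing but itself to attend to, which is the base case the comment describes for the model recovering positions on its own.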