Comment by markisus
2 months ago
That's true in the realm of LLMs. But even in that case, the position information is added only at the input to the first layer; tokens in later layers can choose to "forget" it. In addition, there are applications of transformers in other domains. See https://github.com/cvg/LightGlue or https://facebookresearch.github.io/3detr/
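A minimal sketch of what I mean, assuming a learned absolute position embedding (the module and hyperparameter names here are illustrative, not from any particular codebase): the position signal is mixed in exactly once, before the first layer, so deeper layers only ever see the sum and are free to attenuate it.

```python
import torch
import torch.nn as nn

class AbsolutePositionTransformer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, max_len=2048,
                 n_layers=6, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned absolute positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Position information enters here, once, at the input. Every later
        # layer only sees this sum and can learn to "forget" the position part.
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        return self.layers(x)
```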
Transformers like Llama use rotary embeddings, which are applied in every single attention layer:
https://github.com/huggingface/transformers/blob/222505c7e4d...
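A rough sketch of the idea (simplified from the rotary-embedding math used in implementations like the one linked above; function names here are my own): the rotation is applied to the query/key vectors inside each attention layer, right before the attention scores are computed, so relative position information is re-injected at every layer rather than only at the input.

```python
import torch

def rotary_angles(seq_len, head_dim, base=10000.0):
    # One frequency per pair of channels, one angle per (position, frequency).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def rotate_half(x):
    # Swap the two halves of the channel dimension, negating one.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k: (batch, n_heads, seq_len, head_dim); cos/sin broadcast over batch/heads.
    cos = torch.cat((cos, cos), dim=-1)
    sin = torch.cat((sin, sin), dim=-1)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

# Inside each attention layer, before computing attention scores:
#   cos, sin = rotary_angles(seq_len, head_dim)
#   q, k = apply_rope(q, k, cos, sin)
```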
Very interesting! Do you know if there have been any studies on whether this improves performance?