Comment by make3

5 hours ago

I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally and so not worth mentioning.

3 comments

aesthesia 1 hour ago

Audio is one-dimensional, so the usual RoPE position encoding should handle it just like it does for text tokens. You only need extra position encoding for higher-dimensional stuff like images.
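
To make the 1-D point concrete, here is a rough sketch of rotary position encoding applied to a plain sequence index. It is only an illustration under my own assumptions: the helper names `rope` and `dot`, the toy vectors, and the base of 10000 are made up for the example, and nothing here is taken from the model being discussed. The idea it shows is that the attention score between two rotated vectors depends only on how far apart the positions are, which is all a 1-D stream like audio frames or text tokens needs.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values, just for the demo).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 frames apart) at two different absolute positions.
score_early = dot(rope(q, 5), rope(k, 2))
score_late = dot(rope(q, 105), rope(k, 102))

# The two scores agree up to floating-point error.
print(abs(score_early - score_late) < 1e-9)
```

Since only the offset matters, a single integer index per frame is enough; a 2-D input like image patches would need one such index per axis.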

neosat 5 hours ago

Agree. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.

mchinen 5 hours ago

Ah yeah, thinking further, it's probably just using some positional embedding based on sequence numbering, added in the LLM layers. For vision it needs the patch location as well.
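
Roughly what I have in mind, as a sketch only: this uses fixed sinusoidal embeddings as a stand-in, which is my assumption and not something the authors describe. The function names (`sinusoidal`, `audio_position`, `patch_position`, `add_position`) are hypothetical. Audio frames get an embedding from their sequence number alone, while an image patch needs both its row and its column.

```python
import math

def sinusoidal(pos, dim, base=10000.0):
    """Fixed sinusoidal embedding of a single integer position."""
    assert dim % 2 == 0, "need an even number of channels"
    emb = []
    for i in range(0, dim, 2):
        angle = pos / base ** (i / dim)
        emb += [math.sin(angle), math.cos(angle)]
    return emb

def audio_position(frame_index, dim):
    # 1-D case: the sequence number alone identifies the frame.
    return sinusoidal(frame_index, dim)

def patch_position(row, col, dim):
    # 2-D case: half the channels encode the row, half the column.
    assert dim % 4 == 0, "each half must have an even number of channels"
    half = dim // 2
    return sinusoidal(row, half) + sinusoidal(col, half)

def add_position(token_embedding, position_embedding):
    # Position information is simply added to the token embedding.
    return [t + p for t, p in zip(token_embedding, position_embedding)]

frame = add_position([1.0] * 8, audio_position(7, 8))
patch = add_position([1.0] * 8, patch_position(1, 2, 8))
```

Either way the cost is one small vector add per token, which fits the point above about it being too cheap to bother mentioning.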