Comment by mchinen

6 hours ago

The audio side is even more interesting, as it seems they totally got rid of positional embedding are just doing a single linear transform to match the LLM input dimension and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

I guarantee you there's positional information one way or another. they just don't mention it because positional embeddings are extremely cheap computationally, not worth mentioning

  • Audio is 1 dimensional so the usual RoPE position encoding should handle it like it does for text tokens. You only need extra position encoding for higher-dimensional stuff like images.

  • Agree. Audio has strongly temporal so there is almost certainly some positional encoding one way or another.

  • Ah yeah, thinking further it's probably just using some positioning embedding based on sequence numbering added in the LLM layers. For vision it needs the patch location as well.