Comment by yobbo
2 months ago
As I understand it, the token embedding stream would be equivalent to multi-channel sampled waveforms. The model would either need to learn the embeddings by back-propagating through the FFT and IFFT, or to use some suitable tokenization scheme, which the paper doesn't discuss (?).
It seems unlikely to work for language.
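For what it's worth, the first option is at least mechanically possible: FFT and IFFT are linear, differentiable ops, so gradients flow through them to an embedding table. A minimal sketch in JAX (the toy shapes, the fixed low-pass mask, and the MSE target are all my own illustration, not anything from the paper):

```python
import jax
import jax.numpy as jnp

vocab, seq_len, dim = 16, 8, 4
emb = jax.random.normal(jax.random.PRNGKey(0), (vocab, dim))   # learnable embedding table
tokens = jnp.array([1, 3, 5, 7, 2, 4, 6, 0])                   # one toy sequence
target = jax.random.normal(jax.random.PRNGKey(1), (seq_len, dim))

# Fixed low-pass mask applied in the frequency domain, so the
# FFT -> filter -> IFFT round trip is not a trivial identity.
mask = jnp.concatenate([jnp.ones(seq_len // 2), jnp.zeros(seq_len - seq_len // 2)])

def loss(emb):
    x = emb[tokens]                        # (seq_len, dim): dim "channels" of a length-8 signal
    spec = jnp.fft.fft(x, axis=0)          # FFT along the sequence axis
    y = jnp.fft.ifft(spec * mask[:, None], axis=0).real
    return jnp.mean((y - target) ** 2)

grads = jax.grad(loss)(emb)                # gradients flow through FFT and IFFT
print(grads.shape)                          # (16, 4); rows of unused tokens stay zero
```

So the open question isn't differentiability; it's whether embeddings trained this way end up useful for language, which is the part I'm skeptical about.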