Comment by cyanf
4 days ago
you use a neural audio codec to encode audio into codebooks
then you could treat the codebook entries as tokens and treat audio generation as a next token prediction task
you then take the codebook entries generated and run it through the codec’s decoder and yield audio
it works surprisingly well
speech text models (tts model with an llm as backbone) is the current meta
No comments yet
Contribute on Hacker News ↗