Comment by cyanf

4 days ago

you use a neural audio codec to encode audio into discrete codebook entries

then you can treat the codebook entries as tokens and treat audio generation as a next-token prediction task

you then take the generated codebook entries, run them through the codec’s decoder, and get audio back
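
a rough sketch of those three steps, in case it helps. everything here is a toy stand-in and not how any real system does it: `CODEBOOK` is a fixed 16-level scalar quantizer instead of a learned neural codec, and the bigram counter in `train_bigram` / `generate` stands in for the LLM. all the names are made up for illustration

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a neural audio codec: a fixed scalar codebook.
# (Assumption: real codecs learn their codebooks and encode whole frames.)
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]  # 16 levels in [-1, 1]

def encode(samples):
    """Step 1: map each sample to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
        for s in samples
    ]

def decode(tokens):
    """Step 3: map codebook indices back to sample values."""
    return [CODEBOOK[t] for t in tokens]

def train_bigram(tokens):
    """Count which token follows which (toy stand-in for training an LLM)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, n_new):
    """Step 2: greedy next-token prediction over codebook indices."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # token never had a successor in the training data
        out.append(followers.most_common(1)[0][0])
    return out

# A sine wave plays the role of the training audio.
waveform = [math.sin(2 * math.pi * t / 32) for t in range(256)]
tokens = encode(waveform)                 # audio -> tokens
model = train_bigram(tokens)              # "train" on the token sequence
generated = generate(model, tokens[:4], 20)  # predict new tokens
audio = decode(generated)                 # tokens -> audio samples
```

a bigram with greedy decoding is far too weak to produce convincing audio, it's only there to show the shape of the encode / predict / decode loop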

it works surprisingly well

speech-text models (tts models with an llm as the backbone) are the current meta