Comment by cyanf

8 months ago

you use a neural audio codec to encode audio into codebooks

then you could treat the codebook entries as tokens and treat audio generation as a next token prediction task

you then take the codebook entries generated and run it through the codec’s decoder and yield audio

it works surprisingly well

speech text models (tts model with an llm as backbone) is the current meta

0 comments

cyanf

No comments yet