
Comment by az226

3 days ago

How does one train a TTS model with an LLM backbone? Practically, how does this work?

you use a neural audio codec to encode audio into discrete codes (indices into the codec's codebooks)

then you treat those codebook entries as tokens and frame audio generation as a next-token prediction task

you then take the generated codebook entries, run them through the codec's decoder, and get audio out
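
to make the three steps concrete, here's a toy sketch in plain Python. everything in it is a stand-in and an assumption on my part: `CODEBOOK` is a hand-made 16-level scalar quantizer instead of a learned neural codec, and the "LLM" is a bigram count table instead of a transformer. a real TTS setup would also condition the model on the input text, which this leaves out. it only shows the shape of the pipeline: encode to tokens, predict next tokens, decode back to a waveform.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy codebook: 16 evenly spaced amplitude levels in [-1, 1].
# A real neural codec learns its codebooks; this is only a stand-in.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def encode(samples):
    """Step 1: map each audio sample to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - s))
        for s in samples
    ]


def decode(tokens):
    """Step 3: map token indices back to amplitudes (the 'codec decoder')."""
    return [CODEBOOK[t] for t in tokens]


def train_bigram(tokens):
    """Toy stand-in for the LLM: count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Step 2: sample tokens one at a time, each conditioned on the previous one."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:  # token never seen with a successor, so stop early
            break
        choices = list(followers)
        weights = [followers[c] for c in choices]
        out.append(rng.choices(choices, weights=weights)[0])
    return out


# "Training audio": a sine wave with a period of 32 samples.
audio = [math.sin(2 * math.pi * t / 32) for t in range(256)]
tokens = encode(audio)                        # audio -> discrete tokens
model = train_bigram(tokens)                  # fit the next-token model
generated = generate(model, tokens[0], 64)    # autoregressive generation
waveform = decode(generated)                  # tokens -> audio samples
```

the bigram table only sees one token of context, so the output is a noisy wobble rather than a clean sine. swapping it for a model with a long context window is where the LLM backbone comes in.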

it works surprisingly well

speech-text models (a tts model with an llm as the backbone) are the current meta