Comment by az226 3 days ago
How does one train a TTS model with an LLM backbone? Practically, how does this work?
cyanf 3 days ago
You use a neural audio codec to encode audio into discrete codebook entries.
Then you can treat the codebook entries as tokens and treat audio generation as a next-token prediction task.
You then take the generated codebook entries and run them through the codec's decoder to get audio.
It works surprisingly well.
Speech-text models (TTS models with an LLM as the backbone) are the current meta.
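To make the three steps above concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: `CODEBOOK`, `encode`, `decode`, `train_bigram`, and `generate` are hypothetical names, the "codec" is a fixed nearest-neighbour lookup rather than a learned neural codec, and a bigram counter stands in for the LLM. It also leaves out text conditioning; in a real TTS setup the text would typically be fed in as a prompt ahead of the audio tokens.

```python
from collections import Counter, defaultdict

# Hypothetical toy codec: each codebook entry stands for one 2-sample frame.
# A real neural codec learns its encoder, codebooks, and decoder from data.
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [-0.5, -0.5]]
FRAME = 2  # samples per frame (toy value)


def encode(audio):
    """Step 1: map each frame to the index of its nearest codebook entry."""
    tokens = []
    for i in range(0, len(audio) - FRAME + 1, FRAME):
        frame = audio[i:i + FRAME]
        dists = [
            sum((a - b) ** 2 for a, b in zip(frame, entry))
            for entry in CODEBOOK
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def train_bigram(tokens):
    """Step 2 (training): count next-token frequencies.

    This is only a stand-in for training an LLM on the token sequence.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, n):
    """Step 2 (inference): greedy next-token prediction for up to n steps."""
    out = [start]
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this token
        out.append(followers.most_common(1)[0][0])
    return out


def decode(tokens):
    """Step 3: look up each token in the codebook and concatenate frames."""
    return [sample for t in tokens for sample in CODEBOOK[t]]


audio = [0.4, 0.6, -0.6, -0.4] * 4      # a rough square wave
tokens = encode(audio)                  # audio -> discrete tokens
model = train_bigram(tokens)            # "train" on the token sequence
generated = generate(model, tokens[0], 3)
waveform = decode(generated)            # tokens -> audio samples
print(tokens, generated, waveform)
```

The point of the sketch is only the shape of the pipeline: once audio is a sequence of integers, the model in the middle does not need to know it is dealing with sound, which is why an ordinary LLM can be used there.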