Sadly they don't publish any training or fine-tuning code, so this isn't "open" in the way that Flux or Stable Diffusion are "open".
If you want better "open" models, these all sound better for zero-shot:
Zero-shot TTS: MaskGCT, MegaTTS3
Zero-shot VC: Seed-VC, MegaTTS3
Granted, only Seed-VC has training/fine-tuning code, but all of these models sound better than Chatterbox. So if you have to use something you can't fine-tune and you need a better zero-shot fit to your voice, use one of these models instead. (Especially ByteDance's MegaTTS3. ByteDance research runs circles around most TTS research teams except for ElevenLabs: they have far more money and PhD researchers than the smaller labs, plus a copious amount of training data.)
A bit on the nose that they used a sample from a professional voice actor (Jennifer English) as the default reference audio file in that huggingface tool.
Great tip. I hadn't heard of MegaTTS3.
But what's the inference speed like on these? Can you use them in a real-time interaction with an agent?
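For the real-time question, the usual metric is the real-time factor (RTF): wall-clock synthesis time divided by the duration of the generated audio, where RTF below 1.0 means the model keeps up with playback. A minimal sketch of measuring it; the `synthesize` stub below is a placeholder, not any of these models' actual APIs, so you'd swap in your own inference call:

```python
import time
import numpy as np

SAMPLE_RATE = 24000  # a typical TTS output sample rate

def synthesize(text: str) -> np.ndarray:
    """Placeholder for a real TTS inference call (e.g. MegaTTS3 or
    Seed-VC); here it just returns one second of silence so the
    benchmark harness is runnable on its own."""
    return np.zeros(SAMPLE_RATE, dtype=np.float32)

def real_time_factor(text: str) -> float:
    """RTF = synthesis wall-clock time / duration of generated audio.
    RTF < 1.0 means faster than real time."""
    start = time.perf_counter()
    wav = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(wav) / SAMPLE_RATE
    return elapsed / audio_seconds

print(f"RTF: {real_time_factor('Hello there.'):.3f}")
```

For interactive agent use you'd also care about time-to-first-audio (latency before the first chunk streams out), which this whole-utterance measurement doesn't capture.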
Fun to play with.
It makes my Australian accent sound very English though, in a posh RP way.
Very natural sounding, but not at all recreating my accent.
Still, amazingly clear and perfect for most TTS uses where you aren't actually impersonating anyone.
How does this work from a privacy standpoint? Can they use recorded samples for training?