Comment by echelon
3 days ago
1. Any non-OpenAI, non-Google, non-ElevenLabs player is going to have to aggressively open-source or they'll become 100% irrelevant. The TTS market leaders are obvious and deeply entrenched, so Resemble, Play(HT), et al. have to cater hard to developers by offering up their weights [1].
2. This is CYA for that. Without watermarking, there will be cries from the media about abuse, especially from anti-AI outfits like 404Media [2].
[1] This is the right way to do it. Offer source code and weights, plus your own API/fine-tuning service so developers don't have to deal with the hassle. That's how they win back some market share.
[2] https://www.404media.co/wikipedia-pauses-ai-generated-summar...
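Chatterbox's watermark is at least checkable after the fact. A minimal sketch using Resemble's Perth watermarker, assuming the resemble-perth package and the PerthImplicitWatermarker API shown in the Chatterbox README (the file name is hypothetical):

    # Check whether a generated clip carries Resemble's Perth watermark.
    # Assumes: pip install resemble-perth librosa
    import perth
    import librosa

    audio, sr = librosa.load("generated.wav", sr=None)  # hypothetical clip
    watermarker = perth.PerthImplicitWatermarker()
    watermark = watermarker.get_watermark(audio, sample_rate=sr)
    print(f"Extracted watermark: {watermark}")  # 0.0 = absent, 1.0 = present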
Never mind, this is just ~3/10 open, or not really open at all [1]:
https://github.com/resemble-ai/chatterbox/issues/45#issuecom...
> For now, that means we’re not releasing the training code, and fine-tuning will be something we support through our paid API (https://app.resemble.ai). This helps us pay the bills and keep pushing out models that (hopefully) benefit everyone.
Big bummer here, Resemble. This is not at all open.
For everyone stumbling upon this, there are better "open weights" models than Resemble's Chatterbox TTS:
Zeroshot TTS: MaskGCT, MegaTTS3
Zeroshot VC: Seed-VC, MegaTTS3
These are really good, robust models that score higher on openness.
Unfortunately, only Seed-VC is fully open. But all of the above still beat Resemble's Chatterbox in zero-shot MOS (we tested extensively), especially the mega-OP Chinese models.
(ByteDance slaps at all things AI. Their new, still-secretive video model is better than Veo 3, if you haven't already seen it [2]!)
You can totally ignore this model masquerading as "open". Resemble isn't really being generous here; this is cheap wool-over-the-eyes trickery. They know they hold all the cards, and really, if you're just going to use an API, why not use ElevenLabs?
Shame on y'all, Resemble. This isn't "open" AI.
The Chinese labs are going to wipe the floor with everyone in TTS. ByteDance released their model more openly than yours, and it sounds way better and generalizes to new voices with higher speaker similarity.
Playing with open source is a path forward, but it has to be in good faith. Please do better.
[1] "10/10" open includes: 1. model code, 2. training code, 3. fine tuning code, 4. inference code, 5. raw training data, 6. processed training data, 7. weights, 8. license to outputs, 9. research paper, 10. patents. For something to be a good model, it should have 7/10 or above.
[2] https://artificialanalysis.ai/text-to-video/arena?tab=leader...
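To make the rubric in [1] concrete, here's a toy scoring function; the Chatterbox asset list below just encodes this thread's rough "~3/10" claim, not an audited inventory:

    # Toy scoring of the 10-point openness rubric from footnote [1].
    AXES = [
        "model code", "training code", "fine-tuning code", "inference code",
        "raw training data", "processed training data", "weights",
        "license to outputs", "research paper", "patents",
    ]

    def openness_score(released: set[str]) -> str:
        return f"{sum(a in released for a in AXES)}/10"

    # Rough claim from this thread, not an audited inventory:
    chatterbox = {"model code", "inference code", "weights"}
    print(openness_score(chatterbox))  # -> 3/10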
The weights are indeed open (both accessible and licensing-wise): you don't need to put that in square quotes. Training code is not. You can fine-tune the weights yourself with your own training code. Saying it isn't open is like saying ffmpeg isn't open because it doesn't do everything I need it to do and I have to wrap it with my own code to achieve my goals.
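To be concrete about what "your own training code" means, here's a skeleton; DummyTTS, the checkpoint path, and the loss are stand-ins, and the real work is wiring in the actual Chatterbox modules and a text/audio dataset:

    # Generic PyTorch fine-tuning skeleton. Everything model-specific here
    # is a placeholder; swap in the released Chatterbox checkpoint and a
    # real dataset to make it do anything useful.
    import torch
    import torch.nn as nn

    class DummyTTS(nn.Module):  # stand-in for the real architecture
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(80, 80)

        def forward(self, x):
            return self.net(x)

    model = DummyTTS()
    # model.load_state_dict(torch.load("chatterbox.pt"))  # hypothetical path
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for step in range(100):
        feats = torch.randn(8, 80)   # stand-in for mel features from your data
        target = torch.randn(8, 80)  # stand-in for the supervision signal
        loss = nn.functional.mse_loss(model(feats), target)
        opt.zero_grad()
        loss.backward()
        opt.step()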
It's really weird to say ByteDance's release is "more open" when the WaveVAE encoder isn't released at all, only the decoder, so new voices require submitting your sample to a public GDrive folder and getting the extracted latents back through another public GDrive folder.
FYI, the term is scare quotes (because they imply suspicion), not square quotes
Machine learning assets are not binary "open" or "closed". There is a continuum of openness.
To make a really poor analogy, this repo is like a version of Linux that you can't cross-compile or port.
To make another really poor (but fitting) analogy, this is like an "open core" SaaS platform that you know you'll never be able to run the features that matter on your own.
This repo scores really low on the "openness" continuum. In this case, you're very limited in what you can do with Chatterbox TTS. You certainly can't improve it or fit it to your data.
> You can fine-tune the weights yourself with your own training code.
Nobody is ever going to build that, and they know it. If it were practical, they'd provide it themselves.
If you're considering Chatterbox TTS, just use MegaTTS3 [1] instead. It's better by all accounts.
[1] https://github.com/bytedance/MegaTTS3
Can't make everyone happy :)
This space is getting pretty crowded.
If you're going to drop weights on unsuspecting developers (who might not be familiar with TTS) and let them think those weights will fit their use case, that's a bit of a bait-and-switch.
Chatterbox TTS fine-tunes are only available over the API. That's an incredibly saturated market, and there are better-quality, cheaper models for it.
Chatterbox TTS is comparable to semi-open weights already released by ByteDance and other labs, and those models already sound and perform better.
It'd be truly exciting if Chatterbox fine-tunes could be done as open weights, similar to how Flux operates. Black Forest Labs has an entire open-weights ecosystem built around its models. While they do withhold their pro / highest-quality variants, they always release open weights with training code for each commercial release. That's a much better approach for courting open-source developers.
Another company doing "open weights" right is Lightricks with LTX-1. They have a commercial studio, but they release all of their weights and tuning code in the open.
I don't see how this is a carrot for open source. It's an ad for the hosted API.
Not a single top-tier lab has shipped a "10/10 open" model of any type, for any learning application, since ResNet; it's not fair to shit on them solely for this.
>Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).
It is highly amusing that they still believe they can put that genie back in the bottle with their usual crybully bullshit.
Some measures like that still sort of work. Try loading a scanned picture of a dollar bill into Photoshop. Try printing one on a color printer. Try printing anything on a color printer without the yellow tracking dots.
A lock need not be infinitely strong to be useful; it just needs to take more resources to crack than the locked thing is worth.