Comment by echelon
3 days ago
Never mind, this is just ~3/10 open, or not really open at all [1]:
https://github.com/resemble-ai/chatterbox/issues/45#issuecom...
> For now, that means we’re not releasing the training code, and fine-tuning will be something we support through our paid API (https://app.resemble.ai). This helps us pay the bills and keep pushing out models that (hopefully) benefit everyone.
Big bummer here, Resemble. This is not at all open.
For everyone stumbling upon this, there are better "open weights" models than Resemble's Chatterbox TTS:
Zero-shot TTS: MaskGCT, MegaTTS3
Zero-shot VC: Seed-VC, MegaTTS3
These are really good, robust models that score higher on openness.
Unfortunately, only Seed-VC is fully open. But all of the above still beat Resemble's Chatterbox in zero-shot MOS (we tested extensively), especially the mega-OP Chinese models.
(ByteDance slaps with all things AI. Their new secretive video model is better than Veo 3, if you haven't already seen it [2]!)
You can safely ignore this model masquerading as "open". Resemble isn't really being generous here; this is cheap wool-over-the-eyes trickery. They know they hold all the cards, and really, if you're just going to use an API, why not just use ElevenLabs?
Shame on y'all, Resemble. This isn't "open" AI.
The Chinese labs are going to wipe the floor with everyone in TTS. ByteDance released their model in a more open manner than yours, and it sounds way better and generalizes to new voices with higher speaker similarity.
Playing with open source is a path forward, but it has to be in good faith. Please do better.
[1] "10/10" open includes: 1. model code, 2. training code, 3. fine tuning code, 4. inference code, 5. raw training data, 6. processed training data, 7. weights, 8. license to outputs, 9. research paper, 10. patents. For something to be a good model, it should have 7/10 or above.
[2] https://artificialanalysis.ai/text-to-video/arena?tab=leader...
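To make the tally mechanical, here's a trivial sketch (the helper itself is hypothetical; the criteria and the ~3/10 figure are the ones above):

    # Hypothetical helper that scores a release against the ten-point rubric.
    CRITERIA = [
        "model code", "training code", "fine-tuning code", "inference code",
        "raw training data", "processed training data", "weights",
        "license to outputs", "research paper", "patents",
    ]

    def openness_score(released):
        """Count how many of the ten rubric items a release actually ships."""
        score = sum(item in released for item in CRITERIA)
        verdict = "meaningfully open" if score >= 7 else "not really open"
        return f"{score}/10 ({verdict})"

    # Chatterbox as described in this thread: model code, inference code, weights.
    print(openness_score({"model code", "inference code", "weights"}))
    # -> "3/10 (not really open)"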
The weights are indeed open (both accessible and licensing-wise): you don't need to put that in square quotes. Training code is not. You can fine-tune the weights yourself with your own training code. Saying that isn't open is like saying ffmpeg isn't open because it doesn't do everything I need it to do and I have to wrap it with my own code to achieve my goals.
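To be concrete, the loop itself is standard PyTorch. A minimal sketch, assuming you rebuild the architecture from the released model/inference code and load the checkpoint into it (the placeholder module, checkpoint path, and loss here are hypothetical, not Resemble's):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder architecture: in practice, reconstruct the model from the
    # released model/inference code, then load Resemble's checkpoint into it.
    model = nn.Linear(80, 80)
    # model.load_state_dict(torch.load("chatterbox_weights.pt"))  # hypothetical path
    model.train()

    # Your own (input, target) pairs; the real loss function is yours to define.
    dataset = TensorDataset(torch.randn(64, 80), torch.randn(64, 80))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    loss_fn = nn.MSELoss()

    for x, y in DataLoader(dataset, batch_size=8):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()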
It's really weird to say ByteDance's release is "more open" when the WaveVAE encoder isn't released at all, only the decoder, so new voices require submitting your sample to a public GDrive folder and getting the extracted latents back through another public GDrive folder.
FYI, the term is scare quotes (because they imply suspicion), not square quotes
Machine learning assets are not binary "open" or "closed". There is a continuum of openness.
To make a really poor analogy, this repo is like a version of Linux that you can't cross-compile or port.
To make another really poor (but fitting) analogy, this is like an "open core" SaaS platform that you know you'll never be able to run the features that matter on your own.
This repo scores really low on the "openness" continuum. In this case, you're very limited in what you can do with Chatterbox TTS. You certainly can't improve it or fit it to your data.
> You can fine-tune the weights yourself with your own training code.
This will never be built by anyone, and they know that. If it could be, they'd provide it themselves.
If you're considering Chatterbox TTS, just use MegaTTS3 [1] instead. It's better by all accounts.
[1] https://github.com/bytedance/MegaTTS3
> This will never be built by anyone, and they know that. If it could be, they'd provide it themselves.
Community fine-tuning code has been developed in the past for open-weights models without public first-party training code; LLaMA's weights-only release spawned Alpaca and countless other community fine-tunes, for example.
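The pattern is well established with off-the-shelf tooling; a minimal sketch using Hugging Face peft (the model id and target module names are placeholders):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder id: any weights-only release on the Hub works the same way.
    base = AutoModelForCausalLM.from_pretrained("some-org/open-weights-model")

    # Inject low-rank adapters; only these small matrices get trained, so no
    # first-party training code is needed.
    lora = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "v_proj"])  # placeholder names
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # typically well under 1% of the base model
    # ...then train with any standard loop or transformers.Trainer.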
Why can't you improve it or fit it to your data?
This can be cross-compiled/ported in the Linux analogy. The Linux analogy would be more like: a kernel dev wrote code for some part of the Linux kernel using JetBrains' CLion, and used features of CLion that made the process much easier than if he had written the code in `nano`. By your logic, the resulting kernel code is not "open" because the tooling used to create it is not open. This is, of course, nonsense.
I agree that the project as a whole is less open than it could be, but the weights are indeed as open as they can be, no scare quotes required.
Can't make everyone happy :)
This space is getting pretty crowded.
If you're going to drop weights on unsuspecting developers (who might not be familiar with TTS) and make them think that they'll fit their use case, that's a bit of a bait-and-switch.
Fine-tuning Chatterbox TTS is only available over the API. That's an incredibly saturated market, and there are better-quality and cheaper models for it.
Chatterbox TTS is equivalent to already-released semi-open weights from ByteDance and other labs, and those models already sound and perform better.
It'd be truly exciting if Chatterbox fine-tunes could be done as open weights, similar to how Flux operates. Black Forest Labs has an entire open-weights ecosystem built around them. While they do withhold their pro/highest-quality variants, they always release open weights with training code for each commercial release. That's a much better model for courting open source developers.
Another company doing "open weights" right is Lightricks with LTX-Video. They have a commercial studio, but they release all of their weights and tuning code in the open.
I don't see how this is a carrot for open source. It's an ad for the hosted API.
Not a single top-tier lab has released a "10/10 open" model of any type, for any learning application, since ResNet; it's not fair to shit on them solely for this.