Comment by teraflop
3 days ago
> Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.
Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...
I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?
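To make the concern concrete: when the watermark is a one-line post-processing call rather than something baked into the weights, removing it is as simple as not calling it. A minimal sketch (names like `synthesize` and `apply_watermark` here are illustrative stand-ins, not the actual Chatterbox/Perth code):

```python
import numpy as np

def synthesize(text: str, sr: int = 24000) -> np.ndarray:
    # Stand-in for the model's forward pass (hypothetical):
    # returns one second of a 220 Hz tone as fake "speech".
    t = np.arange(sr) / sr
    return 0.1 * np.sin(2 * np.pi * 220 * t)

def apply_watermark(wav: np.ndarray, sr: int) -> np.ndarray:
    # Stand-in for a neural watermarker: mixes in a
    # low-amplitude carrier that a detector can look for.
    t = np.arange(len(wav)) / sr
    return wav + 1e-4 * np.sin(2 * np.pi * 5000 * t)

sr = 24000
wav = synthesize("hello", sr)
out = apply_watermark(wav, sr)  # comment out this one call and the output is unmarked
```

Because the two steps are independent functions, nothing ties the watermark to generation itself; a watermark genuinely embedded in the model weights would not be separable this way.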
Possibly a sort of CYA gesture, kinda like how the original Stable Diffusion shipped with a content filter, IIRC. It could also be there to keep people from accidentally getting peanut butter in the toothpaste WRT training data, i.e. to mark generated audio so it doesn't leak back into future training sets.
Stable Diffusion (or rather Automatic1111, which was initially the UI of choice for SD models) also had a joke/fake "watermark" setting that deliberately did nothing besides poke fun at people who thought open-source projects would really waste time developing something that could easily be stripped or reverted by virtue of being open source anyway.
Yeah, there's even a flag to turn it off in the parser `--no-watermark`. I assumed they added it for downstream users pulling it in as a "feature" for their larger product.
1. Any non-OpenAI, non-Google, non-ElevenLabs player is going to have to aggressively open source or they'll become 100% irrelevant. The TTS market leaders are obvious and deeply entrenched, and Resemble, Play(HT), et al. have to aggressively cater to developers by offering up their weights [1].
2. This is CYA for that. Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).
[1] This is the right way to do it. Offer source code and weights, offer their own API/fine tuning so developers don't have to deal with the hassle. That's how they win back some market share.
[2] https://www.404media.co/wikipedia-pauses-ai-generated-summar...
Never mind, this is just ~3/10 open, or not really open at all [1]:
https://github.com/resemble-ai/chatterbox/issues/45#issuecom...
> For now, that means we’re not releasing the training code, and fine-tuning will be something we support through our paid API (https://app.resemble.ai). This helps us pay the bills and keep pushing out models that (hopefully) benefit everyone.
Big bummer here, Resemble. This is not at all open.
For everyone stumbling upon this, there are better "open weights" models than Resemble's Chatterbox TTS:
Zeroshot TTS: MaskGCT, MegaTTS3
Zeroshot VC: Seed-VC, MegaTTS3
These are really good robust models that score higher in openness.
Unfortunately, only Seed-VC is fully open. But all of the above still beat Resemble's Chatterbox in zero-shot MOS (we tested a lot), especially the mega-OP Chinese models.
(ByteDance slaps with all things AI. Their new secretive video model is better than Veo 3, if you haven't already seen it [2]!)
You can totally ignore this model masquerading as "open". Resemble isn't really being generous here; this is cheap wool-over-the-eyes trickery. They know they retain all of the cards, and really, if you're just going to use an API, why not just use ElevenLabs?
Shame on y'all, Resemble. This isn't "open" AI.
The Chinese are going to wipe the floor with TTS. ByteDance released their model in a more open manner than yours, and it sounds way better and generalizes to voices with higher speaker similarity.
Playing with open source is a path forward, but it has to be in good faith. Please do better.
[1] A "10/10" open release includes: 1. model code, 2. training code, 3. fine-tuning code, 4. inference code, 5. raw training data, 6. processed training data, 7. weights, 8. license to outputs, 9. research paper, 10. patents. To count as meaningfully open, a model should score 7/10 or above.
[2] https://artificialanalysis.ai/text-to-video/arena?tab=leader...
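For what it's worth, the rubric above reduces to a simple checklist count (the criteria and the ~3/10 read of Chatterbox are this comment's own claims, not any official standard):

```python
CRITERIA = [
    "model code", "training code", "fine-tuning code", "inference code",
    "raw training data", "processed training data", "weights",
    "license to outputs", "research paper", "patents",
]

def openness_score(released: set[str]) -> int:
    # Count how many of the ten criteria a release satisfies.
    return sum(c in released for c in CRITERIA)

# Rough read of Chatterbox per the comment above (the commenter's claim):
chatterbox = {"model code", "inference code", "weights"}
print(openness_score(chatterbox))  # prints 3
```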
The weights are indeed open (both accessible and licensing-wise): you don't need to put that in scare quotes. Training code is not. You can fine-tune the weights yourself with your own training code. Saying that isn't open is like saying ffmpeg isn't open because it doesn't do everything I need it to do and I have to wrap it with my own code to achieve my goals.
Can't make everyone happy :)
Not a single top-tier lab has released a "10/10 open" model for any model type or learning application since ResNet; it's not fair to shit on them solely for this.
>Without watermarking, there will be cries from the media about abuse (from anti-AI outfits like 404Media [2] especially).
It is highly amusing that they still believe they can put that genie back in the bottle with their usual crybully bullshit.
Some measures like that still sort of work. Try loading a scanned picture of a dollar bill into Photoshop. Try printing it on a color printer. Try printing anything on a color printer without the yellow tracking dots.
A lock need not be infinitely strong to be useful; it just needs to take more resources to crack than the locked thing is worth.