Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

1 day ago (github.com)

That's cool and useful.

IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).

[0] https://github.com/devnen/Chatterbox-TTS-Server

  • I quite like IndexTTS2 personally; it does voice cloning and also lets you modulate emotion manually through emotion vectors, which I've found to be quite a powerful tool. It's not necessarily something everyone needs, but it's really cool technology in my opinion.

    It's been particularly useful for a model orchestration project I've been working on. I have an external emotion classification model driving both the LLM's persona and the TTS output so they stay relatively consistent. The affect system also influences which memories are retrieved; it's more likely to retrieve 'memories' created in the current affect state. IndexTTS2 was pretty much the only TTS that gave the level of control I felt was necessary.

  • Chatterbox-TTS has MUCH, MUCH better output quality, though. The output from Sopro TTS (based on the video embedded on GitHub) is absolutely terrible and completely unusable for any serious application, while Chatterbox's outputs are incredible.

    I have an RTX 5090, so not exactly what most consumers will have, but still accessible. It's also very fast: around 2 seconds of audio per 1 second of generation.

    Here's an example I just generated (first try, 22 seconds runtime, 14 seconds of generation): https://jumpshare.com/s/Vl92l7Rm0IhiIk0jGors

    Here's another one (20 seconds of generation, 30 seconds of runtime), which clones a voice from a YouTuber (I don't use it for nefarious purposes; it's just for the demo): https://jumpshare.com/s/Y61duHpqvkmNfKr4hGFs with the original source for the voice: https://www.youtube.com/@ArbitorIan

    • You should try it! I wouldn’t say it’s the best, far from it. But I also wouldn’t say it’s terrible. If you have a 5090, then yes, you can run much more powerful models in real time. Chatterbox is a great model, though.

      2 replies →

    • I've been using Higgs-Audio for a while now as my primary TTS system. How would you say Chatterbox compares to it, if you have experience with both?

      1 reply →

Super nice! I've been using Kokoro locally, which is 82M parameters and runs (and sounds) amazing! https://huggingface.co/hexgrad/Kokoro-82M

  • I tried Kokoro-JS, which I think runs in the browser, and it was way too slow (high latency); it also didn't support the language I wanted.

    • I have a 5070 in my rig. What I'm running is Kokoro in a Python/FastAPI backend. I also use local quantized models (I swap between ministral-3 and Qwen3) as "the brains" (offloading to GPT-5.2, including web search, for "complex" tasks or those requiring the web). In the backend I use Kokoro to generate WAV bytes that I send to the frontend. The frontend is just a simple HTML page with a textbox and a button, invoking a `fetch()`. I type, and it responds back in audio. The round-trip time is <1 second for me, unless it needs to call the OpenAI API for "complex" tasks. I have yet to integrate STT, and then the cycle is complete. That's the stack; it's not slow at all, but it depends on your hardware.
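
      If it helps, the backend part boils down to something like this (a minimal sketch assuming the `kokoro` package's KPipeline API from the Kokoro-82M model card; the endpoint name and payload shape are made up for illustration):

        import io

        import numpy as np
        import soundfile as sf
        from fastapi import FastAPI
        from fastapi.responses import Response
        from kokoro import KPipeline
        from pydantic import BaseModel

        app = FastAPI()
        pipeline = KPipeline(lang_code="a")  # 'a' selects American English

        class SpeakRequest(BaseModel):
            text: str

        @app.post("/speak")
        def speak(req: SpeakRequest) -> Response:
            # KPipeline yields (graphemes, phonemes, audio) chunks at 24 kHz
            chunks = [
                np.asarray(audio)
                for _, _, audio in pipeline(req.text, voice="af_heart")
            ]
            buf = io.BytesIO()
            sf.write(buf, np.concatenate(chunks), 24000, format="WAV")
            return Response(content=buf.getvalue(), media_type="audio/wav")

      The frontend then just does a fetch("/speak", ...) and plays the returned bytes.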

What is "zero-shot" supposed to mean?

  • I believe in this case it means that you do not need to provide additional voice samples (beyond the one reference) to get a good clone.

    • It means there is zero training involved in getting from voice sample to voice duplicate. There used to be models that took a voice sample, ran 5 or 10 training iterations (which of course takes 10 minutes, or a few hours if you have hardware as shitty as mine), and only then duplicated the voice.

      With this, you give the voice sample as part of the input, and it immediately tries to duplicate the voice.
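
      Roughly, the difference looks like this (a toy sketch; every name below is invented, not any real library):

        import time

        class FineTunedCloner:
            """The old approach: update model weights on the voice sample first."""

            def adapt(self, voice_sample: str, steps: int = 10) -> None:
                for _ in range(steps):
                    time.sleep(0.01)  # stand-in for one gradient step (slow for real)

            def synthesize(self, text: str) -> str:
                return f"<audio of {text!r} in the adapted voice>"

        class ZeroShotCloner:
            """Zero-shot: the sample is just another input at inference time."""

            def synthesize(self, text: str, reference: str) -> str:
                # One forward pass encodes `reference` into a speaker embedding;
                # the model weights never change.
                return f"<audio of {text!r} conditioned on {reference}>"

        old = FineTunedCloner()
        old.adapt("me.wav")  # this was the minutes-to-hours step
        print(old.synthesize("Hello there"))

        new = ZeroShotCloner()
        print(new.synthesize("Hello there", reference="me.wav"))  # immediate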

      1 reply →

Tried English. There are similarities. Really impressive for such a budget. Also incredibly easy to use; thanks for this.

  • But it's English-only, so what else could you have tried? Asking because I'm interested in a German version :)

It's impressive given the constraints!

Would you consider releasing a more capable version that renders with fewer artifacts (and maybe requires a bit more processing power)?

Chatterbox is my go-to, this could be a nice alternative were it capable of high-fidelity results!

  • This is my side “hobby”, and compute is quite expensive. But if the community’s response is good, I will definitely think about it! Btw, Chatterbox is a great model and an inspiration.

    • Very cool work, especially for a hobby project.

      Do you have any plans to publish a blog post on how you did it? What training data did you use, and how much? Your training and ablation methodology, etc.?

What does "zero-shot" mean in this context?

  • The *-shot jargon is just in-crowd nonsense that has been meaningless since day one (or zero). Like Big O notation but even more arbitrary (as evidenced by all the answers to your comment).

  • > Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

    https://en.wikipedia.org/wiki/Zero-shot_learning

    edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:

    We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.

    "Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.

    Providing reference audio to a model at inference-time is no different than including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.

    • This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that data from the target speaker (the one you want the generated speech to sound like) does not need to be included in the data used to train the TTS model. In other words, you can provide an audio sample of the target speaker together with the text to be spoken, and generate audio that sounds like it was spoken by that speaker.

      19 replies →

    • I think the point is that it's not zero-shot if a sample is needed. A system that requires one sample is usually considered one-shot, or few-shot if it needs a few, and so on.

I don't understand the comments here at all. I played the audio and it sounds absolutely horrible, far worse than computer voices sounded fifteen years ago. Not even the most feeble minded person would mistake that as a human. Am I not hearing the same thing everyone else is hearing? It sounds straight up corrupted to me. Tested in different browsers, no difference.

  • As I said, some reference voices can lead to bad voice quality. But if it sounds that bad, that’s probably not the cause. I’d love to dig into it if you want.

    • I agree with the comment above. I have not logged into Hacker News in _years_, but did so today just to weigh in here. If people are saying that the audio sounds great, then there is definitely something going on with a subset of users where we are only hearing garbled words with a LOT of distortion. This does not sound like natural speech to me at all. It sounds more like a warped cassette tape. And I do not mean to slight your work at all. I am actually incredibly puzzled trying to understand why my perception of this is so radically different from everyone else's!

      4 replies →

  • Yes, if this selected piece is the best that was available to use as a showcase, it's immediately off-putting in its distortion and mangling of pronunciation.

  • Same here. I tried a few different voices, including my kids' and my own; the generated audio is not similar at all. It's not even a proper voice.

  • Thank you, I was scrolling and scrolling in utter disbelief. It sounds absolutely dreadful. Would drive me nuts to listen to for more than a minute.

This is very cool! And it'll only get better. I do wonder if, at least as a patch-up job, they could do some light audio processing to remove the raspiness from the voices.

Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.)

Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.

  • Chatterbox TTS does this in “voice cloning” mode but you have to implement the streaming part yourself.

    There are two inputs: audio A (“style”) and B (“content”). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc. are taken from B.

    Strictly speaking, voice cloning models like this and Chatterbox are not “TTS”; they're better thought of as “S+STS”, that is, speech+style to speech.
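
    For reference, the non-streaming version is only a few calls. This assumes the ChatterboxVC interface as shown in the resemble-ai/chatterbox README; double-check the current API before relying on it:

      import torchaudio as ta
      from chatterbox.vc import ChatterboxVC

      model = ChatterboxVC.from_pretrained("cuda")  # or "cpu"

      # `audio` is input B (content/prosody); `target_voice_path` is input A (timbre)
      wav = model.generate(
          "content.wav",
          target_voice_path="style.wav",
      )
      ta.save("converted.wav", wav, model.sr)

    For streaming, you would chunk the input audio and run each chunk through generate() yourself, crossfading at the chunk boundaries.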

  • Yes, check out RVC (Retrieval-based Voice Conversion), which I believe is the only good open-source voice changer. Currently there's a bit of a conflict between the original creator and the current developers, so don't use the main fork. I think you'll be able to find a more up-to-date fork that's in English.

I just had some amusing results using text with lots of exclamations and turning up the temperature. Good fun.

Impressive! The cloning and voice affect is great. Has a slight warble in the voice on long vowels, but not a huge issue. I'll definitely check it out - we could use voice generation for alerting on one of our projects (no GPUs on hardware).

  • Cool! Yeah, the voice quality really depends on the reference audio. Also, mess with the parameters. All feedback is welcome.

Very nice to have done this by yourself, locally.

I wish there were an open/local TTS model with voice cloning as good as ElevenLabs (even for non-English languages).

What could possibly go wrong...

Don't you ever think about what the balance of good and bad is when you make something like this? What's the upside? What's the downside?

In this particular case I can only see downsides, if there are upsides I'd love to hear about them. All I see is my elderly family members getting 'me' on their phones asking for help, and falling for it.

I've gotten into the habit of waiting for the other person to speak first when I answer the phone now and the number is unknown to me.

  • I am unhappy about the criminal dimension of voice cloning, too, but there are plenty of use cases.

    e.g. If I could have a (local!) clone of my own voice, I could get lots of wait-on-the-phone chores done by typing on my desktop to VOIP while accomplishing other things.

    • But why do you need it to be a clone of your voice? A generic TTS like Siri or a vocaloid would be sufficient.

  • Yes, you are right. However, there are many upsides to this kind of technology. For example, it can restore the voices of people who have lost them to disease.

    • Ok, that's an interesting angle, I had not thought of that, but of course you'd still need a good sample of them from before that happened. Thank you for the explanation.

  • Are you under the impression that this is the first such tool? It's not. It's not even the hundredth. This Pandora's box was opened a long time ago.

I'm sure it has its uses, but for anything with a higher requirement for quality, I think Vibe Voice is the only real OSS cloning option.

F5/E2 are also very good but have plenty of bad runs; you need to keep re-rolling until you get good outputs.

Emm...I played the sample audio and it was...horrible?

How is it voice cloning if even the sample doesn't sound like any human being...

  • I should have posted the reference audio used with the examples. Honestly, it doesn’t sound so different from them. Voice cloning can be from a cartoon too; it doesn’t have to be from a human being.

    • A before / after with the reference and output seems useful to me, and maybe a range from more generic to more recognizable / celebrity voice samples so people can kinda see how it tackles different ones?

      (Prominent politician or actor or somebody with a distinct speaking tone?)

  • Also, I didn’t want to use known voices as the examples, so I ended up using generic ones from the datasets.

Sorry, but the quality is too bad.

A scammer's dream.