Comment by nowittyusername

1 month ago

I have been playing around with over 10 stt systems in the last 25 days, and it's really weird to read this article, as my experience is the opposite. Stt models are amazing today. They are stupid fast, sound great, and are very simple to implement, as Hugging Face Spaces code is readily available for any model. What's funny is that the model he was talking about, "supertonic", is exactly the model I would have recommended if people wanted to see how amazing the tech has become. The model is tiny, runs at 55x real time on any potato, and sounds amazing. Also, I think he is implementing his models wrong. He mentions that some models don't have streaming and you have to wait for the whole chunk to be processed. But that's not a limit in any meaningful way, as you can define the chunk. You can simply make the first n characters within the first sentence the chunk, process that first, and play it immediately while the rest of the text is being processed. ttfs and ttfa on all modern-day models are well below 0.5, and for supertonic it was 0.05 in my tests.
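
A minimal Python sketch of that chunking idea, for illustration only. `split_first_chunk`, `speak_in_chunks`, and the `synthesize` callable are hypothetical names, not part of supertonic or any other library mentioned here, and the 60-character default is an arbitrary assumption.

```python
import re
from typing import Callable, Iterator, Tuple


def split_first_chunk(text: str, max_chars: int = 60) -> Tuple[str, str]:
    """Split text into a short first chunk and the remainder."""
    text = text.strip()
    # End of the first sentence: first ., ! or ? followed by whitespace or end of text.
    match = re.search(r"[.!?](?=\s|$)", text)
    sentence_end = match.end() if match else len(text)
    if sentence_end <= max_chars:
        cut = sentence_end
    else:
        # Back off to the last space within the limit so no word is split.
        cut = text.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars
    return text[:cut].strip(), text[cut:].strip()


def speak_in_chunks(
    text: str,
    synthesize: Callable[[str], object],  # hypothetical: text in, audio out
    max_chars: int = 60,
) -> Iterator[object]:
    """Yield audio for the short first chunk before synthesizing the rest."""
    first, rest = split_first_chunk(text, max_chars)
    if first:
        yield synthesize(first)
    if rest:
        yield synthesize(rest)
```

Because `speak_in_chunks` is a generator, the caller gets the first chunk's audio before the rest is synthesized. Actually overlapping playback with synthesis of the remainder would still need a background thread or async task on top of this.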

What's your experience at high speeds, with garbled speech artifacts and pronunciation accuracy?

  • With supertonic, or overall? Overall, most do pretty well, though some are funky. suprano was so bad no matter what I did that I had to rule it out of my top contenders for anything. supertonic was close to my number one choice for my agentic pipeline, as it was so insanely fast and the quality was great, but it didn't have the bells and whistles some other models have, so I'm saving it for CPU-only projects in the future. If you are gonna run on a GPU, I would suggest chatterbox or pocket tts. Chatterbox is my top contender as of now because it sounds amazing, has cloning, and I got it down to 0.26 ttfa/ttsa once I quantized it and implemented pipecat into it. pocket tts is probably my second choice for similar reasons.

>Also I think he is implementing his models wrong.

This is something I've noticed around a lot of AI-related stuff. You really can't take any one article on it as definitive. That, and anything that doesn't publish how it was fully implemented is suspect. This goes for both affirmative and negative findings.

It reminds me a bit of the earlier days of the internet, where there was a lot of exploration of ideas occurring, but quite often the implementation and testing of those ideas left much to be desired.

Minor nitpick, but you mean "tts" not "stt" both times.

Is supertonic the best sounding model, or is there a different one you'd recommend that doesn't perform as well but sounds even better?

  • Yes, sorry, I mixed these up. supertonic is not the best sounding in my tests. It was by far the fastest, and its audio quality was decent for something so fast. If you want something that sounds better AND is also extremely fast, pocket tts is the choice: amazing quality and also crazy fast on both GPU and CPU. If you care mainly about quality, chatterbox was the best fit in my tests, but it's slower than the others. qwen 3 tts was also great, but it's unusable as a real-time agentic voice as it's too slow. They haven't released the code for streaming yet; once they release that, it will be my top contender.

Are you using them at 1000 wpm?

  • Supertonic is probably way faster than that. I wouldn't be surprised if, measured, it came out to something like 14k wpm. On my 4090 I was getting about 175x real time, while on CPU only it was 55x real time. I stopped optimizing it, but I'm sure it could be pushed further. Anyway, you should check out their repo and test it yourself. It's crazy what that team accomplished!
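
A rough way to turn a real-time factor into words per minute, as a back-of-the-envelope sketch. It assumes the synthesized voice speaks at about 150 wpm, which is a guess at a typical conversational rate and not a measured figure for supertonic; `synthesis_wpm` is a made-up helper name.

```python
def synthesis_wpm(realtime_factor: float, speaking_wpm: float = 150.0) -> float:
    """Words synthesized per minute of wall-clock time.

    realtime_factor: seconds of audio produced per second of compute.
    speaking_wpm: assumed speaking rate of the generated audio (an assumption).
    """
    return realtime_factor * speaking_wpm


cpu_wpm = synthesis_wpm(55)    # 55x real time on CPU
gpu_wpm = synthesis_wpm(175)   # 175x real time on a 4090
print(f"CPU: {cpu_wpm:.0f} wpm, GPU: {gpu_wpm:.0f} wpm")
```

Under that assumed speaking rate, the 14k wpm guess falls between the CPU and GPU figures.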