Comment by mlboss

6 months ago

Reddit post with generated audio sample: https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...

34 comments

seligman99 6 months ago

And a quick video with all of the different voices:

https://www.youtube.com/watch?v=60Dy3zKBGQg

a96 6 months ago

Thanks. I really would not want to listen to any of these regularly.
tracker1 6 months ago

Cool, thanks... aside: the last male voice sounds high/drunk.
Eduard 6 months ago

thank you!

smusamashah 6 months ago

The reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and cpu only for this quality is amazing.

soasme 6 months ago

Just made a TTS tool based on Kitten TTS, fully browser-based, no Python server backend: https://quickeditvideo.com/tts/ A TTS model of this size should be industry standard!
Retr0id 6 months ago
The people calling it "OK" probably tried it for themselves. Whatever model is being demoed in that video is not the same as the 25MB model they released.
- darkwater 6 months ago
  
  Nope, looks like the default voice is the worst and it's not in the demo. A Reddit user generated these as well https://limewire.com/d/28CRw#UPuRLynIi7
  
- fortyseven 6 months ago
  
  It did say this was a preview release, so I'll reserve judgement until that's out the door.
- iab 6 months ago
  
  Local quality is very bad
sergiotapia 6 months ago
https://vocaroo.com/1njz1UwwVHCF
It doesn't sound so good. Excellent technical achievement, and it may just improve more and more! But for now I can't use it for consumer-facing applications.
- divamgupta 6 months ago
  
  We are still training the model. We expect the quality to go up in the next release. This is just a preview release :)
Mackena 6 months ago

[flagged]

Zardoz84 6 months ago

Sounds very clear. For a non-native English speaker like me, it's easy to understand.

tapper 6 months ago

Sounds slow and like something from an anime

ricardobeat 6 months ago
Speech speed is always a tunable parameter and not something intrinsic to the model.
The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.
- dvh 6 months ago
  
  I think he meant overacting typical for English dubs.
  
numpad0 6 months ago

The only real questions are which Chinese gacha game they ripped data from and whether they used Claude Code or Gemini CLI for Python code. I bet one can get a formant match from output this much overfit to whatever data. This isn't going to stay up for long.

KaiserPro 6 months ago

Was it cross-trained on Futurama voices?

junon 6 months ago

That would be a feature!
archon810 6 months ago
Sounds like Mort from Family Guy.
- divamgupta 6 months ago
  
  Lol
divamgupta 6 months ago

It was not

Aachen 6 months ago

Impressive technical achievement, but in terms of whether I'd use it: oof, that male voice is like one of these fake-excited newsreaders. Like they're always at the edge of their breath. The female one is better but still someone reading out an advertisement for a product they were told they must act extra excited for. I assume this is what the majority of training data was like and not an intentional setting for the demo. Unsure whether I could get used to that

I use TTS on my phone regularly and recently also tried this new project on F-Droid called SherpaTTS, which grabs some models from Hugging Face. They're super heavy (the phone suspends other apps to disk while this runs) and sound good, but in the first news article there were already one or two mispronunciations, because it's guessing how to say uncommon or new words and no longer uses logical rules to turn text into speech.

Google and Samsung each have a TTS engine pre-installed on my device, and those sound and work fine. A tad monotonous, but it seems to always pronounce things the same way, so you can always work out what the text said.

Espeak (or -ng) is the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open source option (probably there are others that I should be trying) but it's at least the most reliable where you'll always get what is happening and you can install it on any device without licensing issues

willwade 6 months ago
If anyone else wants to try sherpaOnnx, you can try this: https://github.com/willwade/tts-wrapper We recently added the Kokoro models, which should sound a lot better. There are a LOT of models to choose from. I have a feeling the F-Droid app isn't handling cold starts very well.
- spookie 6 months ago
  
  If anyone wants to test ready to install android apks: https://k2-fsa.github.io/sherpa/onnx/tts/apk.html
divamgupta 6 months ago

Thanks a lot for the detailed feedback. We are working on some models which do not use a phonemizer
bornfreddy 6 months ago

RHvoice is pretty good, imho.