Comment by TheAceOfHearts

8 months ago

Unfortunately it's not usable if you're GPU-poor. I couldn't figure out how to run this with an old 1080. I tried VibeVoice-1.5B on my old CPU with torch.float32 and it took 832 seconds to generate a 66-second audio clip. Switching from torch.bfloat16 also introduced some weird sound artifacts in the audio output. If you're GPU-poor, the best TTS model I've tried so far is Kokoro.
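To put those timings in perspective, here is a minimal sketch of the real-time factor they imply. The function name `real_time_factor` is hypothetical and only illustrates the arithmetic; the two numbers are the ones quoted above.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures from the comment: 832 s of CPU time for a 66 s clip.
rtf = real_time_factor(832, 66)
print(f"RTF: {rtf:.1f}x slower than real time")
```

Anything above 1.0 means generation is slower than playback, so this setup is roughly an order of magnitude away from interactive use.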

Someone else mentioned in this thread that you cannot add annotations to the text to control the output. I think for these models to really level up there will have to be an intermediate step that takes your regular text as input and generates an annotated output, which can then be passed to the TTS model. That would give users way more control over the final output, since they would be able to inspect and tweak any details instead of expecting the model to get everything right in a single pass.

tempodox 8 months ago

This is ludicrous. macOS has had text-to-speech for ages with acceptable quality, and they never needed energy- and compute-expensive models for it. And it reacts instantly, not after ridiculous delays. I cannot believe this hype about “AI”, it’s just too absurd.

NitpickLawyer 8 months ago
> with acceptable quality
Compared to IBM's Stephen Hawking's chair, maybe. But Apple TTS is not acceptable quality in any modern understanding of SotA, IMO.
- selkin 8 months ago
  
  Different use cases:
  If you need a non-visual output of text, SotA is a waste of electrons.
  If you want to try and mimic a human speaker, then it ain’t.
  Question is why would you need to have the computer sound more human, except for “because I can”.
  