Comment by binsquare

5 days ago

Does anyone else find that there's hard to pin down reason of life-lessness in the speech of these voice models?

Especially in the fruit pricing portion of the video for this model. Sounds completely normal but I can immediately tell it is ai. Maybe it's intonation or the overly stable rate of speech?

IMHO it's not lifeless. It's just not overly emotional. I definitely prefer it that way. I do not want the AI to be excited. It feels so contrived.

On the video itself: Interesting, but "ideal" was pronounced wrong in German. For a promotional video, they should have checked that with native speakers. On the other hand its at least honest.

  • I hate with a passion the over-americanized "accent" of chatgpt voices. Give me a bland one any day of the week

    • Yeah that overly fake-excited voice type. Doesn't work for Europe at all. But indeed common in American customer service scenarios.

I think it's because they've crammed vision, audio, multiple voices, prosody control, multiple languages, etc into just 30 billion parameters.

I think ChatGPT has the most lifelike speech with their voice models. They seem to have invested heavily in that area while other labs focused elsewhere.

I'm not convinced its end-to-end multimodal - in that case, you'll have a speech synthesis section and this will be some of the result. You could test by having it sing or do some accents, or have it talk back to you in an accent you give it.

> Sounds completely normal but I can immediately tell it is ai.

Maybe that's a good thing?