Comment by sosodev
4 days ago
I think it's because they've crammed vision, audio, multiple voices, prosody control, multiple languages, etc into just 30 billion parameters.
I think ChatGPT has the most lifelike speech with their voice models. They seem to have invested heavily in that area while other labs focused elsewhere.
No comments yet
Contribute on Hacker News ↗