Comment by leonidasv

Question: ChatGPT voice mode seems to have too much tolerance for mispronunciation. Sometimes it understands you even when you mispronounce something in a phrase, and it isn't aware enough to correct you - it even says your pronunciation is correct if asked. It's good at grammar, though.

It makes me think the audio goes through a speech-to-text model before the answer is generated, so the nuance is lost; either that, or the model wasn't trained to distinguish between correct and incorrect pronunciations.
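Here's roughly the pipeline I imagine, as a toy sketch (every function is a stub and all names are made up; the point is just where the pronunciation information disappears):

```python
# Minimal, self-contained sketch of a cascaded voice pipeline.
# All model calls are stubbed; the point is where information is lost.

def transcribe(audio: bytes) -> str:
    """Stub STT: real STT models are trained to output the *intended*
    words, so a mispronounced 'aminal' comes back as 'animal'."""
    return "what do you call a baby animal"  # pronunciation detail gone

def chat(prompt: str) -> str:
    """Stub LLM: it only ever sees clean text, so it has nothing to
    say about pronunciation even if you ask it directly."""
    return "Your pronunciation sounds correct to me!"

def synthesize(text: str) -> bytes:
    """Stub TTS for the spoken reply."""
    return text.encode("utf-8")

def voice_turn(audio: bytes) -> bytes:
    transcript = transcribe(audio)  # mispronunciations normalized here
    reply = chat(transcript)
    return synthesize(reply)

if __name__ == "__main__":
    print(voice_turn(b"<audio of a learner saying 'aminal'>"))
```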

Does Issen have this issue too? Bad pronunciation habits are common when you're learning a new language.

In general, there aren't really models yet that can understand the nuances of your speech. Gemini 2.5's voice mode only recently changed that: I think it can understand emotions, but I'm not sure whether it can detect things like accent and mispronunciation. The problem is data: we'd need a large corpus labeled with exactly how each audio sample mispronounces a word, so the model can learn to cluster those errors. Maybe self-learning techniques without human feedback can do it somehow. Other than that, I don't see how it's even possible to train such a model with what's currently available.
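To give a sense of what that labeling would involve, here's a hypothetical record schema (the fields, phoneme lists, and sample data are all invented for illustration):

```python
# Hypothetical sketch of the kind of label a mispronunciation corpus
# would need: not just "wrong", but *how* it was wrong, per phoneme.

from dataclasses import dataclass

@dataclass
class PronunciationLabel:
    audio_path: str              # path to the recorded sample
    target_word: str             # the word the speaker attempted
    target_phonemes: list[str]   # canonical pronunciation (e.g. IPA)
    spoken_phonemes: list[str]   # what was actually produced
    errors: list[str]            # human annotation of each deviation

# One annotated sample (invented data, for illustration only):
sample = PronunciationLabel(
    audio_path="samples/learner_0042.wav",
    target_word="squirrel",
    target_phonemes=["s", "k", "w", "ɜː", "r", "ə", "l"],
    spoken_phonemes=["s", "k", "w", "ɪ", "r", "ə", "l"],
    errors=["vowel substitution: ɜː -> ɪ"],
)
```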

Yes, we do have this issue, but it's improved a bit over ChatGPT because we use multiple transcribers.
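The rough intuition behind using multiple transcribers, as a toy sketch (the outputs are stubbed, and the word-by-word comparison assumes the transcripts are already aligned, which real systems have to work harder for):

```python
# Simplified sketch: if independent STT systems disagree on a word,
# that's a signal the audio was unclear or mispronounced, even though
# no single transcript says so on its own.

from collections import Counter

def flag_unclear_words(transcripts: list[list[str]]) -> list[str]:
    """Return the word positions where the transcribers disagreed."""
    flagged = []
    for position in zip(*transcripts):  # compare word-by-word
        counts = Counter(position)
        if len(counts) > 1:             # disagreement at this slot
            flagged.append(" / ".join(counts))
    return flagged

# Three stubbed transcriber outputs for the same audio:
outputs = [
    "I saw a squirrel in the park".split(),
    "I saw a squirl in the park".split(),
    "I saw a swirl in the park".split(),
]
print(flag_unclear_words(outputs))  # ['squirrel / squirl / swirl']
```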

The models are improving, though, and they are in a very good place for English at the moment. I expect that by next year we will switch over to full voice-to-voice models.

  • This reply seems to miss the question, or at least doesn't answer it clearly: is this service overly tolerant of mispronunciations? Foundation models are becoming more tolerant over time, not less, which is the opposite of what I'd want in this case.

    • It's less tolerant of mispronunciations. There is custom prompting that explicitly tells the model to leave mistakes in and not fix them, roughly along the lines of the sketch below. It's still not perfect, and the speech-to-text module sometimes still corrects the user's pronunciation mistakes.
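A minimal sketch of that kind of instruction, assuming an STT step that accepts steering text (the prompt wording and the stubbed call are invented examples, not Issen's actual prompt):

```python
# Hypothetical sketch: the transcription step is explicitly told to
# preserve mistakes instead of cleaning them up.

TRANSCRIBE_INSTRUCTIONS = """
Transcribe the learner's speech exactly as spoken.
Do NOT correct grammar, word choice, or apparent mispronunciations.
If a word is clearly mispronounced, write it phonetically as heard.
"""

def stt_call(audio: bytes, prompt: str) -> str:
    """Stub for an STT API that accepts steering instructions;
    stubbed so the sketch stays self-contained."""
    return "what do you call a baby aminal"  # mistake left in

def transcribe_verbatim(audio: bytes) -> str:
    return stt_call(audio, prompt=TRANSCRIBE_INSTRUCTIONS)

if __name__ == "__main__":
    print(transcribe_verbatim(b"<audio of a learner saying 'aminal'>"))
```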