Comment by konovalov-nk

1 day ago

In general, there aren't really models yet that understand the nuances of your speech. Gemini 2.5's voice mode only recently changed that; I think it can understand emotions, but I'm not sure it can detect things like accents or mispronunciation. The problem is data: we'd need a large corpus labeled with exactly how each audio sample mispronounces the word, so a model can cluster those errors. Maybe self-supervised techniques without human feedback could get there somehow. Other than that, I don't see how it's even possible to train such a model with what's currently available.
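To make the labeling idea concrete: one common way to describe "how exactly" a word was mispronounced is to align the phonemes a recognizer heard against the expected phoneme sequence and record the edit operations. Here's a minimal pure-Python sketch of that alignment step (the phoneme sequences and the `align_phonemes` helper are hypothetical illustrations, not any existing library's API; a real pipeline would get the `actual` sequence from a phoneme-level ASR model):

```python
def align_phonemes(expected, actual):
    """Return the edit operations turning `expected` into `actual`,
    i.e. a crude per-phoneme mispronunciation label."""
    m, n = len(expected), len(actual)
    # dp[i][j] = edit distance between expected[:i] and actual[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if expected[i - 1] == actual[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # phoneme dropped
                           dp[i][j - 1] + 1,         # phoneme inserted
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrack to recover which phonemes were swapped, dropped, or added.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and expected[i - 1] == actual[j - 1]):
            i, j = i - 1, j - 1  # correct phoneme, no label needed
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", expected[i - 1], actual[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", expected[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, actual[j - 1]))
            j -= 1
    return list(reversed(ops))

# "water" in ARPAbet-style phonemes: a speaker saying "V" for "W" gets flagged.
expected = ["W", "AO", "T", "ER"]
actual = ["V", "AO", "T", "ER"]
print(align_phonemes(expected, actual))  # [('substitute', 'W', 'V')]
```

This only covers the labeling mechanics, not the hard part the comment points at: you still need either human annotators or a very reliable phoneme recognizer to produce the `actual` sequences at scale.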