Comment by masspro

1 day ago

I don't think I can trust TTS for language learning. I could be internalizing wrong pronunciation and wouldn't know it. One time I tried Duolingo for Japanese already knowing a bit. To their credit, I assumed it was recorded clips, until it read 'oyogu' as something like 'oyNHYAOgu', as if it had concatenated two syllable clips that don't go together. If I didn't already know the word, would I have been trying to study and replicate that nonsense? So I don't know if I could trust TTS audio for language study regardless of what kind of tech it is. Sure, mistakes can be unlearned over time spent immersing, but at much more effort than just not internalizing them in the first place.

Also, Japanese specifically has this meme where it literally is a pitch-accent language, but many people say it's not and teaching resources ignore it. E.g. 'ima' means either 'now' or 'living room' depending on whether the second mora is higher or lower. Clearly this only applies to some languages, but it's another dimension that makes it even harder for a learner to know there's a mistake. I have to imagine even other Latin-script languages probably have reading quirks where this could happen to me.

Minimax's new model is quite good. We use their voices for some of our Japanese tutors. The pitch accent is almost perfect.

There are occasionally incorrect readings or Chinese readings, but you can tell when that happens because the furigana come out different.

  • If you have the correct furigana, you could even detect when the TTS model picked the wrong reading and regenerate.

    But how do you know the furigana are correct? Unless you start out fully human-annotated text, you need some automated procedure to add furigana, which pushes the problem from "TTS AI picked the wrong reading" to "furigana AI picked the wrong reading."

    • Yes, it pushes the problem, but it's a much easier problem, and models like Gemini 2.5 Flash do very well at it.
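In case it helps anyone building this check: the mismatch test boils down to comparing two kana strings, and furigana may come back as hiragana or katakana depending on the tool, so you want to normalize first. A minimal sketch; how you recover the kana actually spoken (ASR on the generated clip, or a provider's phoneme timestamps) is a placeholder assumption, not part of any specific API:

```python
def kata_to_hira(s: str) -> str:
    """Normalize katakana to hiragana so readings compare equal.
    Covers U+30A1..U+30F6; the prolonged sound mark is left as-is."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30a1" <= c <= "\u30f6" else c
        for c in s
    )

def reading_matches(expected_furigana: str, heard_kana: str) -> bool:
    """True if the reading recovered from the audio (however you get it)
    matches the expected furigana. On False, regenerate the clip."""
    return kata_to_hira(expected_furigana) == kata_to_hira(heard_kana)

# 泳ぐ should be read およぐ; a model that produced えいぐ (the on'yomi
# of 泳) would be caught here and the clip regenerated.
assert reading_matches("およぐ", "オヨグ")
assert not reading_matches("およぐ", "えいぐ")
```

The hard part is still getting trustworthy "expected" furigana and a trustworthy "heard" reading; this only shows that the comparison itself is cheap once you have both.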

Yeah, Japanese TTS is a lot harder than it looks. I'm also building a language-learning application and constantly ran into incorrect readings. ElevenLabs, ElevenLabs v3, OpenAI, Play.ht, Azure, Google, Polly: I've tried them all. They are all really bad (more than a third of the expressions had an error somewhere).

It _is_ fixable though. It took me about a week, but I have yet to find a mistaken reading since. This also seems to be specific to Japanese - most tonal languages seem to get the correct tones (I'm not qualified to comment on how natural the tones sound, but I have yet to find a mismatch like in Japanese).
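The commenter doesn't say how they fixed it, but one common workaround is to resolve the readings yourself and hand the TTS engine pure kana, so the model never gets to choose a kanji reading at all. A toy sketch; the hand-written dictionary and naive string replacement stand in for a real morphological analyzer or an LLM furigana pass:

```python
# Toy reading table - in practice this comes from a morphological
# analyzer (MeCab etc.) or an LLM pass, not a hand-written dict.
READINGS = {"泳ぐ": "およぐ", "今": "いま"}

def to_kana(text: str, readings: dict[str, str]) -> str:
    """Replace known words with their kana readings before sending to TTS.
    Longest-match-first so 泳ぐ is replaced before any shorter entry;
    naive replace can still mangle substrings, hence "toy"."""
    for word in sorted(readings, key=len, reverse=True):
        text = text.replace(word, readings[word])
    return text

assert to_kana("今泳ぐ", READINGS) == "いまおよぐ"
```

The trade-off is that some engines produce flatter prosody on all-kana input, so you may only want to kana-ize the words the reading checker actually flagged.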

Yes. AI transcription is great, AI translation is OK (depending on language pair), but TTS is still pretty awful for most languages.

Also a Japanese learner here, albeit a beginner. As I understand it, pitch accent is a kind of stress: languages can stress a syllable with length, volume, pitch, etc. Spanish uses vowel length, Icelandic uses volume, English uses a combination of length and volume, and Swedish (just like Japanese) uses pitch. Just as in English, putting the wrong stress on a word can range from sounding foreign to being incomprehensible. (Aside: I always remember trying to say the name of the band Duran Duran to an English speaker while putting the stress on the first syllable, as is normal in Icelandic. My listener had no idea what I was saying; it took probably 30 attempts before I was corrected with the right stress.)

I think Japanese is somewhat special, though, in its large number of homonyms (words that are spelled the same), so speaking with the correct pitch becomes somewhat more important.

  • Somewhat more important, but as someone with decent Japanese who knows about pitch accent but can barely hear the difference in real time, and who never actively learned it except for a few well-known examples like bridge/chopsticks, I don't think it matters all that much. Yes, you'll sound foreign, but you'll be understood nevertheless in the vast majority of cases.