Comment by jmalicki

2 hours ago

I'm not sure what the state of the art is today, but 15 years ago I worked on a cross-lingual search engine. One challenge with Chinese was that ngram-like models for detecting common language errors (such as typos) were simply ineffective because of this.
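A toy sketch of the contrast (not the original system's code, just an illustration): an English typo still shares most character ngrams with the intended word, while a Chinese homophone substitution shares no surface form at all, so ngram overlap carries almost no signal.

```python
# Illustrative only: character-ngram overlap as a crude typo signal.

def char_ngrams(s, n=2):
    """All character ngrams of length n in the string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def overlap(a, b, n=2):
    """Jaccard overlap of character ngrams between two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / max(len(ga | gb), 1)

# An English typo keeps many bigrams in common with the intended word...
print(overlap("receive", "recieve"))  # → 0.333...

# ...but a Chinese homophone typo (在 for 再, both "zai") shares nothing:
print(overlap("在", "再", n=1))  # → 0.0
```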

We found a lot of gain from ranking features based on Pinyin, which let us detect typos/misspellings caused by homophones (and similar-sounding words). I also investigated stroke decomposition to try to detect near-homographs, but couldn't find any good libraries for it at the time.
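The Pinyin idea can be sketched in a few lines. This is a toy version, not the system described above: the character-to-pinyin table here is a tiny hand-built sample, where a real implementation would use a full dictionary (e.g. the data behind a library like pypinyin).

```python
# Toy Pinyin-based homophone check. The table below is a hand-built
# sample for illustration; a real system needs a full pinyin dictionary.
PINYIN = {
    "在": "zai", "再": "zai",   # classic homophone typo pair
    "他": "ta",  "她": "ta",
    "见": "jian",
}

def to_pinyin(text):
    """Map each character to its toneless pinyin; None if unknown."""
    return [PINYIN.get(ch) for ch in text]

def likely_homophone_typo(query, candidate):
    """True if two strings differ in characters but share a pinyin reading."""
    if query == candidate:
        return False
    py_q, py_c = to_pinyin(query), to_pinyin(candidate)
    return None not in py_q and py_q == py_c

# "再见" (goodbye) mistyped as "在见": same sounds, different characters.
print(likely_homophone_typo("在见", "再见"))  # → True
```

As a ranking feature, a match like this lets the engine treat the misspelled query as a soft match for documents containing the intended spelling.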

I could imagine the homophone issue is especially relevant for spoken input to LLMs. LLMs are good enough that they're usually right, so it's probably less of an issue, but in English I can make crazy typos and everything just works. I'm curious how well that holds up for Chinese, since I suspect it's a far harder problem due to the lack of subword tokens.