Comment by wahnfrieden

2 years ago

MeCab with ipadic, plus a lot of custom Swift logic for fixing problem patterns and matching tokens to JMdict entries as an additional heuristic check that the stemming/tokenization was done right. I’m also using a custom-generated JLPT classification (a more complete guess at the full set of JLPT vocab, based on ebook word frequency) to choose more likely candidate results. I haven’t improved this in a couple of years; it’s one of my upcoming focuses now that I have the app rewritten and out.
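
To make the idea concrete, here is a rough sketch of that kind of candidate scoring. It is written in Python for brevity, not the actual Swift code, and everything in it is an assumption for illustration: the `Token` type, the toy `DICTIONARY` and `JLPT_VOCAB` sets (stand-ins for JMdict lemmas and a generated JLPT list), and the score weights. It does not call MeCab; it only shows how dictionary matches and JLPT membership could rank competing analyses of the same text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str  # text as it appears in the sentence
    lemma: str    # dictionary form proposed by the tokenizer


# Toy stand-ins (hypothetical data, not real JMdict or JLPT lists).
DICTIONARY = {"食べる", "た", "日本語", "を"}
JLPT_VOCAB = {"食べる", "日本語"}


def score(candidate):
    """Average per-token score: reward dictionary hits, penalize misses."""
    if not candidate:
        return float("-inf")
    total = 0.0
    for tok in candidate:
        if tok.lemma in DICTIONARY:
            total += 1.0          # lemma matches a dictionary entry
            if tok.lemma in JLPT_VOCAB:
                total += 0.5      # extra weight for likely JLPT vocab
        else:
            total -= 1.0          # lemma not found: probably a bad stem
    return total / len(candidate)


def pick_best(candidates):
    """Return the candidate analysis with the highest score."""
    return max(candidates, key=score)


# Two competing analyses of 食べた: one with a bogus lemma, one correct.
bad = (Token("食べ", "食ぶ"), Token("た", "た"))
good = (Token("食べ", "食べる"), Token("た", "た"))
best = pick_best([bad, good])
```

In practice the candidates would come from the tokenizer's alternative outputs, and the weights would need tuning against real text.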

UniDic is also interesting, but it’s harder to use and the data size is huge.

I’m going to be layering GPT on top to improve it further.

What're you working on?