Comment by yorwba
2 years ago
> We should implement a real Chinese lemmatizer there to chunk the words.
Or find all substrings that are listed in a dictionary (≈everyone uses cc-cedict https://www.mdbg.net/chinese/dictionary?page=cc-cedict ) and give translations for all of them. That way, the user won't be limited to any particular chunking granularity, which is always a finicky aspect of word segmenters to fine-tune.
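A minimal sketch of that lookup, assuming the dictionary has already been loaded into a plain mapping from headword to gloss. The function name `find_dictionary_substrings` and the toy entries are made up for illustration; they are not real CC-CEDICT data, and parsing the actual file is left out.

```python
def find_dictionary_substrings(text, dictionary, max_len=None):
    """Return (start, word, gloss) for every substring of text found in dictionary.

    Overlapping and nested matches are all kept, so no single segmentation
    is imposed on the reader.
    """
    if max_len is None:
        # No substring longer than the longest headword can match.
        max_len = max((len(word) for word in dictionary), default=0)

    matches = []
    for start in range(len(text)):
        longest_end = min(len(text), start + max_len)
        for end in range(start + 1, longest_end + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                matches.append((start, candidate, dictionary[candidate]))
    return matches


# Toy entries for illustration only (hypothetical glosses, not CC-CEDICT).
toy_dictionary = {
    "中": "middle",
    "国": "country",
    "中国": "China",
    "人": "person",
    "国人": "compatriot",
    "中国人": "Chinese person",
}

if __name__ == "__main__":
    for start, word, gloss in find_dictionary_substrings("中国人", toy_dictionary):
        print(start, word, gloss)
```

This is quadratic in the worst case but bounded by the longest headword, which is short in practice. A trie or Aho-Corasick automaton would be the usual choice if lookup speed matters.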