Comment by User23

3 years ago

I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.

2 comments

User23

It is. The description on the about page is a little simplified, but basically I look at the most common word and character n-grams of sizes 1, 2, and 3 (200 each), put all the frequencies in an array, and then compare that array to all the other users' with https://scikit-learn.org/stable/modules/generated/sklearn.me....
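
For anyone who wants to code that up and compare, here is a rough sketch of the idea using only the standard library. It is not the site's actual code. The link above is truncated, so the real scikit-learn function is unknown; cosine similarity is an assumption here. The names `ngrams`, `profile`, and `cosine` are hypothetical, only character n-grams are shown (word n-grams would work the same way on a list of tokens), and the top 200 are picked per text rather than across all users.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n):
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, sizes=(1, 2, 3), top=200):
    """Relative frequencies of the `top` most common n-grams per size."""
    freqs = {}
    for n in sizes:
        counts = Counter(ngrams(text.lower(), n))
        total = sum(counts.values())
        for gram, count in counts.most_common(top):
            freqs[(n, gram)] = count / total
    return freqs


def cosine(a, b):
    """Cosine similarity between two sparse profiles (assumed metric)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Comparing one user to everyone else would then be a matter of calling `cosine(profile(a), profile(b))` for each pair and sorting by score.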

User23 3 years ago

Cool, I only skimmed the description; maybe I needed to read it more carefully.
Have you considered doing rune rather than word n-grams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long, long ago in C for automatic document language detection. It was quite accurate.
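
A minimal sketch of what rune-level language detection can look like, in Python rather than C and not a reconstruction of the original program. It assumes character trigrams and a simple shared-count score; `trigram_counts`, `detect_language`, and the `samples` mapping are hypothetical names, and the reference samples are assumed to be of roughly similar length.

```python
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def detect_language(text, samples):
    """Return the label whose sample shares the most trigrams with text."""
    target = trigram_counts(text)
    best_label, best_score = None, -1
    for label, sample in samples.items():
        reference = trigram_counts(sample)
        # Counter & Counter keeps the minimum count of each shared trigram.
        score = sum((target & reference).values())
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Usage would be something like `detect_language(doc, {"en": english_sample, "de": german_sample})`, with a longer sample text per language giving more reliable profiles.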