Comment by User23

3 years ago

I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.

It is. The description on the about page is a little simplified, but basically I take the most common word and character n-grams of sizes 1, 2, and 3 (200 each), put all the frequencies in an array, and then compare it to every other user's array with https://scikit-learn.org/stable/modules/generated/sklearn.me....
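For anyone who wants to code it up and compare, here is a minimal sketch of that kind of pipeline in plain Python. Several things in it are assumptions rather than the site's actual method: the link above is truncated, so cosine similarity is only a guess at the comparison function; the top 200 n-grams are picked per text rather than over the whole corpus; and the helper names (`word_ngrams`, `char_ngrams`, `profile`, `cosine_similarity`) are made up for illustration.

```python
from collections import Counter
import math


def word_ngrams(text, n):
    """Return all runs of n consecutive lowercase words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return all runs of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, sizes=(1, 2, 3), top=200):
    """Relative frequencies of the `top` most common word and character
    n-grams for each size, stored sparsely as a dict.

    Assumption: the top n-grams are chosen per text, not corpus-wide.
    """
    features = {}
    for kind, extract in (("w", word_ngrams), ("c", char_ngrams)):
        for n in sizes:
            grams = extract(text, n)
            total = len(grams) or 1  # avoid dividing by zero on short texts
            for gram, count in Counter(grams).most_common(top):
                features[(kind, n, gram)] = count / total
    return features


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors (assumed metric)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

To rank candidates, you would build one `profile` per user from all of their comments and sort the other users by `cosine_similarity` against the target profile. A dict is used instead of a fixed-length array only to keep the sketch short.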

  • Cool. I only skimmed the description; maybe I needed to read it more carefully.

    Have you considered doing rune n-grams rather than word n-grams? I can imagine that might be prohibitively expensive, but I really don't know. I did something like that a long time ago in C for automatic document language detection. It was quite accurate.
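A rough idea of what rune-level language detection can look like, sketched in Python rather than C. This is not the original program: the trigram size, the cosine scoring, the two toy reference sentences, and the names `rune_trigrams`, `SAMPLES`, and `detect_language` are all assumptions for illustration, and a real detector would train on far more text.

```python
from collections import Counter
import math


def rune_trigrams(text):
    """Count every run of three consecutive characters (runes)."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count tables."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Toy reference texts (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und dann schläft der hund in der sonne",
}
PROFILES = {lang: rune_trigrams(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language whose trigram profile is closest to the text."""
    query = rune_trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

Python strings are sequences of code points, so slicing already works per rune; in C the same idea needs explicit UTF-8 decoding first.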