Comment by sillysaurusx
3 years ago
Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.
The most interesting thing is that my writing style has changed pretty drastically over the past decade. Searching for my oldest account matched my earliest usernames, whereas searching for this account matched the rest.
The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.
Woof.
I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor to social media. Any time my account gathers enough upvotes, I destroy it for another.
I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.
Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.
This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.
Exact same thing happened to me. Wild.
On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment), so seeing the list of users that are most similar to me was interesting. I didn't see any stark similarities based on a quick peek at their comments, but it was interesting.
Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!
FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.
What does the bolding indicate?
Frankly similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)
https://news.ycombinator.com/item?id=17944293
The approach I took was a bit different, but also no ML required.
The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.
It’s a very small space to try to compare so simple methods will work fine.
Exactly. HN emphasizes long-form posts much more than other forums, which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid-tier gaming laptop, so it's even easier. I was trying to think of applications of this kind of data, and the only thing I could think of was moderation tools/detecting ban evaders, but what you've done seems much more profitable lol.
It works like a charm for me too.
I put in my username and found my pre-echelon alt, possibilistic.
(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)
I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.
It is. The description on the about page is a little simplified, but basically I look at the most common word and character n-grams of size 1, 2, 3 (200 each), put all the frequencies in an array, and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
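For anyone who wants to play with the idea, here is a rough sketch of the n-gram approach described above, in plain Python. It is not the site's actual code. The link to the comparison function is truncated, so cosine similarity here is an assumption, as are the tokenizer regex, the per-group frequency normalization, and every function name (`ngrams`, `count_features`, `build_vocab`, `vectorize`, `cosine`, `rank_similar`).

```python
import math
import re
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence (list or string) as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def count_features(text, sizes=(1, 2, 3)):
    """Count word and character n-grams, grouped by (kind, size)."""
    text = text.lower()
    words = re.findall(r"[a-z']+", text)  # assumed tokenizer, not from the site
    groups = {}
    for n in sizes:
        groups[("word", n)] = Counter(ngrams(words, n))
        groups[("char", n)] = Counter(ngrams(text, n))
    return groups


def build_vocab(corpus, top_k=200):
    """Pick the top_k most common n-grams per (kind, size) across all users."""
    totals = {}
    for text in corpus.values():
        for key, counter in count_features(text).items():
            totals.setdefault(key, Counter()).update(counter)
    return [(key, gram)
            for key in sorted(totals)
            for gram, _ in totals[key].most_common(top_k)]


def vectorize(text, vocab):
    """Relative frequency of each vocabulary n-gram within its own group."""
    groups = count_features(text)
    group_totals = {key: sum(c.values()) or 1 for key, c in groups.items()}
    return [groups[key][gram] / group_totals[key] for key, gram in vocab]


def cosine(a, b):
    """Cosine similarity (assumed metric); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(target, corpus, top_k=200):
    """Return the other usernames, ordered from most to least similar."""
    vocab = build_vocab(corpus, top_k)
    vectors = {user: vectorize(text, vocab) for user, text in corpus.items()}
    scores = [(cosine(vectors[target], vec), user)
              for user, vec in vectors.items() if user != target]
    return [user for _, user in sorted(scores, reverse=True)]
```

Here `corpus` is a dict mapping each username to all of that user's comments concatenated into one string, and `rank_similar("some_user", corpus)` returns the remaining usernames sorted by similarity. On real data you would probably want NumPy arrays rather than Python lists.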
Cool, I only skimmed the description; maybe I needed to read it more carefully.
Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.
sillysaurus3 was in mine. :) Clearly we're not the same.
> sillysaurus3
> sillysaurus2
Tbf a human could have found a bunch of them relatively easily