Comment by sillysaurusx

3 years ago

Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.

The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.

The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.

31 comments

hnburnerUixoHr5 3 years ago

Woof.

I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor in social media. Any time my account gathers enough upvotes, I destroy it for another.

I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.

Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.

butterNaN 3 years ago

This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.

hailwren 3 years ago

Exact same thing happened to me. Wild.

dimmke 3 years ago

On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment), so seeing the list of users that are most similar to me was interesting. I didn't see any stark similarities based on a quick peek at their comments, but it was interesting.

costco 3 years ago

Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!

sillysaurusx 3 years ago

FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.

justusthane 3 years ago

What does the bolding indicate?

lettergram 3 years ago

Frankly, similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)

https://news.ycombinator.com/item?id=17944293

The approach I took was a bit different, but also no ML required.

The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning they post a few times a year), maybe 200k if you count at least one post a year. But fewer than 10k post weekly.

It’s a very small space to try to compare so simple methods will work fine.

costco 3 years ago

Exactly. HN emphasizes long-form posts much more than other forums which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid tier gaming laptop so it's even easier. I was trying to think of applications of this kind of data and the only thing I could think of was moderation tools/detecting ban evaders but what you've done seems much more profitable lol.

echelon 3 years ago

It works like a charm for me too.

I put in my username and found my pre-echelon alt, possibilistic.

(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)

User23 3 years ago

I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.

costco 3 years ago

It is. The description on the about page is a little simplified, but basically I look at the most common word and character ngrams of size 1, 2, and 3 (200 each), put all the frequencies in an array, and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
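
For anyone who wants to try the idea, here is a minimal sketch of the approach described above, not the site's actual code. Several things are assumptions: the comparison metric (the scikit-learn link is truncated, so cosine similarity is a guess), the tokenizer regex, the choice to pick the top 200 n-grams from the pooled corpus, and every function name. It uses only the standard library.

```python
import math
import re
from collections import Counter

# Six feature families: word and character n-grams of size 1, 2, 3.
KINDS = [(unit, n) for unit in ("word", "char") for n in (1, 2, 3)]


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def count_features(text):
    """Count word and character n-grams for one user's combined comments."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)  # assumed tokenizer
    chars = list(lowered)
    tokens = {"word": words, "char": chars}
    return {(unit, n): Counter(ngrams(tokens[unit], n)) for unit, n in KINDS}


def build_vocabulary(all_counts, top_k=200):
    """Pick the top_k most common n-grams of each family across all users."""
    vocab = []
    for kind in KINDS:
        pooled = Counter()
        for user_counts in all_counts.values():
            pooled.update(user_counts[kind])
        vocab.extend((kind, gram) for gram, _ in pooled.most_common(top_k))
    return vocab


def to_vector(user_counts, vocab):
    """Relative frequency of each vocabulary n-gram in one user's text."""
    totals = {kind: sum(c.values()) for kind, c in user_counts.items()}
    return [
        user_counts[kind][gram] / totals[kind] if totals[kind] else 0.0
        for kind, gram in vocab
    ]


def cosine(a, b):
    """Cosine similarity (assumed metric); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(target, corpus, top_k=200):
    """Rank every other user by similarity to target, most similar first."""
    all_counts = {user: count_features(text) for user, text in corpus.items()}
    vocab = build_vocabulary(all_counts, top_k)
    vectors = {user: to_vector(c, vocab) for user, c in all_counts.items()}
    scores = [
        (cosine(vectors[target], vec), user)
        for user, vec in vectors.items()
        if user != target
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Hypothetical usernames and text, purely for illustration.
    corpus = {
        "main_account": "FWIW, I think the details are fascinating, mostly because it is so simple.",
        "old_account": "FWIW, I think the algorithm is fascinating, mostly because it is simple.",
        "someone_else": "Benchmarks or it didn't happen. Show numbers, please!",
    }
    for score, user in rank_similar("main_account", corpus):
        print(f"{user}: {score:.3f}")
```

With only a few sentences per user the scores are noisy; the thread's point is that this works because active HN accounts have a lot of long-form text to count.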

User23 3 years ago

Cool, I only skimmed the description; maybe I needed to read it more carefully.

Have you considered doing rune rather than word ngrams? I can imagine that might be prohibitively expensive, but I really don’t know. I did something like that long long ago in C for automatic document language detection. It was quite accurate.

bb88 3 years ago

sillysaurus3 was in mine. :) Clearly we're not the same.

FormerBandmate 3 years ago

> sillysaurus3

> sillysaurus2

Tbf a human could have found a bunch of them relatively easily