
Comment by jsnell

3 years ago

After a few tries on boring accounts, I thought to try the account of somebody who was notorious for an incident outside of HN, and had a (deservedly) bad time at HN for a couple of years before the account went dark.

And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).

Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.

I keep no alternate accounts, but the best matches this tool reports for me appear to be Slavic, or just Russian, and I am Russian. The best match score in my list is just above 0.5. There are some clearly alternate accounts on the list; their match scores with this tool are well above 0.7.

It is probable that people of the same cultural origin will have a similar writing style and vocabulary. It is also probable that people of the same cultural origin relate to the world in similar ways: they like the same things and dislike the same other things.

So, in my opinion, it is possible that you have found not only alternate accounts (scores above 0.7) but also accounts of people with the same cultural origin (those around 0.6).

  • My highest was 0.41 and the person writes nothing like me. I guess I'm a unique snowflake after all.

    • I was curious about this, my highest match was 0.47 and I have no alts, maybe I'm also a unique snowflake, or haven't said anything noteworthy enough to have been deepfaked yet ;).

    • I have a few in the low 0.5's and, honestly, they seem cool and I want to meet them.

  • I don't have any alternate accounts here either, and my writing style is apparently nearly the same as that of a high-profile account that I recognize and that has many points. I wouldn't say this is a highly accurate thing.

  • There are 19 other accounts this tool finds similar to mine. Those are not my accounts. The scores range from 0.46 to 0.56.

    • I think people are sort of confused about what this tool is supposed to be, which I will concede is partially my fault. The results of this tool are not by themselves indicative of having an alternative account. It generates the 20 most similar users for every single user on the site, regardless of whether they have an alt or not (there's obviously no way for me to know that for every single user). In your case, further investigation would reveal that none of those accounts are yours.


    • Fwiw, and as gp mentioned, > 0.7 seems more likely to be alt territory.

    • You are fools, one and all! This tool's only purpose is to tag people who use it!

      Now they know just who cares about which alternate accounts. They know!

      They freaking know, man!

      You have all fallen for their ploy. Fools!


.6 is high confidence? I did my own username, wondering what it would return, since I know I don’t have any alt accounts. The top results are in the .6-.7 range. If they aren’t alt accounts, is it just coincidence that we have similar writing styles?

  • I think so.

    A funny thought — my “matches” cap out at around .56. Having false positives* in a tool like this might feel like a “bad result” but actually I think it just means that if someone were running this sort of tool across the whole internet, I’d be relatively easy to correlate, while your identity would be intermingled with your .6-.7 partners.

    *actually they aren’t really even false positives because the tool doesn’t promise to detect alts in the first place, just find similar styles.

> but this functionality turned out way creepier than I thought the moment I tried it

Hopefully this raised awareness means that people who actually need anonymity will be more likely to know to take precautions.

  • Genuinely asking, what way is there to combat this? Is there a tool that takes out stylistic elements of your comment?

    • This is the million-dollar question. I think the goal of "anonymity for most intents and purposes" is worthy; it's how I've enjoyed HN and Reddit. But I also knew that it was just a matter of time before stylometry and other meta-analysis of post history became 10-second tools for everyone. Now the cat is out of the bag.

      I've been thinking about this a bit, and I've landed on the view that having a stable identifier across ALL comments & posts is a poor default. We still probably want some coherence, at minimum within a thread, e.g. to follow a back-and-forth. The site itself may also use a stable identifier for abuse prevention. But there's no reason one should have the same username externally traceable for posts about completely different topics.

      In practice, this could be done with low-friction pseudonym creation, with all pseudonyms tied to the same account privately.

    • One way would be to run such a tool before posting and then, based on the results, tweak the post and repeat until the similarities are not statistically significant. Or, instead of tweaking, start posting under a new throwaway account. But this won't save you when some new way to analyze style appears in the future. Moreover, there are other types of metadata, such as timestamps, that can be taken into account to narrow down the search space a bit. And obviously, the more you write, the harder it is to control these things.
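The per-thread pseudonym idea raised in this sub-thread could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `thread_pseudonym`, `site_secret`, `account_id` and `thread_id` are hypothetical, and nothing in the thread says any site works this way. The sketch derives a display name from a keyed hash (HMAC-SHA256), so the site can still tie every pseudonym back to one account privately while outsiders see unrelated names in different threads.

```python
import hashlib
import hmac


def thread_pseudonym(site_secret: bytes, account_id: str, thread_id: str,
                     length: int = 10) -> str:
    """Derive a display name that is stable within one thread only.

    All parameter names are hypothetical. The secret key stays on the
    server, so outsiders cannot recompute or link the pseudonyms.
    """
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    message = f"{account_id}\x00{thread_id}".encode("utf-8")
    digest = hmac.new(site_secret, message, hashlib.sha256).hexdigest()
    return "anon-" + digest[:length]


if __name__ == "__main__":
    secret = b"example-secret-kept-on-the-server"  # assumption: server-side key
    # Same account, same thread: the name is stable, so a back-and-forth
    # can still be followed.
    print(thread_pseudonym(secret, "account-1", "thread-100"))
    # Same account, different thread: a different, unlinkable-looking name.
    print(thread_pseudonym(secret, "account-1", "thread-200"))
```

As the comment above points out, this only removes the stable username; writing style, timestamps and other metadata could still link the posts.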

0.6 isn't much. I have 3 matches above 0.6, and they're not me. 20 or so over 0.5.

  • I get one 0.68 match, which... fair enough. It is an account I abandoned some years ago, no secrets there.

    No other hits above 0.5, so I guess that either makes me pretty unique as a commentator or my English is broken in a unique way.

  • That's why you manually evaluate the matches. And like I wrote in that comment, I did that manual eval, and these clearly are alts of that main account, not spurious. Narrowing down the pool of accounts you'd need to do this kind of manual evals for by a factor of 100000 is a pretty significant change in capabilities.

Could you elaborate on why it's obvious why you won't name the account?

  • Maybe to avoid attracting any extra attention to this user? Also, as someone who’s read HN for a few years, it only took me 2 guesses to find an account that the above comment describes (and not necessarily the same person).

    • It was a classy move by jsnell, too. Thank you.

      (I don’t know who the comment is talking about, which is how it should be. There’s no need to blow someone’s cover in a highly visible way. Even if they were satan, they’d still be welcome on HN as long as they’re writing substantive, interesting comments that follow the guidelines.)


  • They obviously don't want it to be known, seeing as they've got alts to post under and avoid going into too much detail. Being able to go out and do your own research is different than posting the information open for everyone to see at a glance.

    I would say it's obvious why one might respect that wish (do unto others...), but I'm also aware that my and my culture's sense of privacy goes further than many others'.

MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The top one seems an odd one out in that case?

    • Usernames aren't random enough to be safe as a simple MD5. Perhaps with a strong bcrypt, but similar to PIN codes, it might be better to give partial information like "is the second character an ...", assuming nobody else made similar statements. Or give the first ~two hex characters of the hash, so that it would match 1/16² (1 in 256) of the usernames. I'm sure there's also a clever way to do a zero-knowledge proof here, probably something with Diffie-Hellman using the name as your random integer or something, but I'm too sick to think about this stuff right now. Privately sharing data publicly is hard.

    • Another problem is that it's a small set. If you had a list of all HN users, you could compute md5 for all of them in seconds.

    • I think the intention of the post not mentioning the handle was just to prevent old discussions from flaring up or so? The post doesn't really contain any new information on the person that would be worth obscuring. So I just thought I'd hash it to prevent that. But it seems I actually screwed up the hashing so I will leave it at that.
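The point made in this sub-thread, that an unsalted MD5 of a username can be reversed by hashing every known username, can be illustrated with a minimal sketch. The usernames below are made up for the example, and the helper names `md5_hex`, `reverse_lookup` and `prefix_matches` are hypothetical. The second helper shows the "publish only the first two hex characters" suggestion, which would be expected to match roughly 1 in 256 names in a large list.

```python
import hashlib


def md5_hex(name: str) -> str:
    """Hex MD5 digest of a username (unsalted, as discussed above)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def reverse_lookup(target_hash: str, candidates):
    """Return every candidate username whose MD5 equals target_hash."""
    return [name for name in candidates if md5_hex(name) == target_hash]


def prefix_matches(hex_prefix: str, candidates):
    """Return every candidate whose MD5 starts with the given hex prefix."""
    return [name for name in candidates if md5_hex(name).startswith(hex_prefix)]


if __name__ == "__main__":
    # Made-up usernames standing in for a full list of site users.
    known_users = ["alice", "bob", "carol", "dave"]

    published = md5_hex("carol")          # what a commenter might post
    print(reverse_lookup(published, known_users))      # the full hash identifies the user
    print(prefix_matches(published[:2], known_users))  # a short prefix only narrows the set
```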

> quick browse of the comments of the recently active ones, they look really likely to be alts.

Hmm isn't a spot check of comments somewhat tautological, since that is how the tool identifies alts (rather than something like IP address or time of day)? If this had been promoted as "find accounts with similar writing style to yours" would people immediately assume alts?

  • I would presume that OP is referring to the actual content of the comments. This just does stylometric analysis, which looks at word choice, but not what the arrangement of the words means.

    If some accounts are found to be stylometrically similar, and then a visual inspection also shows them all stating similar opinions, that latter piece of data is a strong signal.
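For readers unfamiliar with stylometry, a toy version of "similar word choice" scoring might look like the sketch below. It is not the tool's actual method, which the thread does not describe: the function-word list, the cosine-similarity measure and all names here are assumptions chosen for illustration. The default of 20 results per user mirrors the author's description of the tool's output.

```python
import math
import re
from collections import Counter

# Assumption: a tiny list of common function words; real stylometry
# uses far larger feature sets.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that",
                  "it", "is", "i", "but", "not", "just", "so"]


def style_vector(text: str):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(target: str, corpus: dict, k: int = 20):
    """Rank every other user in corpus by style similarity to target."""
    target_vec = style_vector(corpus[target])
    scores = [(user, cosine_similarity(target_vec, style_vector(text)))
              for user, text in corpus.items() if user != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]
```

A ranking like this says nothing about what the comments mean, which is why reading the matched accounts' comments, as described above, adds an independent signal.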