Show HN: Using stylometry to find HN users with alternate accounts
3 years ago (stylometry.net)
Author here. This site lets you put in a username and get the users with the most similar writing style to that user. It confirmed several users who I suspected were alts and after informally asking around has identified abandoned accounts of people I know from many years ago. I made this site mostly to show how easy this is and how it can erode online privacy. If some guy with a little bit of Python, and $8 to rent a decent dedicated server for a day can make this, imagine what a company with millions of dollars and a couple dozen PhD linguists could do.
Here's Paul Graham:
https://stylometry.net/user?username=pg
Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)
Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.
The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.
The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.
Woof.
I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor to social media. Any time my account gathers enough upvotes I destroy it for another.
I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.
Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.
This makes me melancholic. One should be able to express themselves without the overhead of privacy concerns.
Exact same thing happened to me. Wild.
On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment) so seeing the list of users that are most similar to me was interesting. I didn't see some stark similarities based on a quick peek at their comments, but it was interesting.
Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!
FWIW, top 20 was necessary for mine. The bolding was a brilliant move. Several of my accounts were ranked 10-20, but popped out due to the bolding.
Frankly similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)
https://news.ycombinator.com/item?id=17944293
The approach I took was a bit different, but also no ML required.
The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.
It’s a very small space to try to compare so simple methods will work fine.
Exactly. HN emphasizes long-form posts much more than other forums which makes the commenters here very susceptible to this kind of analysis. Plus you can fit every single HN comment in RAM on a mid tier gaming laptop so it's even easier. I was trying to think of applications of this kind of data and the only thing I could think of was moderation tools/detecting ban evaders but what you've done seems much more profitable lol.
It works like a charm for me too.
I put in my username and found my pre-echelon alt, possibilistic.
(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)
I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.
It is. The description on the about page is a little simplified, but basically I look at the most common word and character n-grams of size 1, 2 and 3 (200 each), put all the frequencies in an array and then compare to all the other users with https://scikit-learn.org/stable/modules/generated/sklearn.me....
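For readers who want to try this, here is a minimal sketch of the approach described above, written in plain Python rather than scikit-learn. It assumes the comparison is a cosine similarity (the link is truncated, but another commenter in this thread describes the method that way). The function names, the whitespace tokenizer and the tiny sample texts are all made up for illustration, and the real site may differ in every detail.

```python
import math
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def units_of(text, by_char):
    """Split a text into characters or lowercased whitespace-separated words."""
    return list(text) if by_char else text.lower().split()


def top_ngrams(texts, n, by_char, k=200):
    """The k most common n-grams across all texts (the shared vocabulary)."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(units_of(text, by_char), n))
    return [gram for gram, _ in counts.most_common(k)]


def feature_vector(text, vocab, n, by_char):
    """Relative frequency of each vocabulary n-gram in one user's text."""
    grams = ngrams(units_of(text, by_char), n)
    counts = Counter(grams)
    total = len(grams) or 1
    return [counts[g] / total for g in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_vectors(user_texts, k=200):
    """Concatenate word and character 1-, 2- and 3-gram frequencies per user."""
    vectors = {user: [] for user in user_texts}
    for by_char in (False, True):
        for n in (1, 2, 3):
            vocab = top_ngrams(user_texts.values(), n, by_char, k)
            for user, text in user_texts.items():
                vectors[user].extend(feature_vector(text, vocab, n, by_char))
    return vectors


# Hypothetical toy corpus: one concatenated comment history per user.
user_texts = {
    "alice": "I think this is probably fine, but I could be wrong about it.",
    "alice_alt": "I think that is probably fine, but I could be wrong about that.",
    "bob": "lol no. total garbage!!! nobody wants this at all",
}
vectors = build_vectors(user_texts)
for other in ("alice_alt", "bob"):
    print(other, round(cosine(vectors["alice"], vectors[other]), 3))
```

On a corpus this small the scores mean little; the point is only the shape of the pipeline: a shared vocabulary of frequent n-grams, one frequency vector per user, and a pairwise similarity.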
sillysaurus3 was in mine. :) Clearly we're not the same.
> sillysaurus3
> sillysaurus2
Tbf a human could have found a bunch of them relatively easily
The method used, i.e. to calculate the cosine of the two authors' word vectors, is poorly suited for stylometric analysis because it is based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.
Also, the cosine of the vectors of word frequencies conflates author-specific vocabulary and topics; in other words, my account is grouped (with >51% similarity, according to the demo) with someone probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another topic, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).
Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).
This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...
> based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.
Interesting. I was expecting to be grouped with other Russian speakers and I am (based on some nicknames). But I thought the most telling feature will be exactly word order - it’s absolutely relaxed in Russian. Word frequencies? Well, probably the absence of articles, lol (but I swear to God that I often spend some extra time trying to insert as many articles in my texts as I could).
There’s https://en.wikipedia.org/wiki/Idiolect :
”Language consists of sentence constructs, choice of words, and expression of style. Accordingly, an idiolect is an individual's personal use of these facets. Every person has a unique idiolect influenced by their language, socioeconomic status, and geographical location.”
In practice a more complex approach will tend to require a greater amount of data per user, so in this specific case this simple approach is not too bad. Moreover, fake accounts are likely to talk about the same topics, so while this leads to false positives, it also makes it more likely that we find actual duplicates in the list.
Ha, gruseom shows up for pg, which is dang’s old account. A worthy successor.
This is a fascinating way to find similar HN users who aren’t the same person. It’s a surprisingly great recommendation engine. “If you like pg, you might also like…”
Sure, the privacy concerns are valid, but the cat’s out of the boot. Might as well enjoy the benefits.
montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567
Nicely done. One of the best hacks I’ve seen in a long time.
> montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567
I had this hunch too. It's either pg or someone trying really hard to be pg.
I mean, this is HN -
> someone trying really hard to be pg
describes half the site.
> Someone who talks about ancient history, Occam’s razor, VCs and startups,
I think these are all common topics among HN readers and commenters.
Why would montrose be pg ? The correlation is not that high. Looks like a few people have picked up pg's mannerisms.
Yeah, that score is only slightly higher than the highest one it shows for my account (which is also bold) - and unless my alter ego has been disguised so well it even managed to hide from myself, I'm pretty sure that isn't me :)
There are factors that make me think it is more likely than not (just scrolled through the comment history, don't feel like linking everything) that he is pg.
- Is bolded on pg's page
- Mentions yoga
- Talks about Lisp often
- Talks about YC often
- Talks about kids
- Links to Paul Graham's website
- Says he uses vi
- Writes exactly like you would expect pg to write
> but the cat’s out of the boot
It's my first time hearing that variant. Usually it's "the cat's out of the bag" where I'm from.
Do you mean boot in the UK sense, what Americans would call the trunk of a car? Or do you mean a sturdy piece of footwear?
Obligatory xkcd https://xkcd.com/2390/
It’s a little writing trick I learned from (I think) Orwell. Any time you’re about to use a common metaphor, try to tweak it. You’ll catch readers off guard, which piques their curiosity.
It’s a fun game, too. I wish I’d used “the cat’s out of the hat,” but I didn’t think of it till later.
There's a popular movie called "Puss in Boots". That's what I had to think of first.
This is somewhat similar to how they ended up catching the Unabomber. The FBI were literally at a dead end. They ended up posting one of his letters/manifestos in the paper; somebody recognised an unusual turn of phrase the Unabomber used and reported it as possibly being their brother; the FBI investigated the lead and it led them straight to him.
Excerpts from wiki:
> Before the publication of Industrial Society and Its Future, Kaczynski's brother, David, was encouraged by his wife to follow up on suspicions that Ted was the Unabomber.[91] David was dismissive at first, but he took the likelihood more seriously after reading the manifesto a week after it was published in September 1995. He searched through old family papers and found letters dating to the 1970s that Ted had sent to newspapers to protest the abuses of technology using phrasing similar to that in the manifesto.[92]
> In early 1996, an investigator working with Bisceglie contacted former FBI hostage negotiator and criminal profiler Clinton R. Van Zandt. Bisceglie asked him to compare the manifesto to typewritten copies of handwritten letters David had received from his brother. Van Zandt's initial analysis determined that there was better than a 60 percent chance that the same person had written the manifesto, which had been in public circulation for half a year. Van Zandt's second analytical team determined a higher likelihood. He recommended Bisceglie's client contact the FBI immediately.[96]
> In February 1996, Bisceglie gave a copy of the 1971 essay written by Ted Kaczynski to Molly Flynn at the FBI.[87] She forwarded the essay to the San Francisco-based task force. FBI profiler James R. Fitzgerald[98][99] recognized similarities in the writings using linguistic analysis and determined that the author of the essays and the manifesto was almost certainly the same person. Combined with facts gleaned from the bombings and Kaczynski's life, the analysis provided the basis for an affidavit signed by Terry Turchie, the head of the entire investigation, in support of the application for a search warrant.[87]
https://en.m.wikipedia.org/wiki/Ted_Kaczynski
As I recall, one of the clinchers was his use of the phrase, "you can’t eat your cake and have it too" as opposed to the now-predominant variant "you can’t have your cake and eat it too."
I often wonder if stylometry can be used to positively identify a person based not on general word frequency, but by a single phrase or two which are rare in general but commonly used by the individual. In theory this could be relatively easy to find given a large corpus. You'd pick out the top few n-grams for short phrases by an individual and identify the ones which are most overly-represented compared to the rest of the population.
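The idea in the last paragraph can be sketched roughly as follows: count short word n-grams for one author and for everyone else, then rank the author's phrases by how over-represented they are. The smoothing, the minimum count and the sample sentences are arbitrary choices for illustration, not a tested method.

```python
from collections import Counter


def word_ngrams(text, n):
    """Lowercased word n-grams of a text, joined back into phrases."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def signature_phrases(author_text, population_texts, n=3, min_count=2, top=5):
    """Rank an author's n-grams by their rate relative to the population's rate."""
    author = Counter(word_ngrams(author_text, n))
    population = Counter()
    for text in population_texts:
        population.update(word_ngrams(text, n))

    author_total = sum(author.values()) or 1
    population_total = sum(population.values()) or 1
    scores = {}
    for phrase, count in author.items():
        if count < min_count:
            continue  # a phrase used only once is not yet a habit
        author_rate = count / author_total
        # Add-one smoothing so phrases unseen in the population do not divide by zero.
        population_rate = (population[phrase] + 1) / (population_total + 1)
        scores[phrase] = author_rate / population_rate
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Hypothetical sample data.
author_text = (
    "you can't eat your cake and have it too . "
    "as I said , you can't eat your cake and have it too . "
    "I think that is the point"
)
population_texts = [
    "you can't have your cake and eat it too",
    "I think that is the point of the whole thing",
    "I think that is a good idea",
]
print(signature_phrases(author_text, population_texts))
```

With a real corpus the population counts would come from every other user's comments, and the minimum count would need to be much higher to avoid flagging one-off phrases.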
It was actually his brother.
So is the lesson you should have GPT rewrite your manifesto so as to obscure your personal idioms?
Or something purpose-built like Anonymouth (https://github.com/psal/anonymouth), although it seems to be both unique and dead.
Also interesting:
> Ross Ulbricht aka Dread Pirate Roberts, the mastermind behind the infamous Silk Road site which served as a black market for drugs, weapons and fake documents was also well aware of the potential danger of stylometry being used against him. At the time of his arrest in a San Francisco public library, the FBI captured images of his laptop screen as evidence. Guess what he had bookmarked — “Science of Stylometry.”
https://medium.com/svilenk/the-case-for-anonymity-12db114f0c...
Only if you have a history of sending crazed writings/manifestos to newspapers and family.
The show “Manhunt: Unabomber” (Netflix) shows this whole story very well.
This is a super interesting tool for self reflection. Looking at the top 10 similar accounts to mine, it gives me an arms-length view of how other people probably interpret my tone.
I appear to be a well-educated, over-confident know-it-all.
My #3 match is cstross, and now I’m convinced that my life-long secret dream of being a successful sci-fi novelist is basically a matter of typing. (Ideas? Character development? Ruthless editing? Developing an audience? Having a publisher? What do I need of those when the Computer told me I’m practically a genius…)
I'd suggest giving the back story to Agent to the Stars by John Scalzi a glance.
http://www.scalzi.com/agent/
> In the summer of 1997, I was 28 years old, and I decided that after years of thinking about writing a novel, I was simply going to go ahead and write one. There were two motivations for doing so. First, I was simply curious if I could; I'd had up to that time a reasonably successful life as a writer, but I'd never written anything longer than ten pages in my life outside of a classroom setting. Two, my ten-year high school reunion was coming up, and I wanted to be able to say I'd finished a novel just in case anyone asked (they didn't, the bastards).
> In sitting down to write the novel, I decided to make it easy on myself. I decided first that I wasn't going to try to write something near and dear to my heart, just a fun story. That way, if I screwed it up (which was a real possibility), it wasn't like I was screwing up the One Story That Mattered To Me. I decided also that the goal of writing the novel was the actual writing of it -- not the selling of it, which is usually the goal of a novelist. I didn't want to worry about whether it was good enough to sell; I just wanted to have the experience of writing a story over the length of a novel, and see what I thought about it. Not every writer is a novelist; I wanted to see if I was.
Same. Looking through some of the handles on my list tells me that I come across like a not-particularly-well-educated McSmug that needs to take a good long look at myself. Wouldn’t be so bad if I wasn’t reading the posts thinking I definitely could see myself writing this.
This was certainly eye-opening.
Update: It’s actually a little strange that reading through some of the matches it’s not just style that overlaps but perspectives in quite a few cases too. I’m definitely not the unique little snowflake that some others are finding themselves to be.
I also enjoyed reading one of my style-partner’s posts.
The most noticeable similarity is that we both clearly have strong opinions about some things, and like to share information, but also like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.
The downside is, I guess, this could be seen as a bit weasel-word-y or indirect.
> like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.
Commonly called just “hedging” like hedging your bets.
> I appear to be a well-educated, over-confident know-it-all.
Don't we all?
I hate us insufferable nerds!
> over-confident know-it-all.
I’m pretty sure participation in HN is a 99% sure filter for being called this many times in one’s life.
That's what we all come to HN for...
we must be a good match
I'd love a version of this where you enter two usernames and get a match score.
After a few tries on boring accounts, I thought to try the account of somebody who was notorious for an incident outside of HN, and had a (deservedly) bad time at HN for a couple of years before the account went dark.
And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).
Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.
I keep no alternate accounts, but this tool reports best matches for me that appear to be Slavic or just Russian - and I am Russian. Best match score in my list is just above 0.5. There are some clearly alternate accounts on the list, their match scores with this tool are well above 0.7.
It is probable that persons of same cultural origin will have similar writing style and vocabulary. It is also probable that persons of same cultural origin would have same relationships with the world as a whole, they would like same things and dislike other same things.
So, in my opinion, it is possible that you have found not only alternate accounts (score above 0.7), but accounts of people with same cultural origin (ones that are around 0.6).
My highest was 0.41 and the person writes nothing like me. I guess I'm a unique snowflake after all.
I don't have any alternate accounts here either and my writing style is apparently nearly the same as a high profile account that I recognize and has many points. I wouldn't say this is a highly accurate thing.
This tool finds 19 other accounts similar to mine. Those are not my accounts. The scores range from 0.46 to 0.56.
.6 is high confidence? I did my own username, wondering what it would return, since I know I don’t have any alt accounts. The top results are in the .6-.7 range. If they aren’t alt accounts, is it just coincidence that we have similar writing styles?
I think so.
A funny thought — my “matches” cap out at around .56. Having false positives* in a tool like this might feel like a “bad result” but actually I think it just means that if someone were running this sort of tool across the whole internet, I’d be relatively easy to correlate, while your identity would be intermingled with your .6-.7 partners.
*actually they aren’t really even false positives because the tool doesn’t promise to detect alts in the first place, just find similar styles.
> but this functionality turned out way creepier than I thought the moment I tried it
Hopefully this raised awareness means that people who actually need anonymity will be more likely to know to take precautions.
Genuinely asking, what way is there to combat this? Is there a tool that takes out stylistic elements of your comment?
0.6 isn't much. I have 3 matches above 0.6, and they're not me. 20 or so over 0.5.
I get one 0.68 match, which... fair enough. It is an account I've abandoned some years ago, no secrets there.
No other hits above 0.5, so I guess that either makes me pretty unique as a commentator or my English is broken in a unique way.
That's why you manually evaluate the matches. And like I wrote in that comment, I did that manual eval, and these clearly are alts of that main account, not spurious. Narrowing down the pool of accounts you'd need to do this kind of manual evals for by a factor of 100000 is a pretty significant change in capabilities.
Could you elaborate on why it's obvious why you won't name the account?
Maybe to avoid attracting any extra attention to this user? Also, as someone who’s read HN for a few years, it only took me 2 guesses to find an account that the above comment describes (and not necessarily the same person).
They obviously don't want it to be known, seeing as they've got alts to post under and avoid going into too much detail. Being able to go out and do your own research is different than posting the information open for everyone to see at a glance.
I would say it's obvious why one might respect that wish (do unto others...), but I'm also aware that my and my culture's sense of privacy goes further than many others'.
MD5 of the username is 9abc27e93b7e3c04b7c599017c1cfe5f ? The top one seems an odd one out in that case?
Usernames aren't random enough to be safe as a simple MD5. Perhaps with a strong bcrypt, but similar to PIN codes, it might be better to give partial information like "is the second character an ...", assuming nobody else made similar statements. Or give the first ~two hex characters of the hash, so that it would match 1/(16²)rd of the usernames. I'm sure there's also a clever way for a zero-knowledge proof here, probably something with diffie-hellman using the name as your random integer or something, but I'm too sick to think about this stuff right now. Privately sharing data publicly is hard.
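To make the truncated-hash idea concrete, here is a small sketch. It publishes only the first two hex characters of a hash, so roughly 1 in 256 usernames would match. The usernames and the choice of SHA-256 are just for illustration, and this is not a vetted privacy scheme.

```python
import hashlib


def hash_prefix(username, hex_chars=2):
    """First few hex characters of the SHA-256 digest of a username."""
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


def candidates(usernames, prefix):
    """All usernames whose hash starts with the published prefix."""
    return [u for u in usernames if hash_prefix(u, len(prefix)) == prefix]


# Hypothetical list standing in for every username on the site.
all_usernames = [f"user{i}" for i in range(10_000)]
published = hash_prefix("user1234")

matches = candidates(all_usernames, published)
# With 10,000 names and 256 possible prefixes, a few dozen names should match,
# so the hint narrows the search without pointing at a single account.
print(published, len(matches))
```

The caveat from the comment still applies: the hint only stays vague if nobody else publishes further partial information about the same account.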
> quick browse of the comments of the recently active ones, they look really likely to be alts.
Hmm isn't a spot check of comments somewhat tautological, since that is how the tool identifies alts (rather than something like IP address or time of day)? If this had been promoted as "find accounts with similar writing style to yours" would people immediately assume alts?
I would presume that OP is referring to the actual content of the comments. This just does stylometric analysis, which looks at word choice, but not what the arrangement of the words mean.
If some accounts are found to be stylometrically similar, and then a visual inspection also shows them all stating similar opinions, that latter piece of data is a strong signal.
It would be nice to make the names clickable.
I don't think the list of pg alternate accounts is accurate. I checked a few. They have many one-liners, which is typical of pg, but the topics and style don't look similar.
I searched a few more and got better results. :)
I searched myself (I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.
> I searched myself (I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.
It's based purely off frequency of the 200 most common English 1 word phrases, 2 word phrases, 3 word phrases, 1 character sequences, 2 character sequences, and 3 character sequences. Topic does not really have anything to do with it. If I had more time I probably would've done a smarter model that accounted for things like that.
One is also a mathematician. It's trivial that we overuse some technical words even if it's unnecessary.
Another is from Argentina, so I guess the native language leaks, for example using words derived from Latin that are not idiomatic.
And there are a few more that it is an honor to be "confused" with, but I have no clue why.
Cool stuff, thank you for sharing your findings!
I don't do throwaway. I either post or STFU. I also STFU on darknet. It's why I found it fun to read/lurk on things like I2P back when it was new. And I know that on a pseudonymous account it is only a matter of time until it can be linked to another pseudonymous account. It would not surprise me if stylometry was used on Dread Pirate Roberts or the people behind The Pirate Bay or the people behind Wikileaks (Assange's sockpuppet accounts). Such can also have been used to verify afterwards instead of beforehand. Though with TPB since it was on clearweb an advanced adversary could have used correlation/timing attack to figure who wrote what.
I'm having fun times recognizing other Dutch people through their usage of the English language. For example, a distinctive word I see Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a red flag the person is native Dutch. I wonder if there are stylometry tools available for figuring if someone used physical vs touchscreen keyboard (I used Glider to write this post, spellchecker unavailable).
And yes, organizations like secret service and police should use such tools as well. It is a known tool, why not use it for good? As with any tool, it can be used for good and evil. On HN this could be useful for the mod team (AFAIK nowadays only dang) to find banned people's sockpuppets. Cross-community could also be a fun project: find a HN user's Twitter or Reddit account. And I hope this method is also used to find Russian trolls on social media.
Most people greatly underestimate the power of linkage attacks on anonymity. And it doesn't even take fancy ML. In the context of healthcare records, I like to trot out this 25 year old example of an MIT grad student and the then-governor of MA.
https://ischoolonline.berkeley.edu/blog/anonymous-data/
The top hit on my list looked familiar. I looked at their recent comments and saw a discussion between that user and me. We were quoting each other directly throughout.
I wonder if this explains our similarity. And if so, could we tweak the algo by e.g. removing text that is prefixed with ">"?
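A rough sketch of that tweak, assuming comments are available as plain text in which quoted lines start with ">" (the function name and sample comment are made up):

```python
def strip_quotes(comment):
    """Drop lines that quote another commenter (those starting with '>')."""
    kept = [
        line for line in comment.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept).strip()


# Hypothetical comment that quotes its parent before replying.
comment = "> the cat's out of the boot\nIt's my first time hearing that variant."
print(strip_quotes(comment))
```

The raw comment data may encode the marker differently (for example as an HTML entity), in which case the check would need adjusting before computing any frequencies.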
The scary thing is that once you have this data, finding HN matches for individual targeted users on other sites becomes trivial, even if those sites are harder to scrape. I bet most people here have an anonymous Reddit account, for example. If you wanted to know who was behind a particular Reddit account, you could feed it into something like this and compare the results with HN, where accounts are less likely to be anonymous. Or build a database based on blogs, Github comments, etc.
Also, since this uses only word frequency, there are probably relatively easy improvements to make that would make it even more powerful, like looking at particular runs of words that are unique. Some expressions or figurative language only show up in combinations of words, and tend to be highly style specific.
I could have used a part of speech tagger, looked at time of day a user posts, capitalization, spelling errors, etc. From what I understand the state of the art is lightyears ahead of this, there are even companies with actual linguists who will act as expert witnesses in court to say stuff like "we can say with 95% certainty that xyz authored this email." Honestly it's kind of scary. There are papers that talk about cross platform authorship attribution, one I think did it with Twitter, Blogspot, G+ and had pretty good results.
Thus proving the only actually anonymous community in practice is 4chan, and that’s why it’s so toxic.
If you define “toxic” as “people disagreeing with you”, sure. That was what the entire internet was like until maybe 2005.
Forget the alternate accounts — if two users are close in style, there’s a decent chance they should be friends. This is an HN friendship machine.
It would be convenient if the usernames linked to the comment pages on Hacker News (to avoid having to copy/paste and URL hack, which is made even slightly more annoying because for some reason when I tap and hold the usernames to copy them your markup--I haven't looked at why yet--is causing an extra space character to get copied on the left).
This is interesting.
I'm 0.566 correlated with logfromblammo -- and while we are definitely not the same person, I could easily imagine writing a sentence such as:
"For some bizarre reason, management has not yet assigned a task to their programmer underlings to automated themselves out of existence. I can't imagine why."
which is theirs, not mine, from about a year ago. I like that.
On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.
> On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.
This is due to the Firebase API not updating when users ask the admins to move their comments to another account.
Yeah, I got a good match with my previous nick here. Which to me proves the tool works well.
I had a similar experience finding my most likely alt (.50 suggesting I am a unique snowflake as I have always thought :-), my most likely alt is writing certainly in a style I appreciate and on subjects I often mention.
How about this for countermeasure:
As you're typing out a comment the software gives you a list of accounts you're becoming similar to. That way you can adjust your writing as you type.
Someone linked it in the thread: https://github.com/psal/anonymouth
Forget countermeasures, go covert. Write a comment, have the comment be rewritten before submission in order to resemble a targeted account.
Sounds great, except there are many different similarity measures. Which one does the algorithm use?
Why not all of them? Which metrics are closer would tell you which aspects of your writing you need to focus on.
This found an alt that I created specifically to see if I could write artificially to defeat this kind of analysis. I have seen other tools like it posted to HN, but none before had found that account. I guess I need to up my game.
If you don't mind sharing, are you "writing artificially" purely in your head, or are you using techniques like intermediate translations?
No mechanical means, but I have referred to a thesaurus occasionally. Mostly I tried to change my sentence structure, not just words. It requires actually thinking differently, in a way. Which makes it difficult to know how well I'm communicating.
See also: https://serhack.me/articles/unveiling-anonymous-author-stylo...
That post was actually what motivated me to make this. I'm on your email list :)
WOW! It's such a pleasure for me
Ahhh, anyone remembers this hacking crew who leaked BLUEETERNAL and other NSA tools and exploits? Shadowbrokers.
They were always communicating in some kind of meme-russian, and their texts were funny to read. [1]
I believe their writing mostly defeated this kind of analysis, at the cost of looking like idiots (which was probably the reason no one sent them crypto-dollars to buy that stuff exclusively).
Here's an excerpt:
"Attention government sponsors of cyber warfare and those who profit from it !!!!
How much you pay for enemies cyber weapons? Not malware you find in networks. Both sides, RAT + LP, full state sponsor tool set? We find cyber weapons made by creators of stuxnet, duqu, flame. Kaspersky calls Equation Group. We follow Equation Group traffic. We find Equation Group source range. We hack Equation Group. We find many many Equation Group cyber weapons. You see pictures. We give you some Equation Group files free, you see. This is good proof no? You enjoy!!! You break many things. You find many intrusions. You write many words. But not all, we are auction the best files."
[1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...
*EternalBlue
Have you tried including parts of speech (for example, as bigrams and trigrams) as part of the features considered in your model? I’ve had great success with stylometry that goes beyond TF-IDF with bags of words; including grammar patterns was shockingly good.
(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)
Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
> Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
That is a very good idea and when I update the site that will almost certainly be included :) Any other tips? Been reading papers for ideas and I think I may have to ditch the cosine similarity and go for something fancier soon. Thank you
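A minimal sketch of the POS-gram idea discussed above, assuming a tiny hand-written lexicon (`TOY_TAGS` is hypothetical; a real system would use a trained part-of-speech tagger). It shows that two sentences with no shared content words can still produce identical tag bigrams:

```python
from collections import Counter

# Hypothetical toy lexicon, for illustration only.
TOY_TAGS = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN", "ball": "NOUN",
    "chased": "VERB", "saw": "VERB",
    "big": "ADJ", "red": "ADJ",
}

def pos_tags(text):
    """Map each lowercased word to a coarse tag; unknown words become 'X'."""
    return [TOY_TAGS.get(word, "X") for word in text.lower().split()]

def pos_ngrams(text, n=2):
    """Count n-grams over the tag sequence rather than over the words."""
    tags = pos_tags(text)
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

a = pos_ngrams("the big dog chased a red ball")
b = pos_ngrams("a red cat saw the big dog")
print(a == b)  # the grammatical pattern matches even though the words differ
```

The resulting counts could then be weighted (for example with TF-IDF) and compared the same way as word counts.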
How long until this becomes the algorithm for a dating site?
“Find hot single women who write just like you”
This seems like a great way to hire freelance copywriters/ghost writers too. I would absolutely hire someone I knew could match my tone well for writing generic unattributed copy.
Wouldn't be surprised if dating sites already used similar algorithms.
Do dating sites really use clever algorithms to match up people together? I was under the impression that, the less likely you are to meet your perfect match, the more you're going to use the app.
In my experience I don't see a relevant list of potential matches aside from gender and age preference; it's all completely random, and I even frequently see people outside the settings I've specified (i.e. men or older women).
Wouldn't be surprised if most of the women on a specific dating site had very high similarity scores.
This is one reason why I like legal doctrines such as "beyond a reasonable doubt." Even a 0.9 match in a tool like this could be a coincidence, if there are millions of users. But that won't stop people from casually believing "aha it must be an alt account", based on some anecdata.
It's so easy for something like this to be turned into a tool for a witch hunt, targeting innocents.
But a 0.8 or 0.9 match and something like Tor usage could be enough to justify a warrant. That's why I'm not sure I want to open source the code because I don't want to normalize this.
Keep in mind the potential to create false accusations by fabricating similar looking accounts.
Hmmm, doesn't seem to work. But you have convinced me (and many others?) to search our alts consecutively, so now you do know who has alts?
I wonder what's a reasonable threshold for "probably the same person". I've never had an alt on HN, and when I searched myself, it found 3 other users above 0.6, none of whom I've ever heard of before.
If it's >0.9 you can almost guarantee it's an alt, but I've seen certain matches at 0.6. The problem is writing styles change over time. Another idea I had was converting the scores, which are just cosine similarity scores, into percentiles (so 0.99 would be 99th percentile of certainty) to make them more human interpretable.
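One way the percentile idea could work, sketched under the assumption that a sample of similarity scores between unrelated accounts is available (the `background` list below is made up for illustration, not data from the site):

```python
from bisect import bisect_left

def percentile_rank(score, background):
    """Percentage of background scores strictly below `score`."""
    ordered = sorted(background)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Hypothetical similarities between pairs of accounts known to be unrelated.
background = [0.12, 0.18, 0.21, 0.25, 0.30, 0.33, 0.38, 0.41, 0.47, 0.62]

print(percentile_rank(0.45, background))  # 80.0
print(percentile_rank(0.90, background))  # 100.0
```

A percentile like this says how unusual a score is among non-alts, which is easier to read than a raw cosine value but is only as good as the background sample.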
I make new accounts every so often and the accounts of mine that it found have a score of around 0.3. I'm not actively trying to defeat stylometry but it's possible I just have a particularly unremarkable writing style.
The people at 0.4-0.6 with me do share some interests. That's cool on its own.
>The problem is writing styles change over time.
It would be interesting if we could plot the writing style divergence over time.
I got matched with my old account with a score of only 0.45
I have no alts. The highest match for me is about 0.66.
Interesting. The highest non-me account is under 0.4 on my page. I do not believe that I have such a unique writing style - especially since half my posting is on mobile and therefore possibly slightly different than my desktop posts.
My closest is 0.4879. I know I tend to be wordy but I thought I had a pretty generic style as well. This is definitely a fascinating demonstration.
0.6 is not high enough to indicate an alt
Oh wow, it's really sure that I'm stavrosk, which I am:
https://stylometry.net/user?username=stavros
The next person is 30% less certain, that's huge! This would basically identify any alt I might have with near certainty.
Funny thing is, it thinks I'm you, but it doesn't think you're me!
https://stylometry.net/user?username=rogual
I'd have thought this stylometry thing would be commutative.
I guess it's a multidimensional space, so you can have someone closer to you than me, but they aren't also closer to me than you. Basically, they're close to you, but on the "other side" of me, I guess?
The word you are looking for is "symmetric".
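The effect described above can be shown with a small sketch: the similarity score itself is symmetric, but the ranked lists built from it need not be. The table `sims` is invented for illustration:

```python
def top_k(user, sims, k):
    """Return the k users most similar to `user`, best first."""
    others = [u for u in sims if u != user]
    return sorted(others, key=lambda u: sims[user][u], reverse=True)[:k]

# Hypothetical symmetric similarity table: sims[x][y] == sims[y][x].
sims = {
    "a": {"b": 0.60, "c": 0.20, "d": 0.10},
    "b": {"a": 0.60, "c": 0.90, "d": 0.80},
    "c": {"a": 0.20, "b": 0.90, "d": 0.70},
    "d": {"a": 0.10, "b": 0.80, "c": 0.70},
}

print(top_k("a", sims, 1))  # ['b']: b is a's closest match
print(top_k("b", sims, 2))  # ['c', 'd']: a does not make b's list
```

Here `a` and `b` score 0.60 in both directions, yet `b` has closer neighbours of its own, so `a` never appears near the top of `b`'s page.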
stavrosk doesn't have any posts/comments? What's it using to match?
It's my old username.
This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?
HN has an Algolia-based API. It’s also very easy to crawl.
I wouldn’t call this evil, however: it’s merely demonstrating a technique that you should be aware of, if you’re a privacy-conscious person. It looks like they also provide some resources for avoiding stylometric detection.
I would bet my bottom dollar that the likes of Reddit and Google already have models to turn a corpus of text into probable demographic data and models to measure the similarity of users.
Please don't shoot the messenger. costco shared this voluntarily and I can see no bad intention.
We should see it as an opportunity to learn how easy it is to associate different pseudonymous accounts. Nothing drives this point home better than a practical demo.
We can be pretty sure stylometry is used widely by bad actors already and we should not punish people who help to spread the word about these technical possibilities.
And this is actually quite a simple approach--which is interesting in and of itself. While there would be diminishing returns, there are a ton of other techniques you could use to make stronger inferences about similarity.
> This is an evil website. We won’t have any anonymity soon. The highest match is my years old banned account that I forgot about. Where did you get the data from?
I'd way rather have someone tell me "look at all the things I can find out about you" so that I can act accordingly (whatever that means!) rather than what we've mostly actually got, which is companies silently exploiting my data and doing everything they can to mumble reassuring but legally ineffective formulas assuring me that they deeply respect my privacy.
HN Firebase API. I just wrote a program in C++ with libcurl to get https://hacker-news.firebaseio.com/v0/item/1.json, https://hacker-news.firebaseio.com/v0/item/2.json, https://hacker-news.firebaseio.com/v0/item/3.json, ...
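The author did this in C++ with libcurl; the following is only a rough Python equivalent of the same sequential crawl, not the author's code. The helper names (`item_url`, `fetch_item`, `group_comments`) are hypothetical, and the item fields used (`by`, `text`, `type`, `deleted`) are those of the public HN API:

```python
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    """URL of a single item in the HN Firebase API."""
    return f"{BASE}/item/{item_id}.json"

def fetch_item(item_id, timeout=10):
    """Fetch one item as a dict; the result may be None for ids with no item."""
    with urlopen(item_url(item_id), timeout=timeout) as resp:
        return json.load(resp)

def group_comments(items):
    """Collect comment text per author, skipping missing or deleted items."""
    by_author = {}
    for item in items:
        if not item or item.get("type") != "comment" or item.get("deleted"):
            continue
        if "by" in item and "text" in item:
            by_author.setdefault(item["by"], []).append(item["text"])
    return by_author

# Usage (performs network requests, so it is left commented out):
# corpus = group_comments(fetch_item(i) for i in range(1, 101))
```

A full crawl covers tens of millions of ids, so a real crawler would want concurrency, retries, and polite rate limiting.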
Why didn't you use Google BigQuery?
https://news.ycombinator.com/item?id=10440502
I don't know that I'd call this evil. We have no idea who else is using this kind of technology but not making the results public. Better to know what's possible and take measures to make it less effective.
It’s just statistics. I recall that during his whistleblowing, Snowden intentionally took anti-stylometry measures.
Imagine using this across different platforms :/, let alone combining it with other techniques...
edit: maybe you'd catch some criminals if you tried to match reddit against dark web for example
Interesting that the Op doesn't come up in the search: https://stylometry.net/user?username=costco
Their first comment and submission were 4 hours ago. Text on the page is accurate it seems.
Not surprising considering the account had no activity before today.
My nearest match is only at 0.406. It'd be interesting to see who the most unique commenters are, but it's also quite possible it wouldn't be flattering.
0.35 is my nearest. In hopes of lowering it even further, here are some nonsensical opinions never expressed on HN before: 1) Programming peaked with COBOL 2) Paul Graham is responsible for 90% of SIDS cases 3) There's no reason to use car when cdr exists.
0.2506 is my nearest match
That's the lowest I've seen yet. You must write uniquely :)
I have no alternative accounts besides making a single throwaway account to post one "Ask HN" five years ago, but I have a decent number of matches above 0.5. I think this is due to the relatively uniform style of "who is hiring" posts, since my matches did that in a similar way for other companies. I made many of those for about two years when I was at a start-up.
On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?
Also think it's probably poor form to list users as examples without their permission.
> On the how to avoid section: Isn't running comments through a randomised translator a few times then back considered a countermeasure also?
Yes.
> This may be out of line but isn't pg on here with a different username, Levenshtein distance of one that's not included? Or is that just a very motivated 13yo account who writes a lot of admin-esque comments.
What other pg account are you referring to? I want to see it so I can see what my algorithm missed.
> Also think it's probably poor form to list users as examples without their permission.
You're right. I'll remove that - I just wanted some examples especially for people on phones who don't feel like typing. Thanks for the feedback.
> However, using automated methods like machine translation services do not appear to be a viable method of circumvention.
https://www.whonix.org/wiki/Stylometry
It found my old account (ara4n; i lost the password) at 0.63. More amusingly it found my cofounder too, who hardly ever posts here (at 0.48)
> ... This site works primarily by analyzing for each user the frequencies of the most common words and phrases in the English language. Accordingly, the easiest way to avoid being identified is to simply use different words than you ordinarily would when writing. More sophisticated models than the one I made can use punctuation, comma usage, and capitalization to identify you so try alternating those as well. Services like Quillbot can help you with this but depending on your circumstances you may not want to send your writings to a third party service.
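A minimal sketch of the approach the quote describes: relative frequencies of common words, compared with cosine similarity (which the author mentions using elsewhere in the thread). The ten-word list `COMMON_WORDS` is an assumption for illustration; the site's actual feature list, which also includes phrases, is not shown in this thread:

```python
import math
import re
from collections import Counter

# Hypothetical feature list, far shorter than anything a real system would use.
COMMON_WORDS = ["the", "of", "and", "to", "a", "in", "is", "that", "it", "i"]

def features(text):
    """Relative frequency of each common word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in COMMON_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

alice = features("I think that it is the best of the options in a way")
bob = features("It is the worst of the options and I think that is a pity")
print(round(cosine(alice, bob), 3))
```

Because only function words are counted, the score mostly reflects habits of phrasing and much less the topic being discussed.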
HN offers many other threads which could be tied together, including:
- time of posting
- ratio of replies to top-level comments
- comments being mainly upvoted or downvoted
- sentiment (mostly angry, dismissive, questioning, etc.)
- most common topics (keyword analysis of post being replied to)
- ratio of new posting to post replies
- first-to-comment on a post
- lone comment on a post
- etc...
It seems very likely that sooner or later every pseudonym for posting content will get discovered and linked. The lesson here is don't post anything that would cause you undue shame or harm if linked directly to your legal name.
Well now I'm self conscious about my closest match being an 0.34 when so many other people are reporting much closer matches with accounts that aren't alts. Do I write weirdly?
Same for me, the closest match is 0.36. But I expected that because I don't speak English very well so the pool of candidates is small.
.31 here! I'm a non-native speaker tho, so it wouldn't surprise me if I had weird speaking habits
My closest is 0.40, so I’m right there with you.
Native English speaker as well.
0.36 here! Out of curiosity, are you a native speaker?
I am, yes.
What does the bold signify? For example when I search for dang (https://stylometry.net/user?username=dang) the 4th most likely user is not bold whereas the 16th is?
Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always).
Huh, that's a somewhat non intuitive property.
And this is why I’m a reader and not a poster on HN :)
The second that I found out that requesting deletion of an account and its posts needed a MANUAL request to a single user (dang) I noped out so fast
But happy that the rest of you are still happy to contribute :)
I really liked the informative and straight-to-the-point about page - describing how the algorithm works in a way that is easy to understand. All the important details are summarised there. Well done!
Edit: From the "How to avoid .." page, there is the following sentence:
> Also, most authorship identification algorithms have poor accuracy when working with small amounts of words. This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.
Can you clarify what this means and why it would result in a ban?
> Can you clarify what this means
Imagine that for every new comment you want to post you would create a brand new account which you would use precisely once and never again. Then the stylometry would have just a few words and wouldn’t have enough corpus to get a reliable signature. If a lot of people do this it would be hard to figure out which account belongs with which human. (Of course if you alone do this, your messages will stick out like a sore thumb. See xkcd 1105.)
> why it would result in a ban?
Because this practice is especially discouraged in the guidelines: “please don't create accounts routinely. HN is a community—users should have an identity that others can relate to.”
At the same time, HN doesn't let you delete comments.
Maybe with some GDPR magic.
> Can you clarify what this means and why it would result in a ban?
I have seen dang respond to users multiple times asking them to stop making new accounts especially but not always if it's to avoid rate limiting. I don't know if there's an official policy but it's definitely something I recall.
Just a heads up that for everyone who doesn't like to link their alt accounts, maybe not use this tool to see if it works.
Unless the author would run this against all HN user accounts, no need to flag the ones "of interest".
Have you done any data analysis on distributions of similarity? How similar you'd expect any 2 people to be given English focused around tech? Or any other interesting stats you'd like to share?
Very nice clean site, great work.
What match level would you expect to see between two randomly chosen individuals?
It's accurate enough that I had to create a new account now :)
I guess it's difficult to evade it, as the word frequency certainly catches everything about the countries I frequently refer to, programming languages, interests etc.
Similar to how they make adversarial fashion[0][1] in order to not be tracked by face id AI, I wonder if we can make adversarial stylometry tools to run your comments through in order to anonymize it
.. [0] https://hackaday.com/2022/10/20/render-yourself-invisible-to...
.. [1] https://adversarialfashion.com/
OP links to a paraphrasing tool on their website.
This is absolutely bonkers. I tried it with my alt and it got my original correct! So I'm writing this comment with a fresh account which hopefully will not get correctly linked too lol
Did something similar in 2018 (still running locally) which could unmask anyone
https://twitter.com/austingwalters/status/104189476543920128...
Made both Metacortex.me and insideropinion.com
The idea being you don’t actually need an active directory. It would drop in, figure out all the users (provided one account was on the AD) and would monitor everyone’s skill sets, morale, schedule, etc. Worked super well for what it was / is.
Neat work!
Out of curiosity: do you filter sentences that begin with ‘>’, indicating a block quote from another user? That might improve the accuracy a little here, if you don’t already.
Yep!
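A sketch of the kind of filter being asked about, assuming the comment is already plain text (text from the HN API is HTML, so it would need unescaping first). `strip_quotes` is a hypothetical helper, not the site's code:

```python
def strip_quotes(comment):
    """Drop lines that start with '>' after optional leading whitespace."""
    kept = [ln for ln in comment.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept).strip()

comment = "> Where did you get the data from?\nHN has an API.\n  > Is it evil?\nI wouldn't say so."
print(strip_quotes(comment))
```

This only catches the `>` convention; quotes marked with italics or quotation marks would still leak another user's wording into the profile.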
Perhaps explain in the about what you filter out? Along with what the bolding means?! Do you filter out anything else (like spaced/indented/monospace text/code, or even quoted text, which is often not written by the user?). Super thanks for this - interesting!
Sorry dang, aka sctb: https://stylometry.net/user?username=dang
In this particular case, it seems to be picking up the stock moderation responses as it looks like sctb was a moderator account until 2019.
I don't have an alt but it would be cool to meet my stylometry-neighbors. I'm curious whether the writing similarity translates to oral communication too
I tried dang's old account (gruseom) expecting to see his dang account listed. Nothing. Tried dang, sctb (a previous admin) was listed as closest match.
I wouldn't rely on these results
https://stylometry.net/user?username=gruseom
https://stylometry.net/user?username=dang
> I wouldn't rely on these results
You picked a user who posts a massive volume of repeat, template-y comments and found their former colleague who also posted piles of repeat, template-y comments, that being part of both of their jobs.
There are a few close matches to dang's style of template-y comments in the results. Afaik none of the listed accounts are Daniel.
I picked dang as he is the figurehead of hn, and didn't want to inadvertently reveal some other user's identity.
Entering "antirez" shows accounts with Spanish names (none is mine). I guess Italian and Spanish speakers write English very similarly, but on HN there are a lot more Spanish speakers than Italian ones so that's what I get.
It seems the accuracy for nonnative speakers is not nearly as good as it is for native speakers. The algorithm could definitely use some work.
Tried my account thinking "I don't have any alts" but it turns out I do! In 2018 I changed my username from "cbr" to "jefftk" and it pulled that right up: https://stylometry.net/user?username=jefftk
Rebrand it as a soulmate-finder?
Well done, it found my ancient old account.
I only got 0.9999999999999992 for myself :(
Naturally Born Imposter
Honeypot to see what accounts are tested in sequence?
;-)
I turned off nginx logging if that makes you feel any better. Of course there's no way for you to verify that because I'm just a random guy on the internet but I will tell you that I am a civic minded citizen who is concerned about privacy and the Internet.
Only half kidding, but if I were state intel it’s what I’d be doing. :D
Ingenious idea. At the very least, this is just about finding people who write like us, the same way we seek those with similar tastes (music...)
How long before large commercial indexers start offering an efficient (AI based ?) stylometry to agencies and states ?
wait... do you think the NSA is already doing this?
They would be silly not to ( apart from creepish profiling of an entire globe population you also get to potentially identify bots ). We all have mannerisms that can easily 'betray us' online. I honestly thought my writing style is more unique, but as it turns out it is somewhat common.
It isn't writing style, but more of phrase selection. If you lean on the same phrases (n-grams), then you will be very very close in a high dimensional space. Colloquialisms are the biggest tell, you should eschew them.
> I honestly thought my writing style is more unique
You just showed another possible use case for this kind of tools: "How unique is my writing style ?"
Stylometry is an old hat technique; you can assume that intelligence services around the globe regularly apply it.
(Statistical stylometry is a little newer and more rigorous than manual stylometry, which essentially involved a human being's judgement call around the similarity of documents.)
What about "deep learning" stylometry?
Site down? I'm keen to see if it catches my alts.
Apologies for the downtime. Something crashed while I was asleep, should be working now. Not really sure how because the log indicates that uwsgi "gracefully exited," but I'm looking into it.
Same here, 502 consistently.
Apologies for the downtime. Something crashed while I was asleep, should be working now. Not really sure how because the log indicates that uwsgi "gracefully exited," but I'm looking into it.
Since it looks for similar word usage, false positives seem to appear more often when specific topics are talked about, like stocks or crypto.
Does this ignore stop words? Or do all words have the same weighting? I wonder if only focusing on stop words would give a more accurate measure. Maybe we are more comfortable with certain stop words more than others?
https://en.wikipedia.org/wiki/Stop_words
"Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant."
All words have the same weighting. I don't ignore stop words; in fact most of the n-grams I use are composed almost entirely of stop words. Maybe it'd be more effective if I ignored them.
1. Interesting. I was kinda expecting to be grouped with other Russian speakers, and I am (based on some nicknames). Probably the frequencies of “the” and “a” are telling. But I swear to God that I sometimes spend some extra time trying to insert as many “the” and “a” in my texts as I can.
2. There is a Russian mnemonic verse, which can’t be properly translated to English, at least it’s beyond my humble capabilities. It goes:
“Это я знаю и помню прекрасно:
Пи многие знаки мне лишни, напрасны”
The numbers of letters in the words give you the digits of pi: 3.1415… The meaning is: “I know and remember perfectly: too many signs (positions) of pi are useless and impractical”. Sometimes it’s nice to remember both things.
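The letter counts in the verse above can be checked with a few lines of Python (a quick sketch; punctuation is ignored by counting only alphabetic characters):

```python
import math

verse = "Это я знаю и помню прекрасно: Пи многие знаки мне лишни, напрасны"

# Count only letters in each word so that ':' and ',' are not included.
digits = [sum(ch.isalpha() for ch in word) for word in verse.split()]
mnemonic = f"{digits[0]}." + "".join(str(d) for d in digits[1:])

print(mnemonic)                            # 3.14159265358
print(str(math.pi).startswith(mnemonic))   # True
```

The twelve words give pi truncated (not rounded) to eleven decimal places.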
Nice work! Thank you, of course I plugged in the obvious HN usernames
Edit to add;
Would be nice to have the https://news.ycombinator.com/user?id=username links included.
And perhaps rounding to 3 or 4 decimal places?
Amazing and I thought my doxxing tool was terrifying - https://news.ycombinator.com/item?id=32278871
I am afraid to combine all these methods
Yea.. i guess it's time to stop bothering with alt accounts/etc. I'll just make one account, maybe differently named on different services (makes scraping just a _pinch_ easier) but aside from that all i can do is modify/remove old posts.
Bit of a shame for useful posts/discussions.. but the internet is getting really.. finger print laden.
Incredible! There was a very active throwaway account here a while back that I always enjoyed interacting with. I suspected the person had more than one account and this found one that is incredibly close, down to the topics.
I checked a few random user names and I am confused.
- Why is the author costco[0] not in this lookup?
[0]: https://stylometry.net/user?username=costco
- Their first comment and submission were 4 hours ago.
- The text on that page is accurate it seems.
I played a little bit with it and it is baffling how well it finds accounts of people that know each other in real life. So it's not only good for finding alternate accounts but could be used to find peer groups.
Interesting, they are trading phrase-grams (just made that up) or lingo. That is really cool.
This doesn't seem to include text from submissions.
I ran it on Brian Armstrong's temp account from here, and it said it didn't write 10,000 characters:
https://news.ycombinator.com/item?id=3754664
EDIT: Or maybe it's something else because Brian only wrote less than 6k characters. But then why can my account be looked up?
Also, I would guess quoted replies are included, which muddies the analysis. Seems to be a very naive implementation. Much more can be done, but this was probably just a quick project.
Quoted replies shouldn't be included unless there's a bug on my end. Submission text is not included though I probably should have.
How much should we fear de-anonymisation ?
A lot of the discussion in this thread is about "how can we prevent this". I would like to know why we should not embrace this and similar technologies?
The benefits in my view are large - online behaviour tracks back to real life - and epidemiologically speaking the value of millions of test subjects across every question is enormous - from traditional medicine to "mass psychology recommendations"
I can guess some downsides (hiding from abusive exes) but am interested in studies, surveys, reports etc - any HN thoughts welcome
Fear it happening or fear its consequences? Doxxing already happens all the time, but the main tools are things like account names or image search; this sort of tool could take it to a new level. A simple experiment would be to run this same algorithm against another site (say Twitter or Reddit) and see if it can reliably pick out the same people's accounts there. Once anyone on the internet can quickly/easily draw that sort of connection it would require incredible diligence to avoid de-anonymization while still maintaining any sort of "real self" presence on the internet. How much we should fear the consequences probably depends a lot on how marginalized you are within your society, but since just revealing your gender is enough to invite harassment in many forums I'm not optimistic.
>online behaviour tracks back to real life
This is good to you?
Okay, let's just make it like China or SK where your login is your citizen ID and if you write bad things the bad word police will take you away.
Also, no, I have no alts.
So I am asking because my views are only challenged inside my own head, hence the need for external thoughts.
But firstly the "governments will come and do bad things" argument - yes this is clearly and obviously a major problem - but not one solvable by technology in any way. Fixing violent dictatorships is an IRL problem - one that requires enormous effort and sacrifices (see Ukraine for obvious example). We cannot pretend that a browser extension or a ground up rewrite of Twitter will defeat Putin or would have stopped Hitler.
As for "free" countries (something like 120+ have open free elections), we still have online abuse for voicing opinions that some people don't like (anything from pro/anti Trump to LGBT and bitcoin etc). Those are real consequences but rarely government inspired and honestly I suspect we need better support for police in prosecuting such things - I mean a death threat is a death threat.
In general my view seems to be we should have the same protections online as we do offline - and if those protections are "in theory only" that requires us to use our voting and other political power to change it - not to obfuscate IP addresses or so on.
The upside of tech is so great it is worth spending IRL to defend against the downsides
What could possibly be the harm in allowing people to harass others based on posts they made decades ago? What could possibly be harmful in making a person who for whatever reason has changed their online identity easier to track? What could be remotely harmful about allowing Marlboro to find the accounts of ex-smokers? What could be the harm in tracking underaged users site by site?
I'm sure this is completely harmless and will not harm society.
I think this might be old age creeping up on me but I find it harder and harder to work backwards through "argument by sarcasm" to arrive at what you meant. I think clearly you are heartfelt in your views that having your identity online be a real one is bad - but I am not sure if that is because of posts you made years ago being linked back to you or nefarious advertising ?
The old posts issue is interesting - do you mean that there are posts from years ago you would find upsetting to be linked to you? Is this because you have changed your mind (a normal process society needs to understand) or because you said things thinking you were anonymous that you would not have said under your real name? Far less of a social issue I think.
It does make for some interesting thoughts if we made everyone post under their real name.
Amusingly can't run it on the author since not enough comments
I have only ever had a single account but it returned 19 possibles with no confidence above .54 but 11 bolded. My own account was listed at the top with a confidence of .9999.
Yeah, I have a bunch of bolded mutuals but none above 0.45. I think I have had one or two alts in the past, but probably they didn't make the 10000 word threshold for inclusion (nor can I remember their names to check if they work in inverse).
Why are some users bold?
Say you see user2 listed in bold on user1's page. That means that user1 is also in user2's top 20 users. In my experience it is often an indicator of a good match (but not always). I should probably explain that on the site.
Instead of making it binary, you could use a gradient indicating the strength of the mutual correlation (like how HN colors downvoted comments).
The non-bold are dead accounts I think
It isn't due to a mere property of the user, as, for example, cushman is not bold as the #2 result for tptacek but is bold as the #2 result for icambron.
The bias is interesting here.
https://stylometry.net/user?username=nickstinemates
Number 2 for me is someone I worked closely with for a few years, and then putting his name into this results in all of the people we worked with for a few years. So it seems content>style, or, we are all more alike than we thought.
I'd be very curious to know if these algorithms can link very different types of text. I'm not surprised that my style is "derivable" on HN, but what if you included my slash-fic pieces, my research papers, etc, would it still "catch" me?
Also, talk about a chilling effect. I was already vaguely aware of this, and now I'm overthinking every word I'm thinking/typing.
I'm gathering that they just took a bag-of-words approach to this; basically comparing word frequencies. Writing across content types (fiction vs technical writing for example) will probably show different word frequencies, especially technical jargon, and so on. More sophisticated approaches are possible.
And yes, potentially very chilling. If you want to post truly anonymously, you might want to run your words through some kind of filter first.
Oh god, that thing starts with direct focus on the search field, opening it showed a bunch of old nicknames, I thought it was the result of some study.
The top hit for me, though not a very high correlation (0.3 ish), is to my surprise someone I have met. I don't appear on their top 20 though.
Can we find Satoshi with this?
A few people have tried that e.g. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184...
I interviewed years ago with someone who let me know that they use a pseudonym as an employee and their chosen name even got posted as the author for articles they wrote for the company. They were very concerned about their privacy.
I know their blog, which is their HN username, and this tool found their other account.
Perhaps ironically, this person stood out a lot because of this and I didn't forget them.
It's funny that I only match at 0.9999999999999982 with myself while all other username I tried matched with themselves at 1.0 ^^.
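A self-match slightly off from 1.0 is most likely ordinary floating-point rounding rather than anything about the account. The sketch below assumes the score is a plain cosine similarity computed with square roots (the site's exact code is not shown in this thread); depending on the vector, the result may land exactly on 1.0 or a hair to either side of it:

```python
import math

def cosine(u, v):
    """Cosine similarity computed the straightforward way."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

v = [0.1, 0.2, 0.3, 0.7]   # arbitrary example vector
self_sim = cosine(v, v)

print(self_sim)            # 1.0 or something within rounding error of it
print(round(self_sim, 4))  # 1.0 once rounded for display
```

Rounding to three or four decimal places for display, as suggested elsewhere in the thread, hides the discrepancy.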
https://theuijunkie.com/myth-or-fact-did-charlie-chaplin-los...
Huhu
Sticking myself in (I haven't ever had another account) my closest match (at 0.43) is the maintainer of an Open Source project which I have occasionally commented about. They are also British, as am I.
My guess is that as they commonly mention the project and I have on a number of occasions, that has formed the link. Plus maybe usage of common British terms, but that seems far less significant.
It's super interesting!
It would be good if there were more controls to filter the type of words and language that are used for the matching algorithm. So you could say exclude words not in the dictionary. I wonder how that would affect my link with this other person.
That’s why I always use throwaway :) everywhere. Reddit. HN. Twitter. Everywhere. I’ll spam every site with my throwaways.
Long live throwaways.
That’s the point of this post: throwaways do not keep you safe at all, because all of your throwaways can be linked together purely by your textual style.
No they can’t. If you only have a small amount of text to work with, stylometry is unreliable.
> This means the optimal strategy would be discarding an account either after every comment or after a small number of comments. Unfortunately, this is against HN rules and may result in a ban.
Is it? I thought that it was ok to have throwaway accounts, as long as they're not specifically to avoid a ban or something like that.
I find this tool to be disturbing. It is reality so I accept it. But I'm going to make an effort to change my style between accounts.
A question for the author (costco): You created that account in 2019 but you didn't post or submit a single thing until 4 hours ago. Why did you create an account almost 3 years ago for no purpose?
Alone out here in the 0.30s. The three times I've used a throwaway account, it's been for a single post on a single topic, so no surprise they did not get picked up by this analysis, I guess.
Does a low correlation with other users imply higher susceptibility to de-anonymization if I were using alts regularly?
Probably. It means your writing is more distinctive, so an alt would be another "very unique" account that is similar only to yours.
There's someone (michaelmior if you're around!) with a false positive 0.46 match to me.
Maybe we could be friends :)
Not sure if that is a false positive. It just lists the top 20 accounts ranked by similarity score. Under 0.8 or so is unlikely to be a 'positive'.
This needs to exclude the "Who's hiring" posts because it confuses me with a few of my wonderful former colleagues!
Well, the only solution is to have so many alts that nobody can believe you could possibly have that many
Wow. This is insane, it found my old accounts. So throwaway obviously (because I'm a bit of an asshole) but this really is amazing. It also highlighted another account that's not me, but looking through their comments I don't see any resemblance to me either.
I've complained a lot about Haskell and now it thinks I like Haskell =(
Needs sentiment analysis IMO, otherwise you'll get "Here's a bunch of people who are JUST LIKE YOU", except they use a similar grammar style but hold opposite opinions on the same nouns.
It just thinks you engage a lot with Haskell. These are people with whom you have something to talk about. :)
Serves you right for disparaging The One True Language!
Ok, fine, we'll present Idris with a fig leaf.
I have two accounts. This one, “soneca”, which is my first one and the most active by far, and another one that I use sometimes, mostly for Show HN and a few comments.
When I searched the other one, “soneca” was the first guess, with 0.4.
But when I searched “soneca”, the other one was not in the top 20.
Those interested in the implications of this kind of analysis might enjoy the book The Secret Life of Pronouns http://secretlifeofpronouns.com/
Thank you for this. I thought I was being careful, but evidently it's not enough. It found 13 of my previous accounts, with the topmost at 0.4937 and the lowest bold one at 0.3616. All the bold ones were right; some correct matches weren’t bold.
Seems pretty spot-on to me. I tried it with two accounts I was already certain were alts - based on other factors like favorite topics and common enemies as well as style/tone - and the top hits for both were the ones I would have expected.
Very interesting, .59 is my lowest, .64 is my highest match, none of these accounts are one of my alts. Though to be fair the handful of times I've used a throwaway I used it for a single comment so I didn't give it much to go off.
Anything like this for Reddit?
Would translating to another language and back defend against this algorithm?
> Anything like this for Reddit?
No but it would be easily adaptable especially given that Pushshift is archiving every Reddit comment. Based on some of the feedback I'm getting here I don't know if I should open source this even though it really wasn't that hard to make.
> Would translating to another language and back defend against this algorithm?
Yes. But then you have to send your original comment to a translation company so there are privacy concerns there too.
> Based on some of the feedback I'm getting here I don't know if I should open source this even though it really wasn't that hard to make.
I'd say you should. I'd rather see this as being publicly and freely available to everyone rather than some shady "Big Tech" analytics company.
If the "weapons" exist, I would feel more comfortable knowing everyone can access them, not just an elite that can use it for their own (selfish) purposes.
I wouldn't worry about that too much as someone's already done something similar for reddit (https://towardsdatascience.com/using-nlp-to-identify-reddito...), and has released their code publicly (https://github.com/jabraunlin/reddit-user-id)
Given the technique used, I don't see why something simple and local wouldn't defeat it? The "easiest" technique would be to use this weighting as a negative metric in rewriting.
> But then you have to send your original comment to a translation company so there are privacy concerns there too.
There are modern offline translation systems available such as Bergamot https://browser.mt/
Trailing (and probably leading, didn't check) spaces confuse the user lookup.
I wonder how much this can be improved if metadata is taken into account as well. Especially the distribution of common post dates and times modulo a week, which also exposes in which timezone somebody probably lives.
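To make the "post times modulo a week" idea concrete, here is a rough sketch. It assumes you already have Unix timestamps for each account's comments. The function names and the histogram-intersection score are my own choices for illustration, not anything the site does.

```python
from collections import Counter
from datetime import datetime, timezone


def weekly_profile(unix_timestamps):
    """Fraction of posts falling in each (weekday, hour) slot, measured in UTC."""
    slots = Counter()
    for ts in unix_timestamps:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        slots[(t.weekday(), t.hour)] += 1
    total = sum(slots.values())
    if total == 0:
        return {}
    return {slot: n / total for slot, n in slots.items()}


def profile_overlap(p, q):
    """Histogram intersection: 1.0 for identical profiles, 0.0 for disjoint ones."""
    return sum(min(p.get(slot, 0.0), q.get(slot, 0.0)) for slot in set(p) | set(q))


# Hypothetical timestamps: two posts exactly one week apart share a slot.
base = 1_600_000_000
main = weekly_profile([base, base + 7 * 86400])
other = weekly_profile([base + 3600])
print(profile_overlap(main, main))
print(profile_overlap(main, other))
```

The busiest hours of such a profile also hint at a timezone, since most people post little while asleep.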
On one hand, thank you for showing us all how easy it is to make something like this. No doubt organizations with more resources already have more sophisticated systems in the same vein.
On the other hand, can we agree that this product is unethical?
In many cases, when a person uses an alt, it is a direct and strong signal that they do not wish their other posts to be associated.
So this product is circumventing the explicit will of the person, and making it available to anyone with zero effort i.e. there is no barrier to getting this info.
I met someone about 10 years ago who said they built this at a university. And their argument also was "actually this enhances privacy because it lets you know something something something". And yet their research grants were coming from one source only.
It can be used for good, but most often it won't.
<< On the other hand, can we agree that this product is unethical?
It does create a high level of discomfort, because it illustrates well what privacy advocates try talking about to the population at large, but all that said.. how is it any different from regular scraping and analyzing it any other way?
This is a real question.
It's different because you're removing all barriers to access and making it easy and convenient to stalk/dox people.
Imagine you get the urge to track someone, but in order to do that you have to spend a week writing some new software. That's a barrier. And because of it you may change your mind because it's a lot of work with little payoff.
But if that info is just one click away, it's a whole different ballgame.
> On the other hand, can we agree that this product is unethical?
No.
A fun exercise would be to find all accounts that suddenly stopped posting around today and correlate them with new accounts created around today.
All those scared folks who naively think that it's not too late yet. Busted.
502 Bad Gateway
Apologies for the downtime. It is up again and I'm looking into why uwsgi crashed.
The asymmetry is interesting. I have no alts but of course it nonetheless reported accounts similar to mine.
Then running the most similar person to my account did not put me in their top 20.
I believe this is the https://en.wikipedia.org/wiki/Friendship_paradox
Very cool.
I’m guessing that a small corpus for a given account doesn’t produce a very good score? I’ve done throwaways a couple times in the past and this has not “outed” them.
I've only had one account here. The highest match has a 0.624 score and the lowest a 0.572. I'm not sure if that means I'm unique or common but I'd like to know.
One way to get around this legitimately would be by posting a lot of quotes/lyrics/excerpts and the like thus fooling the algorithm unless it had a way to filter them out
This has been a great way to find people whose commentary I enjoy!
We knew this was possible and was coming, and it has probably been around for a few years. Fascinating from a technology perspective, terrifying from a long-term privacy perspective.
It's moments like this I'm proud to have my insanity on full display without obscurity. Was surprised to see a bunch of ~30% matches despite not having any alts.
My runner-up has a rating of 0.42378790667730715
C'mon guys, work harder. That's not even close! :-D
Btw, I myself am only at 0.9999999999999999 so I guess I need to work harder at being myself.
I tried it on a few user-ids that I strongly suspected were owned by the same person. My hunches stand corroborated. Not sure who is corroborating whom though, me or the script.
Good job.
Oddly, I am not an exact match to myself.
> Most likely candidates:
skymarshal: 0.9999999999999997
The other few usernames I tested (pg, dang, some random ones from this thread) all matched themselves at 1.0.
I had a hard time understanding some comments made by my closest match. I guess this is a good reality check. I need to learn how to write more legible posts now.
Sorry, what did you mean? :P
It didn't find my alt, but the second match is one of my twitter mutuals - I wonder if we've inadvertently borrowed style quirks from each other.
I wasn't aware this was even a thing! Scary stuff. 2 alts are listed but not with any great accuracy, so easy to dismiss. What an interesting topic.
Does anyone here have a reasonably wide variety of similarity ratings? I'd love to see the difference between a 0.2 and a 0.8 for the same account.
Interesting; I must have a fairly unique style as there are no matches over 0.40 for me.
I’m a native English speaker as well, so I’m unsure how to feel about that.
> I made this site mostly to show how easy this is and how it can erode online privacy
looks like it can indeed
> Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)
How surprising that someone might object to being included in a demonstration of the erosion of privacy!
Is the site opt-in or opt-out?
I doubt they asked 78k users for permission when there's no standardized way of reaching out if you're not a site admin. It's opt out if anything.
You opt into making your writing publicly available when making posts on this site. I’m not sure what Ycombinator’s user agreement* says about this, but it is pretty obvious that they haven’t done anything to prevent it (and it isn’t clear what they could do).
* and I mean the author of the tool is here making posts, so I guess they have agreed to the TOS, but clearly someone who hasn’t agreed to it could also make this tool and scrape publicly available posts without agreeing to anything.
Is it weird that my rating is very low compared to alternative options? I have no alts, but I'm curious how similar others might write to me.
What is the threshold to be reasonably confident that two accounts are from the same individual?
I have only ever had one account here and the closest match is at 0.47.
ive had maybe a hundred throwaway accounts on HN over the past ten years. generally, i make an account, say something that is apparently wildly offensive to someone else, get flagged and down-voted and then muted or hell-banned. then i make another account because i never did anything wrong and start the process over again. ive emailed the admins, tried to reason with the admins, it never does any good. the power is held by power-users who flag people -- most of the power of an admin at the end of the day but without any of the accountability. as long as they are following the mainstream dogma, its all good.
anyway, this app was able to identify a lot of my accounts. but a lot of the matches werent me. bold matches were almost all me. but i know there are many more matches than those that were listed. it mainly showed my most recent accounts.
i think most people would get a sick feeling in their stomach if they tried this app. i dont think people are prepared for a world where you can type someones name into an app like this and produce everything ever recorded online that was created by that person. not only this but everything highlighted and summarized to answer any question about that person. this is what advanced ai will bring us. an information implosion where the planet-sized ocean of data that is just floating all around us suddenly and violently coalesces into the objects of our new societal calculus. violent is a good word. and this is just the change that one can see coming with ai.
You are definitely right. Part of the reason I chose the 10,000 character minimum was so that people using throwaways in the true sense would be entirely excluded. I don't plan on keeping this up forever and I too would not feel comfortable if this was deployed at scale.
Would you be open to open sourcing the code when you decide to shut down the service?
You really don't need advanced AI to do it. Just a bunch of scrapers and some run of the mill statistics. And guess what, it's been done by many companies already. They just don't care to create such a site.
you have no idea what im talking about. you dont realize how much data is out there. you dont comprehend how much smarter than you something can be.
pretty cool- i think there should be a term for two accounts that have each other as the top most similar account. kinda sad i dont have one :(
We’re pretty close me and you — closer than my actual alts
hello friend! but... id never use an m dash
Stylotwins?
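Finding such mutual top matches is easy to sketch once you have pairwise scores. The scores below are invented, and top_match and mutual_top_pairs are hypothetical helper names. The example also shows why "A's closest match is B" does not imply the reverse: B may simply have someone closer.

```python
def top_match(user, scores):
    """Return the account with the highest similarity to `user`."""
    others = scores[user]
    return max(others, key=others.get)


def mutual_top_pairs(scores):
    """Pairs of accounts that are each other's single best match."""
    top = {user: top_match(user, scores) for user in scores if scores[user]}
    return sorted({tuple(sorted((u, v))) for u, v in top.items() if top.get(v) == u})


# Invented, symmetric similarity scores between three accounts.
scores = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.6, "c": 0.7},
    "c": {"a": 0.4, "b": 0.7},
}

print(top_match("a", scores))    # a's best match is b ...
print(top_match("b", scores))    # ... but b's best match is c
print(mutual_top_pairs(scores))  # only b and c are each other's top match
```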
Make a fundraiser and start doing it for other sites.
It would be possible for Reddit because Pushshift.io archives all the comments there and Reddit is still pretty small. I'd probably need to make things a lot faster. Doing it on a specific subreddit would be very feasible. I'll think about it but I don't actually know if I really want to do that because for instance I've been banned from subreddits before but I don't want a ban from when I was 12 years old to follow me around forever because my writing style hasn't changed. Moderation is the most obvious application of this kind of software.
> I'll think about it but I don't actually know if I really want to do that because for instance I've been banned from subreddits before but I don't want a ban from when I was 12 years old to follow me around forever
Insightful that your personal experience and impact on you personally affects your decision. I invite you to think about the impact of the products you build in your CS career by putting yourself in the shoes of other people as well.
Some products should not be built, even though it's easy to build them.
Clicking on my top match (0.61) - I can see the similarity. I also note they quote the same way, with a > symbol. I wonder if that helps!
Inserting random Unicode blank, quarter-width, half-width, or zero-width space characters into your writing may help thwart it too, if you are paranoid
Would thwart this tool, presumably, but not anything which considered spacing ("do they use double space after a sentence?") and punctuation, etc., as markers.
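For what it's worth, the invisible-character trick is also cheap to undo before analysis. A minimal sketch, assuming the analyst normalizes text first (normalize_spacing and the character list are my own illustration, and the list of zero-width characters is not exhaustive):

```python
import unicodedata

# Zero-width characters to delete outright (format characters, category Cf).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize_spacing(text):
    """Drop zero-width characters and map every Unicode space separator to ' '."""
    text = text.translate(ZERO_WIDTH)
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


# Zero-width space, four-per-em space and thin space hidden in ordinary text.
disguised = "sty\u200blo\u2005metry\u2009is  fun"
print(normalize_spacing(disguised) == "stylo metry is  fun")
```

Runs of spaces are deliberately left alone here, so a habit like double spacing after a sentence would still be visible as a marker.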
Huh, that’s how I signal my KGB handler…
Very cool! And really a shame that you’re not allowed to delete an old alt account or comments on HN! It follows you forever apparently.
All false positives for me - I want to reach out to the accounts that talk similar to me and see if we make good friends
Maybe this is a good tool to find new friends. :P
How do you protect yourself from impersonators?
So what are some good tools to obfuscate style?
It found my “alternate” account. If someone puts my username in, it’s not hard to figure out which alternate is mine.
No alt, and the highest match is 0.36
And that account's last several comments were flagged as dead.
I'm a native speaker, but my english succcccks.
A funny thing would be to find the most unique user account stylistically.
Which user has the lowest best match?
Mine is 0.58 so I'm really not that unique.
Fractionally more unique with a best match of 0.547.
would probably work better with case and punctuation preserving n-grams, sentence length, paragraph length and use of whitespace stats.
also maybe a tf-idf vector of top n words per user.
also could maybe do a same phrase analysis across the corpus to find some hand picked features.
timestamps could be interesting.
or, of course, let the machine do it with comment2vec.
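A couple of these suggestions are small enough to sketch with the standard library alone. This is only an illustration of case- and punctuation-preserving character n-grams plus sentence and paragraph length; the function names and the naive sentence splitter are assumptions, and none of it is taken from the site.

```python
import re
from collections import Counter
from statistics import mean


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, keeping case and punctuation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def length_stats(text):
    """Average sentence length (in words) and paragraph length (in characters)."""
    # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {
        "mean_sentence_words": mean(len(s.split()) for s in sentences) if sentences else 0.0,
        "mean_paragraph_chars": mean(len(p) for p in paragraphs) if paragraphs else 0.0,
    }


sample = "Well, no. It depends!\n\nAlso: I'd never use an em dash."
print(char_ngrams(sample).most_common(3))
print(length_stats(sample))
```

The n-gram counts can be compared between accounts the same way word counts are, for example with cosine similarity.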
I was curious to use this on myself to see if anyone writes like me. Closest was a .51 confidence, so I guess not?
This is cool!
If an account returns a high score for many accounts, does that also mean they’re relatively less original in style?
It puts almost all of my old accounts decently near the top, but my original account is almost comically low.
Cool! I wonder if it could be run backwards, to identify the users on hackernews with the most unique voices.
This is creepy.
I think the word you are looking for is uncanny
My alt accounts (not really, all below 0.5) seem to also be European or German Firefox users. Good for us ;)
Obviously the next thing to do is make this a popup on someone's account name when you hover over it.
This is super impressive!
Is there a common open source library (Python, JS, whatever) that implements something like this?
> imagine what a company with millions of dollars and a couple dozen PhD linguists could do.
Could they do much better?
How much writing do you need to analyze results? Would changing account every X sentences eliminate this?
Current minimum is 10000 characters. In my own tests accuracy was still pretty good at 3-5000 but I instituted the 10000 minimum to reduce false positives. Yes it would, if you read the advice page on avoiding detection that is one of the things I recommend. Unfortunately HN moderators do not really like that.
I have no alts, but to those of you compared to me by this engine : "Hey, good lookin'!"
wow, this is way off on me, didn't find my alts and the bolded accounts on my list are from different countries, use language I'd never use (cusses) and I see I've downvoted some of them...
I'd love to have the experience and or apparent wealth my "alts" have
This is great.
One funny thing though, while your example says 1.0, for my own account it says 0.99lotsof9s4
I like the way some usernames are only 0.9999999 correlated with themselves.
Perhaps 6 or 7 digits is enough?
This found an old account that I forgot I even had but with a lot of false positives. Neat!
I have no alternate accounts, and all my matches are below 0.4 for whatever it’s worth.
Interesting, but it gave me 20 accounts, and I know that I only have this one.
Sorry for any misunderstanding, read https://news.ycombinator.com/item?id=33756725
Sounds like a nice tool to find friends. You locate people who might think like you.
Strip leading/trailing white space from the name if it says no match.
I would have expected to be a closer match to myself.
> uberduper: 0.9999999999999991
Well, one of the closest on my list is my twin, so there's that.
Love a little NLP project on a public dataset - thanks for sharing!
Would this work for Fernando Pessoa and all his heteronyms? :)
I’d like to request the author takes this offline please until the implications can be thought through.
This is breaking anonymity that people incorrectly thought would not be revealed.
For some it might be awkward, others it might be quite problematic.
I would agree with you but the genie is out of the bottle already. Nigh everyone can and could have reproduced these results, especially given that archive.org and similar things exist.
So, I don’t think it causes any new harm, if anything it gives you future risk aversion.
This is nothing new, e.g:
Analyzing stylistic similarity amongst authors
http://markallenthornton.com/blog/stylistic-similarity/
37 points by lingben on Aug 12, 2015
This is not complex and is a well known method that state actors have been using for quite a long time. Governments have FAR more advanced ways to track you than this, but it's good for people to realize it exists.
Found my phone account; I'm quite impressed, really !
Haha, you got me and my main account. That's spooky.
I'm tempted to use it to find likeminded friends :)
This could be a good idea for identifying bots.
Not sure if GPT3, at least if prompted right, would have a clearly identifiable style. Could probably detect converted call centers in Russia or Cambodia where 50 employees post on 10000 accounts though.
At what threshold does it consider an account an alt?
There is no threshold. This site does not make any call as to whether a user is an alt or not. It just gives the users with the most similar word choice and from there it is up to you to decide (is there a very specific detail that both accounts mention, do they post at similar times, etc). I will say bolded accounts are substantially more likely to be alts though. But obviously it is not guaranteed that every user has an alt.
Joke's on you, this is my one and only account.
Are short sentences better for anonymity?
Well, interesting. This is one of the reasons we have the GDPR. @costco, if I were to make a GDPR erasure request, would you service it?
And I'm no lawyer, but it seems like there's also an outside chance of a breach of section 171 of the UK Data Protection Act 2018 here as well, which is a criminal offence committed by a person who reidentifies de-identified data.
Plus - the laws have extraterritoriality. Vanishingly unlikely that you'd actually be pursued for it, but it's worth bearing in mind when you munge people's personal data.
It's an EU law.
With extraterritoriality. And if identifying people in this way is covered (I'm not a lawyer, I'm not claiming it definitely is), then it's also possible that EU citizens using the tool are committing a criminal offence.
The law seems to only apply where the deidentification has been made by the data controller, but HN admins changing someone's username, for example, if they ever do, would count. A person then using the tool to match another non-anonymous username to that account would seem to be caught.
Important to stress how much of a technicality this is, but that sort of thing can be interesting sometimes.
Wow... that's shockingly effective
Welp, so much commenting for me then.
Site seems to have been down when you commented this. If you want to try again it is up again :)
What's a high correlation number?
Are you going to try it on Twitter?
Now I can find my HN doppelganger
heh, I looked up the top bold hit for my name and they really do sound a bit like me (:
writing from throwaway:
Holy shit, it works really, really good. It found all of my older accounts.
What algorithm is being used?
It's described here: https://stylometry.net/about
I changed my nickname so my employer can't find me here. I'm not amused by this.
If this basic implementation can catch you, I’d consider it a friendly reminder that changing your account name is not a very effective means of adding privacy.
New account, then translate your comments to Spanish and then back to English using Google translate.
The website is down...
Apologies for the downtime. It is up again and I'm looking into why uwsgi crashed.
Now do one for reddit
why is my username not exactly equal to 1? https://stylometry.net/user?username=julienreszka
Python/floating point rounding error. It doesn't mean anything.
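A rough illustration of how a self-match can land a hair away from 1.0. This assumes the score is a cosine-style dot product of normalized vectors, which the thread suggests but does not confirm; self_similarity and the sample counts are hypothetical.

```python
import math


def self_similarity(counts):
    """Normalize a vector to unit length, then take its dot product with itself."""
    norm = math.sqrt(sum(c * c for c in counts))
    unit = [c / norm for c in counts]
    return sum(u * u for u in unit)


score = self_similarity([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

print(score)                                   # may differ from 1.0 in the last digits
print(math.isclose(score, 1.0, rel_tol=1e-9))  # but it is 1.0 for any practical purpose
print(round(score, 5))                         # rounding for display hides the noise
```

Each division and addition rounds to the nearest representable double, so whether the total comes out as exactly 1.0 depends on the particular numbers involved, which would explain why some usernames hit 1.0 and others do not.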
does it use the most used words or least used?
Possibility to hide user comments in profile should be optional.
didn't find a single one of my alts. nice
I obviously don't expect you to help me, but do they have at least 10000 characters written, and are you varying your writing style in any way?
Of the top ten accounts listed for my name two of them are me.
nice one. are you using gpt3 under the hood?
I'm not that smart - my site is basically just doing some calculations on word frequencies. You can read https://news.ycombinator.com/item?id=33755898 for more information.
As you mention on the site, you don't do punctuation. But I'm guessing there are some pretty good fingerprints like:
two spaces after a period
Whether someone uses an em-dash/single hyphen/double hyphens (which may correspond to house style they're used to)
Whether they use semi-colons
(Presumably harder) but consistent substitutions like loose for lose, break for brake, etc.
Use of accents
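Most of the markers in that list are easy to count. A minimal sketch, assuming simple per-word rates are good enough (punctuation_fingerprint and the feature names are hypothetical, and consistent misspellings like loose/lose are left out because they need a word list):

```python
import re
import unicodedata


def punctuation_fingerprint(text):
    """Rate of a few punctuation habits per whitespace-separated word."""
    words = max(len(text.split()), 1)
    counts = {
        "double_space_after_period": len(re.findall(r"\.  (?=\S)", text)),
        "em_dash": text.count("\u2014"),
        "double_hyphen": text.count("--"),
        "spaced_hyphen": text.count(" - "),
        "semicolon": text.count(";"),
        # Decompose accented letters and count the combining marks.
        "accented_letters": sum(
            1 for ch in unicodedata.normalize("NFD", text) if unicodedata.combining(ch)
        ),
    }
    return {name: n / words for name, n in counts.items()}


fp = punctuation_fingerprint("I tried; it failed.  Then -- oddly -- the caf\u00e9 reopened.")
for name, rate in fp.items():
    print(f"{name}: {rate:.3f}")
```

These rates could be appended to a word-frequency vector, though how much weight to give them is an open question.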
Simplicity is the greatest form of sophistication! Great work!
One small nit from a user experience point of view: it'd be easier on the eyes if you just truncated those cosine similarity scores (or whatever score you're using) after, say, the 5th digit. Showing the entire float is kinda messy to my eyes.
Don’t sell yourself short. Simplicity is smart. It’s astonishing how often the simplest thing turns out to be exponentially more effective than the so-called smart thing.
I can’t get over how phenomenal this is. Please put every one of your side project ideas into production!
I am curious whether it could pick GPT3 out of the crowd.
It's easy to write complicated systems; it takes a genius to make them simple.
cool and thanks for the clarification. i ask that mainly because of the request limit of openai, which is something that makes many scalable ideas unfeasible
we leave fingerprints everywhere
ColinWright is Dang?
Woah
totally on spot
my current and my old account
w
Wow... how !
Over in the D language forums, we welcome people who post under a pseudonym, and our policy is we won't allow attempts to unmask them.
This is to protect high profile users who are secretly enjoying programming in D rather than the language they are supposed to use.
And, of course, to protect users who feel they might be discriminated against if their background was known.
It's very important for those people to be aware of these style analysis attacks! Glad this post is raising awareness.
What's up with cluster of users like:
j_s,password4321,carolinew,colinwright,kuharich etc.
https://stylometry.net/user?username=j_s https://stylometry.net/user?username=carolinew https://stylometry.net/user?username=colinwright https://stylometry.net/user?username=password4321
Lowest match for j_s is 0.80 and all but one is black.
On a cursory glance it looks like a cluster of users that post links, especially with italicized quoted excerpts.
Most likely candidates:
Not to diminish one bit how you're feeling, but the bright side is: Today you know this is easily done (information you didn't have yesterday), that the creator had no intention of "outing" you specifically, and that you can take steps to obfuscate this specific aspect of your posts that connects your public alts.
If you want to ask HN to remove your data, send a message to hn@ycombinator.com.
Yes, sadly. In this case, it'd be an arsehole move, but good point.
Not today.
You fail, I win.
Nice. Just out of curiosity are you taking any countermeasures or varying your writing style across accounts in any way?
My second closest match was 0.35, but looking at people who have matches of 0.5-0.75, I suspect that's mostly to do with the number of posts leading to better statistics.
yeah I vary my writing styles. Much of the stuff I post through this account is controversial, to say the least. So I have to take "measures".