
Comment by jll29

3 years ago

The method used, i.e. calculating the cosine of the two authors' word vectors, is poorly suited to stylometric analysis because it is based only on a poster's lexicon and the frequency of each word, while ignoring stylistically relevant factors like word order.
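To make the word-order point concrete, here is a minimal sketch of cosine similarity over raw word counts. It is not the demo's actual code: the helper names `word_vector` and `cosine` and the whitespace tokenization are my own assumptions for illustration.

```python
import math
from collections import Counter


def word_vector(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts.
    (Naive tokenization, assumed here only for illustration.)"""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Same words, different order (and a different meaning).
original = "the dog bit the man"
shuffled = "the man bit the dog"

similarity = cosine(word_vector(original), word_vector(shuffled))
print(similarity)  # approximately 1.0: the measure cannot see word order
```

Both sentences produce identical count vectors, so any reordering of a text is indistinguishable from the original under this measure.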

Also, the cosine of word-frequency vectors conflates author-specific vocabulary with topic; in other words, my account is grouped (with >51% similarity, according to the demo) with someone, probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).
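One common way to reduce topic sensitivity, which I am swapping in here as an illustration and which is not what the demo does, is to compare only function-word frequencies. A rough sketch follows; the word list `FUNCTION_WORDS` and the helper names are hypothetical, and a real feature set would be much larger.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny list of English function words.
FUNCTION_WORDS = sorted({
    "the", "a", "an", "of", "and", "but", "to", "in",
    "that", "it", "is", "i", "we", "with", "for", "on", "not",
})


def function_word_profile(text):
    """Relative frequency of each function word (naive whitespace tokens)."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two texts on unrelated topics with the same sentence skeleton.
text_a = "the compiler is fast and the linker is slow"
text_b = "the soup is hot and the bread is cold"

style_similarity = cosine(function_word_profile(text_a),
                          function_word_profile(text_b))
print(style_similarity)  # approximately 1.0 despite the topic change
```

The topic words (compiler, soup, ...) are simply dropped, so the score reflects only how the function words are distributed. Whether that is discriminative enough for short forum comments is a separate question.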

Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).

This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...

> based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.

Interesting. I was expecting to be grouped with other Russian speakers, and I am (judging by some nicknames). But I thought the most telling feature would be exactly word order, since it’s absolutely relaxed in Russian. Word frequencies? Well, probably the absence of articles, lol (but I swear to God that I often spend some extra time trying to insert as many articles into my texts as I can).

There’s https://en.wikipedia.org/wiki/Idiolect :

“Language consists of sentence constructs, choice of words, and expression of style. Accordingly, an idiolect is an individual's personal use of these facets. Every person has a unique idiolect influenced by their language, socioeconomic status, and geographical location.”

In practice, a more complex approach will tend to require more data per user, so in this specific case the simple approach is not too bad. Moreover, fake accounts are likely to talk about the same topics, so while this leads to false positives, it also makes it more likely that the list contains actual duplicates.