Comment by spdustin

3 years ago

Have you tried including parts of speech (for example, as bigrams and trigrams) among the features in your model? I’ve had great success with stylometry that goes beyond TF-IDF over bags of words; adding grammar patterns worked shockingly well.

(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)

Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as the tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
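
To make the POS-gram idea concrete, here is a minimal sketch, not anyone's actual model. It assumes a tiny hand-written tag lookup (`TOY_TAGS`, purely hypothetical) in place of a real part-of-speech tagger, and it uses a smoothed TF-IDF weighting and cosine comparison written out by hand. All function names are made up for illustration.

```python
import math
from collections import Counter

# ASSUMPTION: toy lookup standing in for a real POS tagger, illustration only.
TOY_TAGS = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN", "ball": "NOUN", "park": "NOUN",
    "chased": "VERB", "saw": "VERB", "likes": "VERB",
    "big": "ADJ", "red": "ADJ",
    "in": "ADP", "quickly": "ADV",
}

def pos_tags(text):
    """Replace each whitespace token with its POS label ('X' if unknown)."""
    return [TOY_TAGS.get(tok.lower().strip(".,!?"), "X") for tok in text.split()]

def pos_ngrams(text, sizes=(2, 3)):
    """Build POS bigrams and trigrams, e.g. 'DET_NOUN' or 'DET_NOUN_VERB'."""
    tags = pos_tags(text)
    grams = []
    for n in sizes:
        grams += ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    return grams

def tfidf_vectors(docs):
    """Return one sparse {pos_gram: weight} dict per document."""
    counts = [Counter(pos_ngrams(d)) for d in docs]
    doc_freq = Counter(g for c in counts for g in c)  # documents containing each gram
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vectors.append({
            # term frequency times smoothed inverse document frequency
            g: (k / total) * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, k in c.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    docs = [
        "The dog chased the ball",               # DET NOUN VERB DET NOUN
        "A cat saw a park",                      # same grammar, different words
        "Quickly in the park the big red dog",   # different construction
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two sentences share no content words except the articles, yet they produce identical POS-gram vectors, which is the point: the features capture construction habits rather than vocabulary. With real text you would swap the lookup for a proper tagger and feed the POS-gram strings into whatever TF-IDF pipeline you already have.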

> Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as the tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.

That is a very good idea, and when I update the site it will almost certainly be included :) Any other tips? I’ve been reading papers for ideas, and I think I may have to ditch cosine similarity and go for something fancier soon. Thank you.