Comment by a-dub
3 years ago
would probably work better with case- and punctuation-preserving n-grams, sentence length, paragraph length, and whitespace-usage stats.
also maybe a tf-idf vector of top n words per user.
could also maybe do a same-phrase analysis across the corpus to find some hand-picked features.
timestamps could be interesting.
or, of course, let the machine do it with comment2vec.
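
a rough sketch of the first two ideas, stdlib only. the function names (`char_ngrams`, `style_stats`, `tfidf_top_words`) and the particular stats chosen are assumptions made up for illustration, not anything from the original project, and the idf smoothing is just one common choice:

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams, keeping case and punctuation as-is."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def style_stats(text):
    """Sentence length, paragraph length and whitespace stats for one text."""
    # Paragraphs are separated by blank lines; sentences end in . ! or ?
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = text.split()
    return {
        "mean_sentence_words": len(words) / max(len(sentences), 1),
        "mean_paragraph_words": len(words) / max(len(paragraphs), 1),
        "whitespace_ratio": sum(c.isspace() for c in text) / max(len(text), 1),
        # Habit marker: two or more spaces after a period.
        "double_space_after_period": len(re.findall(r"\.  +\S", text)),
    }


def tfidf_top_words(user_docs, top_n=5):
    """Map each user to their top_n words by tf-idf.

    user_docs maps user -> all of that user's comments joined into one string.
    Uses a smoothed idf, log((1 + N) / (1 + df)) + 1 (an assumption; other
    weightings are equally reasonable).
    """
    tokenized = {u: re.findall(r"[a-z']+", d.lower()) for u, d in user_docs.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_users = len(tokenized)

    result = {}
    for user, tokens in tokenized.items():
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        scores = {
            w: (c / total) * (math.log((1 + n_users) / (1 + doc_freq[w])) + 1)
            for w, c in counts.items()
        }
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        result[user] = dict(best)
    return result
```

the per-user n-gram counts, style stats and tf-idf weights could then be concatenated into one feature vector per user and compared with something like cosine similarity.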