
Comment by a-dub

3 years ago

would probably work better with case- and punctuation-preserving n-grams, plus sentence length, paragraph length, and whitespace-usage stats.
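
a rough sketch of what those features might look like, stdlib only. the function names, the sentence-splitting regex, and the particular whitespace stats (`whitespace_ratio`, `double_space_after_period`) are my own assumptions for illustration, not anything from the original project:

```python
import re
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams with case and punctuation left untouched."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def style_stats(text):
    """Rough length and whitespace statistics for a single comment."""
    # Paragraphs: chunks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = text.split()
    return {
        "mean_sentence_words": len(words) / max(len(sentences), 1),
        "mean_paragraph_words": len(words) / max(len(paragraphs), 1),
        "whitespace_ratio": sum(c.isspace() for c in text) / max(len(text), 1),
        # Habit marker: a period followed by two or more spaces.
        "double_space_after_period": len(re.findall(r"\.  +\S", text)),
    }
```

the n-gram counts and the stats dict could then be concatenated into one feature vector per comment or per user.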

also maybe a tf-idf vector of top n words per user.
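
a minimal sketch of the tf-idf idea, assuming each user's comments are concatenated into one "document", tokens are lowercased whitespace-split words, and "top n" means the n highest-scoring words. `tfidf_per_user` and its parameters are made-up names, and a real setup would more likely use a library vectorizer:

```python
import math
from collections import Counter

def tfidf_per_user(comments_by_user, top_n=5):
    """Map each user to their top_n words by tf-idf score.

    comments_by_user: dict of user -> list of comment strings.
    """
    # One bag of words per user (assumption: a user is one document).
    counts = {
        user: Counter(" ".join(comments).lower().split())
        for user, comments in comments_by_user.items()
    }
    n_users = len(counts)
    # Document frequency: in how many users' vocabularies a word appears.
    doc_freq = Counter(word for c in counts.values() for word in c)

    vectors = {}
    for user, c in counts.items():
        total = sum(c.values())
        scores = {
            word: (k / total) * math.log(n_users / doc_freq[word])
            for word, k in c.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        vectors[user] = dict(top)
    return vectors
```

words that every user shares get an idf of zero with this unsmoothed formula, so only the vocabulary that separates users carries weight.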

could also do a shared-phrase analysis across the corpus to find some hand-picked features.
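
one way this could go, as a sketch: collect word n-grams, keep the ones that recur in several comments, then eyeball the list for candidates. the function name, the `(user, text)` input shape, and the thresholds are all assumptions for illustration:

```python
from collections import defaultdict

def repeated_phrases(comments, n=3, min_comments=2):
    """Find word n-grams that recur across comments.

    comments: list of (user, text) pairs.
    Returns {phrase: set of users who wrote it} for phrases that
    appear in at least min_comments distinct comments.
    """
    seen_in = defaultdict(set)  # phrase -> indices of comments containing it
    users = defaultdict(set)    # phrase -> users who wrote it
    for idx, (user, text) in enumerate(comments):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            seen_in[phrase].add(idx)
            users[phrase].add(user)
    return {
        phrase: users[phrase]
        for phrase, idxs in seen_in.items()
        if len(idxs) >= min_comments
    }
```

phrases that recur but map to only a few distinct users are the interesting ones to pick by hand.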

timestamps could be interesting.
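
for instance, a per-user hour-of-day activity profile. this sketch assumes timestamps are Unix seconds and buckets them in UTC; `hour_histogram` is a made-up name:

```python
from collections import Counter
from datetime import datetime, timezone

def hour_histogram(unix_times):
    """Return a 24-element list: the fraction of posts in each UTC hour."""
    hours = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t in unix_times
    )
    total = sum(hours.values())
    # Users with no posts get an all-zero profile.
    return [hours[h] / total if total else 0.0 for h in range(24)]
```

two accounts with very similar profiles would at least be active at the same times of day, which is a weak signal on its own but cheap to add.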

or, of course, let the machine do it with comment2vec.