Comment by bravura
13 years ago
atpassos_ml told me an even cooler heuristic.
He was doing some research on detecting which comments are authoritative based upon textual analysis (no username or social analysis). They made a complicated topic model, but found that the following heuristic is almost as good for automatically detecting authority in comments:
Favor the person with the broadest vocabulary compared to other people in the thread.
This was evaluated on Yelp and Goodreads. IIRC it may have also been tested on HN data.
(reference: Alexandre Passos, Jacques Wainer, Aria Haghighi, What do you know? A topic-model approach to authority identification.)
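For anyone who wants to try the vocabulary heuristic on a thread, here is a rough sketch. It is not the paper's method: the comment above doesn't say how "broadest vocabulary" was measured, so this assumes it means the number of distinct lowercase word types per author, tokenized with a simple regex. The function names and the sample thread are made up for illustration.

```python
import re
from collections import defaultdict


def vocabulary_sizes(comments):
    """Map each author to the number of distinct word types they used.

    `comments` is an iterable of (author, text) pairs from one thread.
    Assumption: a "word" is a run of letters or apostrophes, case-folded.
    """
    vocab = defaultdict(set)
    for author, text in comments:
        vocab[author].update(re.findall(r"[a-z']+", text.lower()))
    return {author: len(words) for author, words in vocab.items()}


def rank_by_vocabulary(comments):
    """Return authors ordered from broadest to narrowest vocabulary."""
    sizes = vocabulary_sizes(comments)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(sizes, key=lambda author: (-sizes[author], author))


# Hypothetical thread, for illustration only.
thread = [
    ("alice", "The topic model smooths sparse counts with a Dirichlet prior."),
    ("bob", "cool cool cool"),
    ("alice", "Sparse counts hurt rare words most."),
]

print(rank_by_vocabulary(thread))
```

A raw count like this favors whoever writes the most, so you would probably want to normalize by comment length before trusting it on real data.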
The problem with disclosing these heuristics as part of your filtering algorithm is that people will try to game the system. They'll include URLs and expanded vocabulary to get higher-ranked comments. And then we win. (Relevant: http://xkcd.com/810/)
URLs and expansive vocabulary? That perfectly describes most spam I get: a Viagra store link and a random Hemingway excerpt.
Sure, but spam is already reasonably under control here; post-spam-filter, those things could be predictive, along with other things nobody thinks of.