Comment by pg
13 years ago
One thing I learned from spam filtering is not to underestimate what a statistical filter can find.
For example, the comment in question contains a url. I could easily imagine that turning out to be a valuable predictor. The defining quality of the middlebrow dismissal is that it's a cache dump of the writer's prejudices, and someone doing that doesn't even take the time to think, let alone look up urls; they're not even really writing to inform.
atpassos_ml told me an even cooler heuristic.
He was doing some research on detecting which comments are authoritative from textual analysis alone (no username or social analysis). He and his coauthors built a complicated topic model, but found that the following heuristic is almost as good at automatically detecting authority in comments:
Favor the person with the broadest vocabulary compared to other people in the thread.
This was evaluated on Yelp and Goodreads. IIRC it may have also been tested on HN data.
(reference: Alexandre Passos, Jacques Wainer, Aria Haghighi, What do you know? A topic-model approach to authority identification.)
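For anyone who wants to play with the heuristic, here is a minimal Python sketch. It is not the method from the paper, just one naive reading of "broadest vocabulary": count the distinct words each author uses in a thread and rank by that count. The function names, the tokenizer regex, and the toy thread are all made up for illustration.

```python
import re
from collections import defaultdict


def vocabulary_sizes(comments):
    """Map each author to the number of distinct words they used in the thread.

    `comments` is an iterable of (author, text) pairs. The tokenizer is a
    crude assumption: lowercase runs of letters and apostrophes.
    """
    vocab = defaultdict(set)
    for author, text in comments:
        vocab[author].update(re.findall(r"[a-z']+", text.lower()))
    return {author: len(words) for author, words in vocab.items()}


def rank_by_vocabulary(comments):
    """Return authors sorted by distinct-word count, largest first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    sizes = vocabulary_sizes(comments)
    return sorted(sizes, key=lambda author: (-sizes[author], author))


# Toy thread, invented for this example.
thread = [
    ("alice", "The filter tokenizes headers and weights rare tokens heavily."),
    ("bob", "This is bad. This is really bad."),
    ("alice", "Rare tokens carry the most evidence."),
]

print(rank_by_vocabulary(thread))
```

Note that a raw distinct-word count also rewards people who simply write more, so a real version would probably need some normalization for comment length.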
The problem with disclosing these heuristics as part of your filtering algorithm is that people will try to game the system. They'll include URLs and expanded vocabulary to get higher-ranked comments. And then we win. (Relevant: http://xkcd.com/810/)
URLs and expansive vocabulary? That perfectly describes most of the spam I get: a Viagra store link and a random Hemingway excerpt.
Sure, but spam is already reasonably under control here; post-spam-filter, those things could be predictive, along with other things nobody thinks of.
Funny, I've been mentally building a model the last few days and thinking that URLs that are frequently referenced in comments are likely to be good indicators of middlebrow dismissals. Thinking specifically of links to the various "laws" of internet discussion, logical fallacies, and so on.
But also potentially looking at things like the urls that get linked into regular discussions such as weight loss or political topics. That might go beyond the immediate scope, but comments with those links might have other factors similar to middlebrow dismissals, so they might be worth building into the model.
It would be interesting to try to catch something like this post, too. A link might be an anti-indicator if it adds value, but then you have to figure out the value of the linked content. Maybe compare it to the content of the original link: you'd want it to be similar, but not too similar. Yeah, that might get involved.
Intriguing. I'd be interested in whether the presence of the phrase "at the end of the day" correlates strongly with low comment quality.
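One rough way to check that, assuming you already had comments hand-labeled as low quality or not, would be the phi coefficient between "contains the phrase" and "is low quality". The sketch below is Python; the function name and the labeled sample are hypothetical, and the labels are invented purely to exercise the code.

```python
import math


def phi_coefficient(comments, phrase):
    """Phi coefficient between phrase presence and a low-quality label.

    `comments` is a list of (text, is_low_quality) pairs. The result is in
    [-1, 1]; 0.0 is returned when the 2x2 table has an empty row or column.
    """
    a = b = c = d = 0  # a: phrase+low, b: phrase+ok, c: none+low, d: none+ok
    needle = phrase.lower()
    for text, is_low_quality in comments:
        has_phrase = needle in text.lower()
        if has_phrase and is_low_quality:
            a += 1
        elif has_phrase:
            b += 1
        elif is_low_quality:
            c += 1
        else:
            d += 1
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Invented labels, for illustration only.
sample = [
    ("At the end of the day, it's just a CRUD app.", True),
    ("At the end of the day nobody cares.", True),
    ("Here are the benchmark numbers from my run.", False),
    ("The paper evaluates this on two datasets.", False),
]

print(phi_coefficient(sample, "at the end of the day"))
```

With real data you would also want enough examples of the phrase for the number to mean anything, since a handful of comments can easily produce a strong correlation by chance.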