Comment by zipy124
3 days ago
A lot of people are commenting on the conclusion, but I'm surprised no one is commenting on the methodology. The distributions given by the models seem weird. The LLMs' are odd enough that I would just discount those and focus on the BERT models, but even then RoBERTa, for instance, seems to suggest there is NO positive sentiment, with only scores of 0.5 and above given. Then there is the axis plotting "ai_sentiment" against the classification, but it's not clear what "ai_sentiment" is, and it's never defined in the paper. It seems to basically just map to the DistilBERT score, apart from a few outliers?
Given that, it seems there is basically zero agreement between DistilBERT and the other models. In fact, it's worse: they disagree to the extreme, with what one model scores as the most positive being the most negative for another (even accounting for the inverted scale in results 2-6).
Fair point, `ai_sentiment` should have been defined explicitly. It's the production score from DistilBERT-base-uncased-finetuned-sst-2-english, the same model family as Cloudflare's sentiment classifier. That explains the r=0.98 correlation you noticed. And you're right that the models disagree. This isn't measurement error though. They learned different definitions of "sentiment" from their training data. DistilBERT was trained on movie reviews (SST-2), so it asks "is this evaluating something as good or bad?" BERT Multilingual averages tone across 104 languages, which dilutes sharp English critique. RoBERTa Twitter was trained on social media where positivity bias runs strong, hence the μ=0.76 you see.
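For reference, something like this minimal sketch reproduces the score. The signed mapping and the `pipeline` usage are my reconstruction of it, not the literal production code:

```python
from transformers import pipeline

# Reconstruction of ai_sentiment (not the literal production code):
# the same SST-2 DistilBERT checkpoint, with label and confidence
# collapsed into one signed score.
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def ai_sentiment(title: str) -> float:
    """Signed score in [-1, 1]: the classifier's confidence,
    negated when the predicted label is NEGATIVE."""
    out = clf(title)[0]  # {"label": "POSITIVE"|"NEGATIVE", "score": float}
    return out["score"] if out["label"] == "POSITIVE" else -out["score"]

print(ai_sentiment("Show HN: I cut our build times in half"))
```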
For HN titles, which tend to be evaluative and critical, I assumed DistilBERT's framing would fit better than the alternatives. But the disagreement between models actually shows that "sentiment" is task-dependent rather than some universal measure. I'll add a methodology section in the revision to clarify why this model was chosen.
Thanks for clearing that all up for me; I look forward to seeing the revision!
It would be interesting to see some of the comments where the models land on polar opposite sentiments: ones scored most positive by one model but most negative by another, to analyse the cases where they disagree most on their definition of sentiment.
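Something like this rough sketch would surface them. I'm assuming Hugging Face pipelines, and CardiffNLP's checkpoint as the "RoBERTa Twitter" model; the actual model IDs used in the paper may differ:

```python
from transformers import pipeline

# Score the same titles with both models and rank by disagreement.
# Model IDs are assumptions: the SST-2 DistilBERT named above, and
# CardiffNLP's Twitter RoBERTa as a stand-in for "RoBERTa Twitter".
distilbert = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
roberta = pipeline("sentiment-analysis",
                   model="cardiffnlp/twitter-roberta-base-sentiment-latest")

def signed(result: dict) -> float:
    # Collapse each model's label/score pair onto a shared [-1, 1] axis;
    # RoBERTa's "neutral" label maps to 0.
    label = result["label"].upper()
    if label in ("POSITIVE", "LABEL_2"):
        return result["score"]
    if label in ("NEGATIVE", "LABEL_0"):
        return -result["score"]
    return 0.0

titles = [
    "Show HN: I rewrote our scheduler and cut tail latency by 40%",
    "Ask HN: Why does everyone pretend microservices solved anything?",
]  # placeholder; swap in the paper's full set of HN titles

scored = [(t, signed(distilbert(t)[0]), signed(roberta(t)[0])) for t in titles]
scored.sort(key=lambda x: abs(x[1] - x[2]), reverse=True)
for title, d, r in scored[:10]:
    print(f"DistilBERT {d:+.2f}  RoBERTa {r:+.2f}  {title}")
```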