
Comment by 7777777phil

3 days ago

Fair point: `ai_sentiment` should have been defined explicitly. It's the production score from `distilbert-base-uncased-finetuned-sst-2-english`, the same model family as Cloudflare's sentiment classifier, which explains the r=0.98 correlation you noticed. And you're right that the models disagree, but this isn't measurement error; they learned different definitions of "sentiment" from their training data. DistilBERT was trained on movie reviews (SST-2), so it asks "is this evaluating something as good or bad?" BERT Multilingual averages tone across 104 languages, which dilutes sharp English critique. RoBERTa Twitter was trained on social media, where positivity bias runs strong, hence the μ=0.76 you see.
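
For reference, here's a minimal sketch of how the per-model scores can be produced with the HuggingFace `transformers` pipeline. Only the DistilBERT checkpoint is the one actually used in the analysis; the other two model IDs are my assumptions based on the descriptions above.

```python
# Sketch: score the same text with the three models discussed above.
# Only the DistilBERT checkpoint is confirmed; the other two IDs are assumed.
from transformers import pipeline

CHECKPOINTS = {
    "distilbert_sst2": "distilbert-base-uncased-finetuned-sst-2-english",
    "bert_multilingual": "nlptown/bert-base-multilingual-uncased-sentiment",  # assumed checkpoint
    "roberta_twitter": "cardiffnlp/twitter-roberta-base-sentiment-latest",    # assumed checkpoint
}

# Build one classifier per model so every title is scored the same way.
classifiers = {name: pipeline("sentiment-analysis", model=ckpt)
               for name, ckpt in CHECKPOINTS.items()}

def score_title(title: str) -> dict:
    """Return each model's raw label/score pair for a single HN title."""
    return {name: clf(title)[0] for name, clf in classifiers.items()}

# Sharp, evaluative titles are where the labels tend to diverge.
print(score_title("Why X is a terrible idea and what to use instead"))
```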

For HN titles, which tend to be evaluative and critical, I assumed DistilBERT's framing would fit better than the alternatives. But the disagreement between models actually shows that "sentiment" is task-dependent rather than some universal measure. I'll add a methodology section in the revision to clarify why this model was chosen.

Thanks for clearing that all up for me; I look forward to seeing the revision!

It would be interesting to see some of the comments where the models land on polar opposite sentiments: ones scored most positive by one model but most negative by another, to analyse the cases where their definitions of sentiment diverge the most.
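
Something like this could surface those cases. Just a sketch, assuming one row per comment with each model's score already mapped to a common [-1, 1] scale; the column names are hypothetical.

```python
# Sketch: rank comments by the largest gap between any pair of model scores.
# Assumes scores are pre-normalized to [-1, 1]; column names are hypothetical.
from itertools import combinations
import pandas as pd

MODEL_COLS = ["distilbert_sst2", "bert_multilingual", "roberta_twitter"]

def top_disagreements(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n comments where two models disagree the most."""
    gaps = []
    for a, b in combinations(MODEL_COLS, 2):
        gaps.append((df[a] - df[b]).abs().rename(f"{a}_vs_{b}"))
    out = df.copy()
    out["max_gap"] = pd.concat(gaps, axis=1).max(axis=1)
    return out.sort_values("max_gap", ascending=False).head(n)

# Usage: top_disagreements(scores_df)[["text"] + MODEL_COLS + ["max_gap"]]
```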