Comment by 7777777phil

2 days ago

Sent you a message to the email in your bio, but I thought I'd also leave this here: again, thank you for your comments. I went down a rabbit hole reading your essays Hacker News For Hackers and Classifying Hacker News Titles.

I've also been building an RSS reader with integrated HN success prediction – essentially trying to surface which articles from my feed are most likely to resonate on HN before I submit them. Your research directly informed several decisions, so I wanted to share what I've learned and ask for your insights.

V1 – DistilBERT baseline: Started with a fine-tuned DistilBERT on ~15k HN titles. Achieved ROC AUC of 0.77, which felt promising. Clean architecture, simple training loop.

V2 – Added complexity: Switched to a RoBERTa + TF-IDF ensemble, thinking more features = better. ROC AUC dropped to 0.70, precision suffered, and calibration was poor (ECE 0.11).

V3 – Stacking meta-learner: Added LightGBM to combine RoBERTa + TF-IDF + 30 engineered features (domain hit rates, author patterns, temporal encoding, word-level stats). The model was "cheating" by memorizing historical domain/author success rates from the training data:

- domain_hit_rate: 79.9% importance
- roberta_prob: only 18.6%

V4 – Return to simplicity: Pure RoBERTa, no stacking. Added isotonic calibration (your probability clustering problem, solved!). Current performance:

- ROC AUC: 0.692
- ECE: 0.009 (excellent calibration)
- Optimal threshold: 0.295 (not 0.5, exactly as you documented)
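Since the calibration step did most of the work in V4, here's a minimal sketch of what it looks like with scikit-learn (names like val_probs/val_labels are placeholders, not from my actual code):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # val_probs: raw RoBERTa sigmoid outputs on a held-out split
    # val_labels: 1 if the post cleared the success cutoff, else 0
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(val_probs, val_labels)
    calibrated = iso.predict(test_probs)  # well-spread probabilities

    def expected_calibration_error(probs, labels, n_bins=10):
        # Per-bin |empirical positive rate - mean predicted prob|,
        # weighted by the fraction of samples in each bin
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        bin_ids = np.digitize(probs, edges[1:-1])
        ece = 0.0
        for b in range(n_bins):
            mask = bin_ids == b
            if mask.any():
                ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
        return ece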

What worked for me:

- Isotonic calibration: Your observation that most predictions cluster below 0.2 was spot-on. Isotonic regression produces well-distributed, meaningful probabilities.
- Aggressive threshold lowering: At a ~10% base rate (posts hitting 100+ points), a 0.5 threshold catches almost nothing useful; see the sketch after this list.
- Pure transformer, no feature engineering: Contrary to intuition, adding TF-IDF and engineered features mostly added noise. The transformer handles the semantics well on its own.
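The threshold itself came from a simple sweep over calibrated validation probabilities, nothing clever. One way to do it (this sketch maximizes F1; any cost-sensitive criterion slots in the same way, and the split should be one not used to fit the calibrator):

    import numpy as np
    from sklearn.metrics import f1_score

    # calibrated_val / sweep_labels: held-out split, separate from the
    # calibration-fitting split, so the threshold isn't tuned on leaked data
    thresholds = np.arange(0.05, 0.95, 0.005)
    f1s = [f1_score(sweep_labels, calibrated_val >= t) for t in thresholds]
    best_t = float(thresholds[np.argmax(f1s)])  # ~0.3 in my runs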

What didn't work for me:

- Focal loss: Made the model too conservative (760 FN vs 219 FP); the form I mean is sketched below.
- Domain/author features: Feature leakage; they didn't generalize.
- Stacking: Added complexity without improving generalization.
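For anyone who wants to reproduce the focal loss result, the binary form is roughly this (PyTorch; the alpha/gamma shown are the common paper defaults, not tuned values):

    import torch
    import torch.nn.functional as F

    def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # FL = -alpha_t * (1 - p_t)^gamma * log(p_t); targets is a
        # float tensor of 0./1. labels
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # p_t = p if y == 1 else 1 - p
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        # (1 - p_t)^gamma shrinks the gradient on easy examples; with
        # alpha < 0.5 also down-weighting positives, a rare-positive
        # problem like this one can starve borderline positives of signal
        return (alpha_t * (1 - p_t) ** gamma * bce).mean()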

Your essay mentions achieving similar ROC AUC with logistic regression on bag-of-words features. A few things I'm curious about:

- Do you still maintain this system? Has your approach evolved since 2017?
- What was your experience with full-content vs title-only classification? I'm title-only currently, which has obvious limits.
- Any insights on the non-stationarity problem? Topic drift (Apple launches, security panics) seems like a persistent challenge.
- What made you choose logistic regression over neural approaches at the time? The simplicity seems to have served you well.