Comment by PaulHoule
4 days ago
Negative posts that I post tend to do better than neutral or positive ones. I have a classifier that judges titles on "most likely to get upvoted" for which "Richard Stallman is Dead" is the optimal title, and another that judges on "likely to have a comments/vote ratio > 0.5" [1]. The first one is a crummy model in terms of ROC, the second is pretty good and favors things that are clickbaity, about the battle of the sexes, and oddly, about cars.
But that 35 as an average score was hard for me to believe at first: the median HN post gets no votes, and last time I looked the mean was around 8 or so. What is he sampling from?
[1] comments/votes = 0.5 is close to the mean
Hi, appreciate your comment. The sampling is from all posts and comments over the past 35 days, accessed via the API (https://github.com/philippdubach/hn-archiver). There might be a skew toward sampling higher-voted posts first (i.e. during high-volume periods, posts and comments with zero upvotes may not make it into the database), which would explain the high ratio. I will definitely look into it before publishing the paper - this is exactly the feedback I was hoping for when publishing the preprint. Thanks for pointing this out! Would love to see the mentioned classifier. If you find the time, please reach out to the email on the page or on Bluesky.
This is factually incorrect. There’s no way that you are sampling ALL posts and comments because otherwise the average would not be 35 points. The vast majority of posts get no upvotes.
In addition, comments do not show the points accumulated so there’s no way you can know how many points a comment gets, only posts.
Thanks for the pushback - this is exactly the kind of peer review I was hoping for at the preprint stage. You are likely correct regarding the sampling bias. While the intent was to capture all posts, an average score of 35 suggests that my archiver missed a significant portion of the zero-vote posts (likely due to API rate limits on my workers or churn during high-volume periods). This created a survivorship bias toward popular posts in the current dataset, which I will explicitly address and correct.
To clarify on the second point: I am not analyzing individual comment scores (which, as you noted, are hidden). The metric refers to post points relative to comment growth/volume. I will be updating the methodology section to reflect these limitations. The full code and dataset will be open-sourced with the final publication so the sampling can be fully audited. Appreciate the rigor.
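For transparency, a minimal sketch of the exhaustive id-walk the archiver is supposed to guarantee against the official HN Firebase API, so that zero-upvote stories get counted too. This is illustrative only, not the archiver's actual code; the pacing, the demo id window, and the archive_range helper are placeholders.

    import time
    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    def fetch_item(item_id):
        # Returns the raw item dict, or None for deleted/missing ids.
        resp = requests.get(f"{BASE}/item/{item_id}.json", timeout=10)
        resp.raise_for_status()
        return resp.json()

    def archive_range(start_id, end_id, rows):
        # Walk every id in the range so zero-upvote stories are captured too,
        # instead of only the items that surface on the listing endpoints.
        for item_id in range(start_id, end_id + 1):
            item = fetch_item(item_id)
            if item and item.get("type") == "story":
                rows.append({"id": item_id, "score": item.get("score", 0)})
            time.sleep(0.1)  # crude rate limiting; tune to your quota

    if __name__ == "__main__":
        max_id = requests.get(f"{BASE}/maxitem.json", timeout=10).json()
        rows = []
        archive_range(max_id - 200, max_id, rows)  # small demo window
        # Stories start at 1 point, so score <= 1 means no upvotes beyond the submitter.
        print(sum(r["score"] <= 1 for r in rows), "of", len(rows), "stories have no upvotes")

If the per-id walk is complete, the score distribution should look much closer to the heavily skewed one described above.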
> "most likely to get upvoted" for which "Richard Stallman is Dead" is the optimal title
This is extremely funny, and reminds me of the famous newspaper headline "Generalissimo Francisco Franco Is Still Dead". Of course, at the time of writing, RMS is still alive and the optimal headline is a falsehood.
My system uses logistic regression on words and it thinks that HN (1) really likes Richard Stallman and (2) really likes obituaries so put them together and that headline gets a great score.
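Roughly the shape of it, as a toy sketch with scikit-learn rather than my production code (the titles and labels below are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data: titles labeled 1 if they cleared some upvote threshold.
    titles = [
        "Richard Stallman resigns from the FSF",
        "Show HN: A tiny self-hosted RSS reader",
        "Ask HN: What is your favorite build system?",
        "Yet another JavaScript framework announced",
    ]
    labels = [1, 1, 0, 0]  # illustrative only

    model = make_pipeline(
        CountVectorizer(lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(titles, labels)

    # Per-word coefficients show which tokens push a predicted title score up or down.
    vec = model.named_steps["countvectorizer"]
    clf = model.named_steps["logisticregression"]
    top = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]), key=lambda t: -t[1])[:5]
    print(top)
    print(model.predict_proba(["Richard Stallman is dead"])[:, 1])

With enough history, tokens associated with Stallman and with obituaries pick up large positive weights, which is how the two topics combine into that headline scoring so high.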
I bet if it were posted as "fake news" it would get hundreds of votes and comments before dang took it down. And when it does happen for real it will certainly get thousands of votes.
For example: my #2 submission https://news.ycombinator.com/item?id=38468326
> Of course, at the time of writing, RMS is still alive and the optimal headline is a falsehood.
That's where Betteridge's law of headlines comes to the rescue! Just rephrase the headline as a question - "Is Richard Stallman dead?".
Sorry to get both meta and personal, but I'm kind of curious because you're one of the few here whose name I instantly recognize, probably because I'm fairly interested in science and my impression is you mostly post scientific papers or articles discussing them. I'm looking at your profile of submissions now and the first page is 30 submissions all made in the last 24 hours. Most of them are indeed scientific papers. My own experience reading material like this is it generally takes at minimum 5-6 hours to read a paper and meaningfully digest any of it, and that's only true of subjects I'm somewhat familiar with. For subjects I'm not familiar with, there is rarely any point in reading direct research at all. Given you can't possibly be reading all of this, what is your motivation for submitting all of it to Hacker News? What is your process for finding this material and identifying it as interesting?
(1) Answering "what is my motivation?" isn't simple because I got into this slowly. I really enjoyed participating on HN; around the time my karma reached 4000 I started getting competitive about it, and around 20,000 I started developing automation.
When I helped write
https://www.pnas.org/doi/10.1073/pnas.0308253100
in 2004, I thought text classification was a remarkably mature technology which was under-used. In particular, I thought there was no imagination in RSS reader interfaces and wanted an RSS reader with an algorithmic feed. That December, when Musk bought Twitter, this was still on my mind and I made it happen: the result was the YOShInOn RSS reader [1], and I thought building it around a workflow where I select articles for my own interest and post some on HN was a good north star. [2]
It is self-tuning and soldiers on despite changes in the input and in how much time I devote to it. It spins like a top and I've only patched it twice in the last year.
Anything that gets posted to HN is selected once by the algorithm and twice by me. Reducing latency is a real goal, improving quality is a hypothetical goal, and either of those involves some deep thinking about "what does quality mean?" and threatens the self-tuning and the "spins like a top" quality.
My interest in it is flagging lately because of new projects I am working on. I am worried, though, that if I quit doing it people will wonder if something happened to me, because that happened when Tomte went dark.
(2) I'll argue that scientific papers are both better and worse than you say they are. Sometimes an abstract or an image tells a good story; arguably, a paper that doesn't shouldn't get published. I think effective selection and ranking processes are a pyramid, and I am happy to have the HN community make the decision about things. On the other hand, I've spent 6 months (not full time) wrangling with a paper and then come back 6 years later to see I got it wrong the first time.
I worked at arXiv a long time ago and we talked a lot about bibliometrics and other ways to judge the quality of scientific work, and the clearest thing is that it takes a long time -- not 4-5 hours from an individual but more like several years (maybe decades!) of many, many people working at it -- consider the example of the Higgs boson!
Many of the papers that I post were found in the RSS feed of phys.org; if they weren't working overtime to annoy people with ads I would post more links to phys.org and fewer to papers. I do respect the selection effort they make: they often rewrite the title "We measured something with" to "Scientists discovered something important", and sometimes they explain papers well, but unfortunately "voice" won't get them to reform their self-destructive advertising.
I could ramble on a lot more and I really ought to write this up somewhere off HN but I will just open the floor to questions if you have any.
[1] search for it in the box at the bottom of the page
[2] pay attention if you struggle to complete side projects!
Sent you a message to the email in your bio, but I thought I'd also leave this here: again, thank you for your comments. I went down a rabbit hole reading your essays Hacker News For Hackers and Classifying Hacker News Titles.
I've also been building an RSS reader with integrated HN success prediction – essentially trying to surface which articles from my feed are most likely to resonate on HN before I submit them. Your research directly informed several decisions, so I wanted to share what I've learned and ask for your insights.
V1 – DistilBERT baseline: Started with a fine-tuned DistilBERT on ~15k HN titles. Achieved ROC AUC of 0.77, which felt promising. Clean architecture, simple training loop.

V2 – Added complexity: Switched to RoBERTa + TF-IDF ensemble, thinking more features = better. ROC AUC dropped to 0.70. Precision suffered. Calibration was poor (ECE 0.11).

V3 – Stacking meta-learner: Added LightGBM to combine RoBERTa + TF-IDF + 30 engineered features (domain hit rates, author patterns, temporal encoding, word-level stats). The model was "cheating" by memorizing historical domain/author success rates from training data:
- domain_hit_rate: 79.9% importance
- roberta_prob: only 18.6%

V4 – Return to simplicity: Pure RoBERTa, no stacking. Added isotonic calibration (your probability clustering problem, solved!). Current performance:
- ROC AUC: 0.692
- ECE: 0.009 (excellent calibration)
- Optimal threshold: 0.295 (not 0.5 – exactly as you documented)
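For context, ECE here means the standard binned gap between predicted probability and observed positive rate. A minimal sketch of that computation (illustrative only, not necessarily the exact implementation used):

    import numpy as np

    def expected_calibration_error(y_true, y_prob, n_bins=10):
        # Weighted average of |observed positive rate - mean predicted probability|
        # over equal-width probability bins.
        y_true = np.asarray(y_true, dtype=float)
        y_prob = np.asarray(y_prob, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            if i == n_bins - 1:
                mask = (y_prob >= lo) & (y_prob <= hi)  # include 1.0 in the last bin
            else:
                mask = (y_prob >= lo) & (y_prob < hi)
            if not mask.any():
                continue
            conf = y_prob[mask].mean()  # mean predicted probability in the bin
            acc = y_true[mask].mean()   # observed positive rate in the bin
            ece += mask.mean() * abs(acc - conf)
        return ece

    # Well-calibrated predictions give a small ECE.
    print(expected_calibration_error([1, 0, 1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3, 0.6, 0.2]))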
What worked for me:
- Isotonic calibration: Your observation that most predictions cluster below 0.2 was spot-on. Isotonic regression produces well-distributed, meaningful probabilities.
- Aggressive threshold lowering: At ~10% base rate (posts hitting 100+ points), a 0.5 threshold catches almost nothing useful.
- Pure transformer, no feature engineering: Contrary to intuition, adding TF-IDF and engineered features mostly added noise. The transformer handles the semantics well on its own.
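A minimal sketch of the calibration-plus-threshold step with scikit-learn's IsotonicRegression (the validation arrays are placeholders, and F1 maximization is just one illustrative way to pick the cutoff):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression
    from sklearn.metrics import f1_score

    # Placeholder validation outputs: raw model probabilities and true labels.
    raw_probs = np.array([0.05, 0.12, 0.30, 0.55, 0.80, 0.15, 0.40, 0.70])
    y_val = np.array([0, 0, 1, 1, 1, 0, 0, 1])

    # Fit isotonic regression on held-out data to map raw scores to calibrated probabilities.
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(raw_probs, y_val)
    cal_probs = iso.predict(raw_probs)

    # Search for the decision threshold on validation data instead of defaulting to 0.5,
    # which catches almost nothing at a low base rate.
    thresholds = np.linspace(0.05, 0.95, 19)
    best = max(thresholds, key=lambda t: f1_score(y_val, (cal_probs >= t).astype(int), zero_division=0))
    print("chosen threshold:", round(float(best), 3))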
What didn't work for me:
- Focal loss: Made the model too conservative (760 FN vs 219 FP).
- Domain/author features: Feature leakage, didn't generalize.
- Stacking: Added complexity without improving generalization.
Your essay mentions achieving similar ROC AUC with logistic regression on bag-of-words features. A few things I'm curious about:
- Do you still maintain this system? Has your approach evolved since 2017?
- What was your experience with full-content vs title-only classification? I'm title-only currently, which has obvious limits.
- Any insights on the non-stationarity problem? Topic drift (Apple launches, security panics) seems like a persistent challenge.
- What made you choose logistic regression over neural approaches at the time? The simplicity seems to have served you well.