Comment by 7777777phil

4 days ago

Hi, appreciate your comment. The sample covers all posts and comments from the past 35 days, accessed via the API (https://github.com/philippdubach/hn-archiver). There might be a skew toward sampling higher-voted posts first (i.e. during high-volume periods, posts and comments with zero upvotes may not make it into the database), which would explain the high ratio. I will definitely look into it before publishing the paper. This is exactly the feedback I was hoping for when publishing the preprint. Thanks for pointing this out! Would love to see the mentioned classifier. If you find the time, please reach out to the email on the page or on Bluesky.

This is factually incorrect. There’s no way that you are sampling ALL posts and comments because otherwise the average would not be 35 points. The vast majority of posts get no upvotes.

In addition, comments do not show the points they have accumulated, so there’s no way you can know how many points a comment gets, only posts.

  • Thanks for the pushback; this is exactly the kind of peer review I was hoping for at the preprint stage. You are likely correct regarding the sampling bias. While the intent was to capture all posts, an average score of 35 suggests that my archiver missed a significant portion of the zero-vote posts (likely due to my workers' API rate limits or churn during high-volume periods). This created a survivorship bias toward popular posts in the current dataset, which I will explicitly address and correct.

    To clarify on the second point: I am not analyzing individual comment scores (which, as you noted, are hidden). The metric refers to post points relative to comment growth/volume. I will be updating the methodology section to reflect these limitations. The full code and dataset will be open-sourced with the final publication so the sampling can be fully audited. Appreciate the rigor.

    • If you want some more feedback, why are you using Cloudflare Workers that presumably cost you money? You can retrieve all of the HN content with a regular PC pretty easily. I’m talking a single core with a Python program and minimal RAM.
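
For illustration, a minimal sketch of the single-machine approach described above, assuming the public Hacker News Firebase API (`/v0/maxitem.json` and `/v0/item/<id>.json`). The helper names `fetch_json`, `walk_items`, `mean_story_score`, and `main` are hypothetical and are not taken from hn-archiver. Walking item IDs one by one means low-scoring posts are not skipped by any ranking or sampling step, which is one way to sanity-check the average score discussed in this thread.

```python
import json
import urllib.request
from typing import Any, Callable, Iterable, Iterator, Optional

# Base URL of the public Hacker News API (served via Firebase).
API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url: str) -> Any:
    """Fetch a URL and decode its JSON body; the API returns null for missing items."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def walk_items(start_id: int, stop_id: int,
               fetch: Callable[[str], Any] = fetch_json) -> Iterator[dict]:
    """Yield every existing item with start_id >= id > stop_id, newest first.

    `fetch` is injectable so the walk can be tested without network access.
    """
    for item_id in range(start_id, stop_id, -1):
        item = fetch(f"{API}/item/{item_id}.json")
        if item is not None:
            yield item


def mean_story_score(items: Iterable[dict]) -> Optional[float]:
    """Average the `score` of live stories only; comments carry no score field."""
    scores = [
        it.get("score", 0)
        for it in items
        if it.get("type") == "story" and not it.get("deleted") and not it.get("dead")
    ]
    return sum(scores) / len(scores) if scores else None


def main(n_items: int = 1000) -> None:
    """Hypothetical entry point: scan the newest n_items IDs sequentially."""
    max_id = fetch_json(f"{API}/maxitem.json")
    print(mean_story_score(walk_items(max_id, max_id - n_items)))
```

Calling `main()` issues one request per item ID on a single core. A full 35-day window would need a larger ID range and some politeness delay between requests, both of which are left out of this sketch.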
