
Comment by 7777777phil

4 days ago

Thanks for the pushback; this is exactly the kind of peer review I was hoping for at the preprint stage. You are likely correct about the sampling bias. While the intent was to capture all posts, an average score of 35 suggests that my archiver missed a significant portion of the zero-vote posts, likely due to API rate limits on my workers or churn during high-volume periods. This created a survivorship bias toward popular posts in the current dataset, which I will explicitly address and correct.

To clarify the second point: I am not analyzing individual comment scores (which, as you noted, are hidden); the metric refers to post points relative to comment growth/volume. I will update the methodology section to reflect these limitations. The full code and dataset will be open-sourced with the final publication so the sampling can be fully audited. I appreciate the rigor.

Interestingly, this is the kind of negative feedback that your post implies is bad. Thank goodness for negative feedback!

If you want some more feedback: why are you using Cloudflare Workers, which presumably cost you money? You can retrieve all of the HN content with a regular PC pretty easily. I'm talking a single core, a Python program, and minimal RAM.
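
In case it helps, here is a rough sketch of what that single-core approach could look like against the official HN Firebase API (the /v0/maxitem and /v0/item endpoints are the public ones documented at https://github.com/HackerNews/API; the pacing, resume point, and JSONL output here are just illustrative choices, not anyone's actual archiver):

    # Minimal single-core HN backfill sketch: walk item ids sequentially and
    # append each JSON record to a local file. Endpoints are the public
    # Firebase ones; pacing and file layout are illustrative assumptions.
    import json
    import time
    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    def fetch_item(item_id, session):
        resp = session.get(f"{BASE}/item/{item_id}.json", timeout=10)
        resp.raise_for_status()
        return resp.json()  # None for deleted or unassigned ids

    def archive(start_id, out_path="hn_items.jsonl"):
        session = requests.Session()
        max_id = session.get(f"{BASE}/maxitem.json", timeout=10).json()
        with open(out_path, "a", encoding="utf-8") as out:
            for item_id in range(start_id, max_id + 1):
                item = fetch_item(item_id, session)
                if item is not None:
                    out.write(json.dumps(item) + "\n")
                time.sleep(0.05)  # gentle pacing; tune to taste

    if __name__ == "__main__":
        archive(start_id=1)

Resuming from the last stored id and retrying failed requests would be the main things to add for a full, auditable backfill.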

  • You're right that a simple Python script would be more cost-effective for this kind of archiving. I went with Workers because I was already familiar with the stack and wanted real-time processing, but for a research project focused on completeness rather than latency, your approach makes much more sense. Please reach out if you want to offer your help. Initially I was planning on building a public real-time dashboard, and I might as well still do that.