Comment by sodafountan
18 hours ago
The GitHub page is no longer available, which is a shame because I'm really interested in how this works.
How was the entirety of HN stored in a single SQLite database? In other words, how was the data acquired? And how does the page load instantly if 22GB of data has to be downloaded to the browser?
You can see it now; I forgot to make it public.
1. download_hn.sh - a bash script that queries BigQuery and saves the data as *.json.gz files (first sketch below).
2. etl-hn.js - does the sharding and builds the ID -> shard map, plus the user stats shards (second and third sketches below).
3. Then either npx serve docs locally, or upload to Cloudflare Pages.
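For the acquisition step, here is the same idea as download_hn.sh shown in Node rather than bash (a rough sketch: the public dataset name is real, but the column list, filter, and output filename are illustrative):

```js
// Sketch of the download step (the real download_hn.sh is a bash script;
// this shows the same idea with Google's official Node BigQuery client).
const { BigQuery } = require('@google-cloud/bigquery');
const fs = require('fs');
const zlib = require('zlib');

async function downloadHn() {
  const bq = new BigQuery();
  // bigquery-public-data.hacker_news.full is BigQuery's public HN dataset.
  const [rows] = await bq.query({
    query:
      'SELECT id, type, `by`, time, title, text, url, parent, score ' +
      'FROM `bigquery-public-data.hacker_news.full` ' +
      'WHERE id < 1000000', // illustrative slice; the real script pulls everything
  });
  // Save as gzipped newline-delimited JSON, matching the *.json.gz output.
  const ndjson = rows.map((row) => JSON.stringify(row)).join('\n');
  fs.writeFileSync('hn-000.json.gz', zlib.gzipSync(ndjson));
}

downloadHn().catch(console.error);
```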
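The sharding in etl-hn.js is what makes the instant load possible: the browser never downloads the 22GB, it only ever fetches the one small shard holding the item it wants. A minimal sketch of that step, assuming fixed-size ID ranges so the ID -> shard map is just integer division (shard size, paths, and field names are guesses, not the actual script):

```js
// Simplified sketch of the shard step (not the actual etl-hn.js).
// Fixed-size ID ranges mean the ID -> shard map is just integer division.
const fs = require('fs');
const zlib = require('zlib');

const SHARD_SIZE = 100000; // items per shard; illustrative

// Read one downloaded dump: gzipped, newline-delimited JSON.
const lines = zlib.gunzipSync(fs.readFileSync('hn-000.json.gz'))
  .toString('utf8')
  .split('\n')
  .filter(Boolean);

// Bucket items by shard ID.
const shards = new Map();
for (const line of lines) {
  const item = JSON.parse(line);
  const shardId = Math.floor(item.id / SHARD_SIZE);
  if (!shards.has(shardId)) shards.set(shardId, []);
  shards.get(shardId).push(item);
}

// Write each shard as its own small gzipped file under docs/, so the
// static site only ever fetches the one shard holding the item it needs.
fs.mkdirSync('docs/shards', { recursive: true });
for (const [shardId, items] of shards) {
  fs.writeFileSync(
    `docs/shards/${shardId}.json.gz`,
    zlib.gzipSync(JSON.stringify(items))
  );
}
```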
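And the corresponding client-side lookup is then just arithmetic plus one small fetch (same assumed layout; DecompressionStream is built into modern browsers and handles the pre-gzipped shard files):

```js
// Browser-side sketch: resolve an item ID to its shard and fetch only that.
const SHARD_SIZE = 100000; // must match the value the ETL step used

async function loadItem(id) {
  const shardId = Math.floor(id / SHARD_SIZE); // ID -> shard is pure arithmetic
  const res = await fetch(`shards/${shardId}.json.gz`);
  // Shards are stored pre-gzipped, so decompress in the browser.
  const stream = res.body.pipeThrough(new DecompressionStream('gzip'));
  const items = await new Response(stream).json();
  return items.find((item) => item.id === id);
}

// e.g. loadItem(8863).then((item) => console.log(item.title));
```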
The ./tools/predeploy-checks.sh script basically runs the entire pipeline; you can run it unattended with AUTO_RUN=true.
Awesome, I'll take a look