Comment by jkarneges

4 months ago

The HN/Firebase API doesn't make this easy. For https://hnstream.com I ended up crawling items to find the article.

8 comments

jkarneges

Any tips on respectfully crawling HN so you don’t get throttled? I had an application idea that could not be served by the API (need karma values) so I started to write code to scrape but got rate limited pretty quickly.

jkarneges 4 months ago

I've had no trouble hitting the Firebase API at the speed items are created, with a 5 second delay between retries.
For scraping HN directly, in my experience you have to go extremely slowly, like 1 minute between item fetches. And if you get blocked, it may be better to wait a long time (minutes) before trying again, rather than using exponential backoff, in order to get out of the penalty box. You'll need a cache for sure.
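A rough sketch of that approach, in case it helps: a cache, a minimum gap between requests, and a fixed long cooldown after a block instead of exponential backoff. Everything named here is made up for illustration (`PoliteFetcher`, the 60 second and 10 minute defaults, and the convention that the injected `fetch` callable returns `None` when the site blocks you). It is not anything HN documents, so tune the numbers yourself.

```python
import time
from typing import Callable, Dict, Optional


class PoliteFetcher:
    """Fetch slowly, cache results, and cool down for a fixed time when blocked."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],  # hypothetical: returns None if blocked
        min_interval: float = 60.0,   # assumed: roughly 1 minute between requests
        cooldown: float = 600.0,      # assumed: fixed 10 minute wait after a block
        max_attempts: int = 3,
        sleep: Callable[[float], None] = time.sleep,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self.min_interval = min_interval
        self.cooldown = cooldown
        self.max_attempts = max_attempts
        self._sleep = sleep
        self._clock = clock
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def _wait_for_slot(self) -> None:
        # Sleep just long enough to keep min_interval between requests.
        if self._last_request is None:
            return
        elapsed = self._clock() - self._last_request
        if elapsed < self.min_interval:
            self._sleep(self.min_interval - elapsed)

    def get(self, url: str) -> str:
        # Serve repeat requests from the cache so they cost nothing.
        if url in self._cache:
            return self._cache[url]
        for attempt in range(self.max_attempts):
            self._wait_for_slot()
            body = self._fetch(url)
            self._last_request = self._clock()
            if body is not None:
                self._cache[url] = body
                return body
            # Blocked: wait a fixed, long time rather than a growing backoff.
            if attempt + 1 < self.max_attempts:
                self._sleep(self.cooldown)
        raise RuntimeError(f"still blocked after {self.max_attempts} attempts: {url}")
```

You would pass in your own `fetch` that does the HTTP request and returns `None` on whatever response you treat as "blocked". The cache here is in-memory only; a real crawler would probably want it on disk.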

esafak 4 months ago

The comments don't even have a thread ID?

zamadatix 4 months ago
Comment items look like https://hacker-news.firebaseio.com/v0/item/45533616.json?pri...:
{ "by" : "jkarneges", "id" : 45533018, "kids" : [ 45533616 ], "parent" : 45532549, "text" : "The HN/Firebase API doesn't make this easy. For <a href=\"https://hnstream.com\" rel=\"nofollow\">https://hnstream.com</a> I ended up crawling items to find the article.", "time" : 1760043552, "type" : "comment" }
"parent" can be either the actual parent comment or the parent article, depending on where in the comment chain you are.
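So to find the article you have to follow "parent" upward until you reach an item that has no "parent" field. A minimal sketch, assuming the item endpoint shown above without the query string; `fetch_item`, `find_root`, and the `max_depth` guard are names I made up, and there is no handling for deleted items that come back as null:

```python
import json
from typing import Any, Callable, Dict
from urllib.request import urlopen

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_item(item_id: int) -> Dict[str, Any]:
    """Fetch one item from the Firebase API (one HTTP request per call)."""
    with urlopen(ITEM_URL.format(id=item_id), timeout=10) as resp:
        return json.load(resp)


def find_root(
    item_id: int,
    get_item: Callable[[int], Dict[str, Any]] = fetch_item,
    max_depth: int = 1000,  # hypothetical guard against runaway loops
) -> Dict[str, Any]:
    """Follow "parent" links upward and return the first item without one."""
    current = get_item(item_id)
    for _ in range(max_depth):
        parent_id = current.get("parent")
        if parent_id is None:
            # No "parent" field: this is the top-level item (the article).
            return current
        current = get_item(parent_id)
    raise RuntimeError(f"no root found within {max_depth} hops of item {item_id}")
```

Calling `find_root(45533018)` would walk up from the comment above, one request per level, which is why deep threads get expensive and why caching items helps.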
- esafak 4 months ago
  
  Perhaps @kogir, who was active on https://github.com/HackerNews/API could add the thread id.
- smusamashah 4 months ago
  
  https://jaytaylor.github.io/hn-live2 is doing it though.
  