Comment by renegat0x0
16 hours ago
Scraping is hard, and at the same time not that hard. There are many projects about scraping, so with a few lines you can implement a scraper using curl_cffi or Playwright.
People complain that the user-agent needs to be filled in. Boo-hoo, are we on Hacker News or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
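A minimal sketch of that with curl_cffi (the URL and cookie values here are placeholders, not from any real site):

    # pip install curl_cffi
    from curl_cffi import requests

    # impersonate="chrome" mimics Chrome's TLS fingerprint and
    # sends a matching User-Agent, which clears many basic checks.
    resp = requests.get(
        "https://example.com/page",        # placeholder URL
        impersonate="chrome",
        cookies={"session": "abc123"},     # placeholder cookie
    )
    print(resp.status_code, len(resp.text))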
I myself have implemented a simple solution that can jump through many hoops and return a JSON response. Simple and easy [0].
On the other hand, it has always been an arms race, and it always will be. Eventually all content will be locked behind walled gardens; there is no way around it.
Search engines affect me less and less every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, YouTube channels [1].
Since the database is so big, the places I use most are extracted into a simple, fast web page backed by a SQLite table [2]. Scraping done right is not a problem.
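A sketch of that pattern, assuming a hypothetical schema with a visit counter (the real project's layout may differ):

    import sqlite3

    con = sqlite3.connect("places.db")  # hypothetical file name
    con.execute("""
        CREATE TABLE IF NOT EXISTS places (
            url    TEXT PRIMARY KEY,
            title  TEXT,
            visits INTEGER DEFAULT 0
        )
    """)

    # Pull the most-used entries, e.g. to render a static start page.
    for url, title in con.execute(
        "SELECT url, title FROM places ORDER BY visits DESC LIMIT 50"
    ):
        print(f'<a href="{url}">{title}</a>')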
[0] https://github.com/rumca-js/crawler-buddy
> Search engines affect me less and less every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, YouTube channels
Exactly. Why can't we just hoard our bookmarks and a list of curated sources, say 1M or 10M small search stubs, and have an LLM direct the scraping operation?
The idea is to have starting points for a scraper, such as blogs, awesome lists, specialized search engines, news sites, docs, etc. On a given query the model only needs a few starting points to find fresh information. Hosting a few GB of compact search stubs could go a long way towards search independence.
This could mean replacing Google. You can even go fully local with a local LLM + code sandbox + search stub index + scraper.
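A rough sketch of such a loop; everything here (the stubs.db index, its schema, and the keyword-match stand-in for the LLM step) is my assumption:

    import sqlite3
    from curl_cffi import requests

    def pick_starting_points(query, con, k=5):
        # Stand-in for the LLM: a plain keyword match over the local
        # stub index; a real setup would ask a (local) model to choose.
        return [row[0] for row in con.execute(
            "SELECT url FROM stubs WHERE description LIKE ? LIMIT ?",
            (f"%{query}%", k),
        )]

    def gather(query):
        con = sqlite3.connect("stubs.db")  # hypothetical stub index
        pages = []
        for url in pick_starting_points(query, con):
            pages.append(requests.get(url, impersonate="chrome").text)
        # A real pipeline would now hand `pages` back to the LLM
        # to extract and summarize fresh information.
        return pages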
Marginalia Search does something like this.
+1 so much for this. I have been doing the same: an SQLite database of my "own personal internet" of the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself, which I actually did to replace Kagi.
Building a metasearch engine is not hard to do (especially with AI now). It's so liberating when you control the ranking algorithm and can supplement what the big engines return with your own index of sites and pages that matter to you. I admit my results and speed aren't as good as Kagi's, but they're still good enough that my personal search engine has been my sole search engine for a year now.
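A minimal sketch of the merge-and-boost idea using reciprocal-rank fusion; the input lists, domain set, and boost factor are my assumptions, not the actual project's code:

    from urllib.parse import urlparse

    # result_lists: per-engine lists of URLs, best first,
    # as if already fetched from the upstream engines.
    def merge_results(result_lists, personal_domains, boost=25.0):
        scores = {}
        for results in result_lists:
            for rank, url in enumerate(results):
                # Reciprocal-rank fusion across engines.
                scores[url] = scores.get(url, 0.0) + 1.0 / (60 + rank)
        for url in scores:
            if urlparse(url).netloc in personal_domains:
                scores[url] *= boost  # favor my own curated sites
        return sorted(scores, key=scores.get, reverse=True)

    print(merge_results(
        [["https://a.example/x", "https://b.example/y"],
         ["https://b.example/y", "https://c.example/z"]],
        personal_domains={"b.example"},
    ))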
If a site doesn't want me to crawl them, that's fine. I probably don't need them. In practice it hasn't gotten in the way as much as I might have thought it would. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.
I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/
When I saw the Internet-Places-Database I thought it was an index of some sort of PoI (points of interest) and got curious. But the personal-internet spiel is pretty cool. One good addition to this could be the Foursquare PoI dataset for places search: https://opensource.foursquare.com/os-places/