Comment by renegat0x0
19 days ago
Scraping is hard, and at the same time not that hard. There are many scraping projects out there, so with a few lines you can implement a scraper using curl_cffi or Playwright.
People complain that the user-agent needs to be set. Boo-hoo, are we on Hacker News or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
I myself have implemented a simple solution that can jump through many of these hoops and return a JSON response. Simple and easy [0].
On the other hand, it has always been an arms race, and it always will be. Eventually all content will be locked behind walled gardens; there is no way around it.
Search engines affect me less and less every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels [1].
Since the database is so big, the places I use most are extracted into a simple, fast web page backed by an SQLite table [2]. Scraping done right is not a problem.
[0] https://github.com/rumca-js/crawler-buddy
+1 so much for this. I have been doing the same: an SQLite database of my "own personal internet" of the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself - which I actually did to replace Kagi.
Building a metasearch engine is not hard to do (especially with AI now). It's so liberating when you control the ranking algorithm, and can supplement what the big engines provide as results with your own index of sites and pages that are important to you. I admit, my results & speed aren't as good as Kagi, but still good enough that my personal search engine has been my sole search engine for a year now.
If a site doesn't want me to crawl them, that's fine. I probably don't need them. In practice it hasn't gotten in the way as much as I might have thought it would. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.
I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/
Do you have any documentation/blog post for this? I would love to do something similar for my own use.
Unfortunately I still haven't had a chance to make a blog post about this, which I really must do. But I can give some quick hints. Anyone reading this, feel free to reach out & I can try to answer questions, and they might help my blog post too.
I started off with a meta-search calling out to Brave / Mojeek / Marginalia, and the basics of that are something that you can ask an AI to make for you as a one-file PHP script. I still think this is a good place to start, because you'll quickly find "okay, I can replace my everyday search engine with this". Once you're dogfooding your engine every day, you'll notice all the rough points you want to improve.
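To give a feel for it, a stripped-down one-file version looks roughly like the sketch below. The engine URLs are placeholders (not the real Brave / Mojeek / Marginalia endpoints - check each engine's API docs for actual URLs, keys, and response formats), and it assumes each engine's response has already been normalised into {title, url, description} objects. Treat it as a sketch, not my actual script:

    <?php
    // Rough sketch of a one-file metasearch aggregator.
    // The endpoint URLs below are placeholders, not real APIs.
    $engines = [
        'brave'      => 'https://example.invalid/brave/search?q=',
        'mojeek'     => 'https://example.invalid/mojeek/search?q=',
        'marginalia' => 'https://example.invalid/marginalia/search?q=',
    ];

    function fetch_engine(string $url): array {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 5,
            CURLOPT_USERAGENT      => 'my-metasearch/0.1',
        ]);
        $body = curl_exec($ch);
        curl_close($ch);
        // Assume the response is a JSON array of {title, url, description}.
        $data = json_decode($body ?: '[]', true);
        return is_array($data) ? $data : [];
    }

    $q       = urlencode($_GET['q'] ?? '');
    $results = [];
    foreach ($engines as $name => $base) {
        foreach (fetch_engine($base . $q) as $r) {
            $r['engine'] = $name;
            $results[$r['url']] = $r;   // de-duplicate by URL
        }
    }

    foreach ($results as $r) {
        printf('<p><a href="%s">%s</a><br>%s</p>' . "\n",
            htmlspecialchars($r['url']),
            htmlspecialchars($r['title']),
            htmlspecialchars($r['description']));
    }

Wire that up behind a simple HTML form and you already have something you can set as your browser's default search.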
Once you've got an array of objects with Title, URL, and Description, and you split the URL into domain, TLD, subdomain, path, and file extension... there's a lot of ranking you can apply just to those. Honestly, a lot of my "ranking" has just been boosting the domains I visit most often. I have an array of about 600 domains that get a ranking boost. You can experiment with re-ranking there before even starting to build your own index.
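The domain-boost part is honestly just a lookup table and a sort. Something along these lines - the boost values and domains here are invented for illustration, not my real list:

    <?php
    // Toy re-ranking pass: multiply a result's score by a boost if its
    // domain is on a personal list. Boost values are purely illustrative.
    $boosts = [
        'github.com'           => 3.0,
        'marginalia.nu'        => 2.5,
        'news.ycombinator.com' => 2.0,
    ];

    function rerank(array $results, array $boosts): array {
        foreach ($results as &$r) {
            $host = parse_url($r['url'], PHP_URL_HOST) ?: '';
            $host = preg_replace('/^www\./', '', $host);
            $r['score'] = ($r['score'] ?? 1.0) * ($boosts[$host] ?? 1.0);
        }
        unset($r);
        usort($results, fn ($a, $b) => $b['score'] <=> $a['score']);
        return $results;
    }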
As for building your own (small, personal) index, the technical details are not as difficult as you'd think. An SQLite database file that your PHP file reads will take you a long way... especially if you enable FTS5 indexing. I only did that last week, and I should have done it at the beginning. Search times are 10ms, and not just on my personally curated index of 80,000 pages... I just added a 2nd database with 1.3 million entries from DMOZ (the old Mozilla Directory), and it's still only about 10ms. My search engine now feels super fast when it gets results from my database. And when it finds zero results, it automatically falls back to the metasearch.
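For anyone wondering what the FTS5 part looks like: it's just a virtual table plus a MATCH query. A minimal sketch, with column names I've made up for the example rather than my real schema:

    <?php
    // Minimal FTS5 lookup, assuming a table created roughly like:
    //   CREATE VIRTUAL TABLE pages USING fts5(title, url, description);
    // Note: raw user input can trip FTS5 query syntax (unbalanced quotes
    // etc.), so sanitise or quote it in real use.
    $db   = new SQLite3('index.db', SQLITE3_OPEN_READONLY);
    $stmt = $db->prepare(
        'SELECT title, url, description
           FROM pages
          WHERE pages MATCH :q
          ORDER BY rank      -- FTS5 built-in BM25 ordering
          LIMIT 30'
    );
    $stmt->bindValue(':q', $_GET['q'] ?? '', SQLITE3_TEXT);
    $res  = $stmt->execute();

    $hits = [];
    while ($row = $res->fetchArray(SQLITE3_ASSOC)) {
        $hits[] = $row;
    }
    if (count($hits) === 0) {
        // fall back to the external metasearch path here
    }

The built-in rank column gives you BM25 ordering for free, which is a big part of why it feels fast.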
At 1.3 Million entries, the two databases are only about 550MB total. It's running on a shared hosting account and apparently they're not worried - but it's only available to me, so I'm only hitting it maybe 50 times a day maximum. I'll move it onto a VPS eventually, but every time I think "this must be using up too many resources", I find I'm thinking too small by at least a factor of 10x.
For getting started with PHP & SQLite, I found this blog post helpful - but at this point, your AI can vibe code the entire thing for you:
https://davidejones.com/blog/simple-site-search-engine-php-s...
It's amazing how far you can get with just SQLite and FTS5 and a little PHP. Read the Marginalia blog too, there's so much good information in there.
Don't hold yourself back, don't think it's impossible.
> Search engines affect me less and less every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels
Exactly, why can't we just hoard our bookmarks and a list of curated sources, say 1M or 10M small search stubs, and have an LLM direct the scraping operation?
The idea is to have starting points for a scraper, such as blogs, awesome lists, specialized search engines, news sites, docs, etc. On a given query the model only needs a few starting points to find fresh information. Hosting a few GB of compact search stubs could go a long way towards search independence.
This could mean replacing Google. You can even go fully local with local LLM + code sandbox + search stub index + scraper.
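A minimal version of the "search stub index" could just be a small FTS5 table of curated seed URLs that the model (or a plain scraper) queries for starting points. The schema below is purely a guess at how this could look, not an existing project:

    <?php
    // Hypothetical "search stub" index: a small table of curated starting
    // points (blogs, awesome lists, docs, niche engines) keyed by topic tags.
    //   CREATE VIRTUAL TABLE stubs USING fts5(tags, url, note);
    $db   = new SQLite3('stubs.db', SQLITE3_OPEN_READONLY);
    $stmt = $db->prepare(
        'SELECT url, note FROM stubs WHERE stubs MATCH :topic ORDER BY rank LIMIT 10'
    );
    $stmt->bindValue(':topic', $argv[1] ?? 'self-hosted search', SQLITE3_TEXT);
    $res = $stmt->execute();

    // These seed URLs would then be handed to the scraper (or an LLM agent)
    // as fresh starting points for the actual query.
    while ($row = $res->fetchArray(SQLITE3_ASSOC)) {
        echo $row['url'], "\t", $row['note'], "\n";
    }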
Marginalia Search does something like this
When I saw the Internet-Places-Database I thought it was an index of points of interest (PoI) and I got curious. But the personal internet spiel is pretty cool. One good addition to this could be the Foursquare PoI dataset for places search: https://opensource.foursquare.com/os-places/