Comment by SyneRyder
18 days ago
+1 so much for this. I have been doing the same: an SQLite database of my "own personal internet" - the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself - which I actually did to replace Kagi.
Building a metasearch engine is not hard to do (especially with AI now). It's so liberating when you control the ranking algorithm and can supplement the big engines' results with your own index of the sites and pages that matter to you. I admit my results & speed aren't as good as Kagi's, but they're still good enough that my personal search engine has been my sole search engine for a year now.
If a site doesn't want me to crawl them, that's fine. I probably don't need them. In practice it hasn't gotten in the way as much as I might have thought it would. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.
I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/
Do you have any documentation/blog post for this? I would love to do something similar for my own use.
Unfortunately I still haven't had a chance to write a blog post about this, which I really must do. But I can give some quick hints. Anyone reading this, feel free to reach out & I can try to answer questions - they might help shape my blog post too.
I started off with a meta-search calling out to Brave / Mojeek / Marginalia, and the basics of that are something that you can ask an AI to make for you as a one-file PHP script. I still think this is a good place to start, because you'll quickly find "okay, I can replace my everyday search engine with this". Once you're dogfooding your engine every day, you'll notice all the rough points you want to improve.
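To give a flavour of it, the merge step is the heart of a metasearch engine. Here's a rough sketch in Python (mine is PHP, but the logic carries over). I've left out the fetch calls to the individual engines, since each API needs its own key and endpoint; this just shows round-robin merging with URL de-duplication. The Title/URL/Description keys are the shape I use, not any engine's actual response format.

```python
from itertools import zip_longest

def merge_results(*result_lists):
    """Round-robin merge of per-engine result lists, de-duplicated by URL.

    Each result is a dict with Title / URL / Description keys -- my own
    internal shape, not any engine's native response format.
    """
    seen = set()
    merged = []
    # Take one result from each engine per round, so no single engine
    # monopolises the top of the page.
    for batch in zip_longest(*result_lists):
        for result in batch:
            if result is None:        # this engine ran out of results
                continue
            url = result["URL"].rstrip("/")
            if url not in seen:
                seen.add(url)
                merged.append(result)
    return merged

# Toy example with two engines returning an overlapping result:
brave = [{"Title": "A", "URL": "https://example.com/a", "Description": ""}]
mojeek = [
    {"Title": "A", "URL": "https://example.com/a", "Description": ""},
    {"Title": "B", "URL": "https://example.com/b", "Description": ""},
]
print([r["Title"] for r in merge_results(brave, mojeek)])  # ['A', 'B']
```

Once this skeleton works, everything else (ranking, your own index) slots in around it.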
Once you've got an array of objects with Title, URL, and Description, and have split the URL into domain, TLD, subdomain, path, and file extension... there's a lot of ranking you can apply to just those. Honestly, a lot of my "ranking" has just been boosting the domains I visit most often. I have an array of about 600 domains that the engine applies ranking boosts to. You can experiment with re-ranking there before even starting to build your own index.
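If it helps, the boost step looks roughly like this (Python sketch again; the domains and boost values are invented examples, and `score`/`rerank` are names I'm making up for illustration):

```python
from urllib.parse import urlparse

# Hypothetical boost table -- mine holds ~600 domains I visit often.
DOMAIN_BOOSTS = {
    "marginalia.nu": 3.0,
    "sqlite.org": 2.5,
    "php.net": 2.0,
}

def score(result, position):
    """Base score from the engine's own ordering, scaled by a domain boost."""
    base = 1.0 / (position + 1)               # earlier in the list = higher base
    host = urlparse(result["URL"]).hostname or ""
    domain = ".".join(host.split(".")[-2:])   # crude eTLD+1; fine for personal use
    return base * DOMAIN_BOOSTS.get(domain, 1.0)

def rerank(results):
    """Re-order metasearch results so boosted domains float upward."""
    scored = [(score(r, i), r) for i, r in enumerate(results)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]

results = [
    {"Title": "Some blog post", "URL": "https://example.com/x"},
    {"Title": "FTS5 docs", "URL": "https://www.sqlite.org/fts5.html"},
]
print([r["Title"] for r in rerank(results)])  # ['FTS5 docs', 'Some blog post']
```

The nice part is that the boost table is just data - you can keep tweaking it forever without touching the engine.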
As for building your own (small, personal) index, the technical details are not as difficult as you'd think. An SQLite database file that your PHP script reads will take you a long way... especially if you enable FTS5 indexing. I only did that last week, and I should have done it at the beginning. Search times are around 10ms, and not just on my personally curated index of 80,000 pages... I just added a second database with 1.3 million entries from DMOZ (the old Open Directory Project, originally directory.mozilla.org), and it's still only about 10ms. My search engine now feels super fast when it gets results from my database. And when it finds zero results, it automatically falls back to the metasearch.
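For the curious, the whole FTS5 side boils down to a handful of SQL statements. Here's a sketch using Python's bundled sqlite3 module (I use PHP's SQLite3 class, but the SQL is identical; the `pages` table and its columns are just example names, and FTS5 has to be compiled into your SQLite build, which it usually is):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # on disk this is just one .db file
# An FTS5 virtual table indexes every column for full-text search.
db.execute("CREATE VIRTUAL TABLE pages USING fts5(title, url, description)")
db.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("Marginalia Search log", "https://www.marginalia.nu/log/",
         "Notes on building a search engine"),
        ("SQLite FTS5 extension", "https://sqlite.org/fts5.html",
         "Full-text search for SQLite"),
    ],
)

def search(query):
    # bm25() is FTS5's built-in relevance rank; lower values are better.
    rows = db.execute(
        "SELECT title, url FROM pages WHERE pages MATCH ?"
        " ORDER BY bm25(pages) LIMIT 20",
        (query,),
    ).fetchall()
    # Zero hits: return None so the caller falls back to the metasearch.
    return rows or None

print(search("fts5"))
```

That `return rows or None` at the end is the entire "fall back to metasearch" mechanism - the caller just checks for None and fans out to the external engines instead.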
At 1.3 million entries, the two databases are only about 550MB total. It's running on a shared hosting account and apparently they're not worried - but it's only available to me, so I'm only hitting it maybe 50 times a day maximum. I'll move it onto a VPS eventually, but every time I think "this must be using up too many resources", I find I'm thinking too small by at least a factor of 10x.
For getting started with PHP & SQLite, I found this blog post helpful - but at this point, your AI can vibe code the entire thing for you:
https://davidejones.com/blog/simple-site-search-engine-php-s...
It's amazing how far you can get with just SQLite and FTS5 and a little PHP. Read the Marginalia blog too, there's so much good information in there.
Don't hold yourself back, don't think it's impossible.