Comment by nostrademons

10 hours ago

It's actually not that hard now, once you get useful content. When I worked on Search (~2009), the primary index was called 4BBase, because it was the top 4 billion webpages (actually more like 5.5B during my time, but the name had been around for a few years). A typical webpage is about 100K, and HTML compresses by 80-90%, so you're looking at 10-20K/page. The index would take about 50-100 TB.
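To sanity-check that arithmetic, here is a rough back-of-the-envelope sketch in Python. The function name and the decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes) are my own assumptions; the page counts, page size, and compression range are the ones quoted above.

```python
# Back-of-the-envelope index size, using the figures from the comment.
# Assumption: decimal units (1 KB = 1000 bytes, 1 TB = 10**12 bytes).
KB = 1000
TB = 10**12

def index_size_tb(pages, raw_page_bytes, saving_pct):
    """Total compressed size in TB if each page shrinks by saving_pct percent."""
    compressed_page_bytes = raw_page_bytes * (100 - saving_pct) // 100
    return pages * compressed_page_bytes / TB

# Best case: 4B pages, 90% compression. Worst case: 5.5B pages, 80% compression.
low_tb = index_size_tb(4_000_000_000, 100 * KB, 90)
high_tb = index_size_tb(5_500_000_000, 100 * KB, 80)
print(f"{low_tb:.0f}-{high_tb:.0f} TB")
```

The extremes come out a little wider than 50-100 TB, which is what you'd expect from rounding a rough estimate.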

Even after the recent AI run-up, disk prices are about $20/TB for a 20TB drive, so you can store this index on 3-5 hard disks that will cost you about $1200-2000. For self-hosted use you don't need to serve results in 50ms, so you don't need to put the whole thing in RAM like Google did; you can serve off of disk.
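The disk math works out the same way. A minimal sketch, assuming the $20/TB price and 20TB drive size quoted above and ignoring redundancy, filesystem overhead, and spare capacity (the function name is just for illustration):

```python
import math

DRIVE_TB = 20          # drive capacity from the comment
PRICE_PER_TB_USD = 20  # approximate price from the comment

def disks_and_cost(index_tb):
    """Whole drives needed for index_tb, and their total price in USD."""
    disks = math.ceil(index_tb / DRIVE_TB)
    return disks, disks * DRIVE_TB * PRICE_PER_TB_USD

for index_tb in (50, 100):
    disks, cost = disks_and_cost(index_tb)
    print(f"{index_tb} TB index: {disks} disks, about ${cost}")
```

In practice you would probably want at least one extra drive for redundancy, which this sketch leaves out.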

Elasticsearch uses basically the same data structures and gives you the same infrastructure that Google's late-00s search stack did, and is actually more advanced in some respects (like ad-hoc queries, debuggability, and updateability), so software isn't much of an issue.

The big missing piece that can't really be replicated today is the huge web of authentic hyperlinks. The reason Google was so good at search was that many humans effectively "tagged" a given webpage with a series of short, descriptive words and phrases in the text of their links to it. When someone searched for a page, Google could mine this huge treasure trove of backlinks to identify exactly what the page was good for, even if those search terms never appeared on the page itself. SEO and link farms kinda killed this, as did the rise of social media walled gardens, so the Google of 2009 basically wouldn't work today anyway. Maybe if you pulled old versions of Common Crawl or archive.org you could reconstruct it, but the relevant pages are often offline today anyway.
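To make the "tagging" idea concrete, here is a toy sketch of an index built only from link text. It is not Google's actual algorithm; the URLs, link data, and scoring (a plain count of matching inbound links per query term) are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical links: (source page, target page, link text).
links = [
    ("blog.example/a", "docs.example/install", "setup guide"),
    ("forum.example/t1", "docs.example/install", "how to install"),
    ("wiki.example/x", "docs.example/install", "install guide"),
    ("blog.example/b", "news.example/story", "install party recap"),
]

def build_anchor_index(links):
    """Map each link-text term to a count of inbound links per target page."""
    index = defaultdict(Counter)
    for _source, target, text in links:
        for term in set(text.lower().split()):
            index[term][target] += 1
    return index

def search(index, query):
    """Rank target pages by how many inbound links mention each query term."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, {}))
    return scores.most_common()

anchor_index = build_anchor_index(links)
results = search(anchor_index, "install guide")
print(results)
```

Note that the page text is never consulted: the target ranks for "guide" purely because other people described it that way, which is also why link farms can game this kind of signal.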

1 comment

opengrass 5 hours ago

If an ex-Googler compares Elasticsearch to the old company's stack, then it must be something good.