Comment by shevy-java

10 hours ago

Alright, but why do we not have more search engines that are actually good?

I'd love to cut myself off from Google, including Google Search, but every alternative manages to be even worse. Consistently so. It's as if Google won the war by permanently being just slightly better, while everyone else is actually really crap. That wasn't the case, say, 10 years ago or so.

Not all search is web-wide search. The best-known example of this is probably Amazon's search bar - no one really wants to search Amazon via Google. Amazon even has staffers contributing heavily to Lucene.

But there are also all kinds of other applications. Say you run a reviews site: you can build a bespoke power search form that lets people sort on things like price or date of review, set a minimum star threshold, and so on. You can also weigh product names or review titles more heavily in the index scoring (a review /of/ the Pixel 10 should rank higher than a review that merely mentions the Pixel 10 prominently).
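To make that concrete, here's a minimal Lucene sketch. The field names ("title", "body", "stars", "price") and the already-opened IndexSearcher are assumptions about how you indexed your data, not anything Lucene prescribes:

    import java.io.IOException;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    static TopDocs searchReviews(IndexSearcher searcher) throws IOException {
        // Weigh title matches 3x body matches, so a review /of/ the
        // Pixel 10 outranks a review that merely mentions it.
        Query titleHit = new BoostQuery(new TermQuery(new Term("title", "pixel")), 3.0f);
        Query bodyHit  = new TermQuery(new Term("body", "pixel"));
        // Minimum star threshold (stars >= 4); "stars" must be indexed as an IntPoint.
        Query minStars = IntPoint.newRangeQuery("stars", 4, Integer.MAX_VALUE);

        Query q = new BooleanQuery.Builder()
            .add(titleHit, BooleanClause.Occur.SHOULD)  // contributes boosted score
            .add(bodyHit,  BooleanClause.Occur.SHOULD)
            .add(minStars, BooleanClause.Occur.FILTER)  // hard cutoff, no score impact
            .build();

        // Sort by price instead of relevance; assumes "price" was also
        // indexed as a NumericDocValuesField.
        Sort byPrice = new Sort(new SortField("price", SortField.Type.LONG));
        return searcher.search(q, 20, byPrice);
    }

Using FILTER rather than MUST for the threshold is deliberate: it narrows the result set without letting the range clause distort relevance scores.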

Even being able to sort blog posts or other dated content by date is powerful - Google can only guess at the actual dates of those posts. You can search with required tags, or weigh tags more heavily in result scoring. You can put your finger on the scale and say, effectively, that post A should always rank more highly than post B for term X.
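Same caveat as above - "body", "tag", "id", and "published" are just how I'd happen to index a blog - but a required tag, a pinned post, and a newest-first sort might look like:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    static TopDocs searchPosts(IndexSearcher searcher, String term, boolean newestFirst)
            throws IOException {
        Query text   = new TermQuery(new Term("body", term));
        Query tagged = new TermQuery(new Term("tag", "search"));  // required tag
        // Finger on the scale: a huge boost for one specific post.
        Query pinned = new BoostQuery(new TermQuery(new Term("id", "post-a")), 100f);

        Query q = new BooleanQuery.Builder()
            .add(text,   BooleanClause.Occur.MUST)
            .add(tagged, BooleanClause.Occur.FILTER)  // must match, doesn't affect score
            .add(pinned, BooleanClause.Occur.SHOULD)  // only matters for relevance order
            .build();

        if (newestFirst) {
            // Assumes "published" is an epoch-millis NumericDocValuesField.
            Sort byDate = new Sort(new SortField("published", SortField.Type.LONG, true));
            return searcher.search(q, 20, byDate);
        }
        return searcher.search(q, 20);  // relevance order; the pin boost applies here
    }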

Also, site operators know their own traffic/popularity numbers, which internet search engines can only guess at, and they can use those to score/sort. Amazon clearly does this.
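Lucene even has first-class support for static signals like this via FeatureField. A sketch - the "features"/"popularity" names and the pageViews number are mine, not part of Lucene's API:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FeatureField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Index time: attach a per-document popularity signal.
    static Document withPopularity(Document doc, float pageViews) {
        doc.add(new FeatureField("features", "popularity", pageViews));
        return doc;
    }

    // Query time: blend popularity into the relevance score. The saturation
    // function scores roughly v / (v + pivot), so huge view counts don't
    // completely drown out text relevance.
    static Query scoredByPopularity(String term) {
        Query popularity = FeatureField.newSaturationQuery("features", "popularity");
        return new BooleanQuery.Builder()
            .add(new TermQuery(new Term("body", term)), BooleanClause.Occur.MUST)
            .add(popularity, BooleanClause.Occur.SHOULD)
            .build();
    }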

For some reason a lot of web devs seem to think search is this really hard problem. But once you learn the basics of how it works, and if you use a library like Lucene, it does not need to be hard at all. Mostly you just have to be strategic and consistent about where and when you index and deindex content - usually right alongside your db persistence calls. Once it's running, you optimize by sprinkling some minimum amount of magic on your scoring setup to make it worthwhile and differentiated from Google.
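For instance (Review, db, and writer here are hypothetical stand-ins for your own entity class, your data layer, and a long-lived, shared Lucene IndexWriter):

    import java.io.IOException;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    void saveReview(Review r) throws IOException {
        db.save(r);  // your normal persistence call
        Document doc = new Document();
        doc.add(new StringField("id", r.id(), Field.Store.YES));
        doc.add(new TextField("title", r.title(), Field.Store.YES));
        doc.add(new TextField("body", r.body(), Field.Store.NO));
        // updateDocument is delete-then-add keyed on the id term, so
        // re-saving the same review never duplicates it in the index.
        writer.updateDocument(new Term("id", r.id()), doc);
    }

    void deleteReview(String id) throws IOException {
        db.delete(id);
        writer.deleteDocuments(new Term("id", id));  // deindex in the same code path
    }

The whole trick is that every write path that touches the database also touches the index, keyed on the same id.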

Because it's not a simple problem space. Lucene has gone through nearly three decades of heavy optimization, feature development, and performance tuning. A lot of brain power has gone into that.

Google bootstrapped the AI revolution as a side effect of figuring out how to do search better. They started by hiring a lot of expert researchers, who then got busy iterating on interesting search-engine-adjacent problems (figuring out synonyms, translations, etc.). In the process they got into running neural networks at scale, figuring out how to leverage GPUs, and eventually building their own TPUs.

The Acquired podcast recently did a great job of outlining the history of Google & Alphabet.

Doing search properly at scale mainly requires a lot of infrastructure, and that's Google's real moat. They get to pay for all of it with an advertising money-printing machine - which, BTW, itself leverages a lot of search algorithms: matching advertisements to content is a search problem, and Google just got really good at it. That's what finances all the innovation in this space, from deep learning to TPUs. Being able to throw a few hundred million at running some experiments is what makes the difference here.

I use the '4get' proxy search engine, which lets you use pretty much every search engine under the sun, for both websites and images. It's really useful because it is faster than Google, and if you need to find some pages you can just switch search engines quickly.

It is open source and there are many instances available; I use '4get.bloat.cat' or '4get.lunar.iu'.

It is a better alternative to SearX, for sure.

  • I checked the about page on 4get.bloat.cat, and within the first paragraph of the "what is this" section it used the phrase "globohomo bullshit". I don't think these are people I want to support.

These days I default to DDG. Not because it's improved but because Google's results are just that bad. Even a couple of years ago I was reaching for Google a lot more often.