Comment by jillesvangurp

11 hours ago

Building a simple text search engine isn't that hard. People show them off on HN on a fairly regular basis. Most of those are fairly primitive. Unfortunately, building a good search engine isn't that straightforward. There's more to it than just implementing BM25 (the go-to ranking algorithm), which you can vibe code in a few minutes these days. That part is easy because it is nineties-era research that is well publicized and documented, and not all that hard once you figure it out.
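
To make "BM25 in a few minutes" concrete, here is a minimal sketch of the scoring function. The whitespace tokenizer, the sample documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration; this is not how any particular engine implements it.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, so it never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation plus document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the quick brown fox",
    "the lazy dog sleeps all day",
    "a quick guide to search ranking",
]
scores = bm25_scores("quick fox", docs)
```

The first document matches both query terms and scores highest, and the second matches none and scores zero. Everything that makes search good in practice (analysis, synonyms, field boosts, phrase matching, typo tolerance) is missing from this.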

Building your own search engine is a nice exercise for understanding how search works. It gets you to the same level as a very long tail of "Elasticsearch alternatives" that implement only a tiny percentage of its feature set. That can be useful as long as you are aware of what you are missing out on.

I've spent a few years consulting for companies moving from in-house coded solutions to something proper (typically Opensearch/Elasticsearch). Usually people paint themselves into a corner: their in-house solution starts simple and then grows more complicated as they inevitably deal with the ranking problems their users encounter. Usual symptoms: "it's slow" (they are doing silly shit with multiple queries against postgres or whatever), "it's returning the wrong things" (it turns out that trigrams aren't a one-size-fits-all solution and return false positives), etc. Add aggregations and other things to the mix and you basically have a perfect use case for Elasticsearch as it was about 10 years ago, before they started making it faster, smarter, and better.
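
A small sketch of the trigram false positive problem: two unrelated words can share enough character trigrams to pass a similarity cutoff. The unpadded trigram extraction, the Jaccard measure, and the `0.3` threshold are assumptions for illustration; real trigram implementations differ in the details.

```python
def trigrams(word):
    """Return the set of character trigrams in `word` (no padding)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def similarity(a, b):
    """Jaccard similarity between the trigram sets of two words."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


THRESHOLD = 0.3  # assumed cutoff, chosen for illustration

score = similarity("contact", "contract")
is_match = score >= THRESHOLD
```

Here "contact" and "contract" share the trigrams `con`, `ont`, and `act`, so a user searching for contact details gets contract documents back. The matching worked as designed; it is just the wrong tool for relevance.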

The usual arguments against Elasticsearch & Opensearch:

"Elasticsearch/Opensearch are hard to run". Reality: there isn't a whole lot to configure these days. Yes, you might want to take care of monitoring, backups, and a few other things, as you would with any server product. But it mostly self-configures. In particular, you shouldn't have to fiddle with heap settings, garbage collection, etc. The out-of-the-box defaults work fine. Get a managed setup if all this scares you; those typically run with the same defaults. Honestly, running postgres is harder. There's way more to configure for that, especially for high availability setups. The hardest part is sizing your VMs correctly and making sure you don't blow through your limits by indexing too much data. Most of your optimizations are going to be at the index mapping level, not in the configuration.

"It's slow". That depends on what you do and how you use it. Most of the simple alternatives have some hard limitations. If you under-engineer your search (poor ranking, lots of false positives), it's probably going to be faster. That's what happens if you skip all the fancy algorithmic stuff that could make your search better. I've seen all the rookie mistakes that people make with Elasticsearch that impact performance. They are usually fairly easy to fix (e.g. let's turn off dynamic mapping and not index all those text fields you never query on that fill up your disk and memory and slow down your indexing ...).

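As a rough illustration of the dynamic mapping fix mentioned above, this sketch builds an index body that disables dynamic mapping and leaves one stored-but-never-queried field unindexed. The field names are hypothetical; `dynamic` and `index` are standard Elasticsearch/Opensearch mapping parameters, but check the documentation for your version before relying on the exact behavior.

```python
import json

# Body for an index creation request. All field names are made up.
index_body = {
    "mappings": {
        # Unknown fields stay in _source but are not indexed or added to the mapping.
        "dynamic": False,
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            "category": {"type": "keyword"},
            # Explicitly mapped, but not searchable and not taking up index space.
            "internal_notes": {"type": "text", "index": False},
        },
    }
}

request_json = json.dumps(index_body, indent=2)
print(request_json)
```

The point is that only the fields you actually query on end up in the index.
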
"I don't need all that fancy stuff". Yes you do. You just don't know it yet because you haven't figured out what's actually needed. Look, if your search isn't great and it doesn't matter, it's all fine. But if search quality matters and you lose users' interest when they fail to find stuff in your app/website, it can quickly become an existential problem. Especially if you have competitors that do much better. That fancy stuff is what you would need to build to solve that.

Unless you employ some hard core search ranking experts, your internally crafted thing is probably not going to be great. If you can afford to run at ~2005 era state of the art (Lucene existed, Solr & Elasticsearch did not, Lucene was fairly limited in scope), then go for it. But it's going to be quite limited when you need those extra features after all.

There are some nice search products out there other than Elasticsearch & Opensearch that I would consider fit for purpose; especially if you want to do vector search. And in fairness, using a search engine properly still requires a bit of skill. But that isn't any different if you do things yourself. Except it involves a lot less wheel reinvention.

There just is a bit of necessary complexity to building a good search product.

Seems like good advice, search has been built quite a few times now :-) I've defaulted to elasticsearch myself.

However, have you tried running any of the "up and coming" alternatives that keep showing up here? In particular, https://github.com/SeekStorm/SeekStorm seems very interesting, though I haven't heard from anyone using it in prod.

  • A red flag for me is that it lists stopword lists as a feature. Those went out of fashion in Lucene/Elasticsearch because of some non trivial but very effective caching and other optimizations around version 5.

    Stopwords are an old school optimization to deal with the problem of high frequency tokens when calculating rankings. Basically that means dealing with long lists of document ids that e.g. contain the word "to". This potentially has a lot of overhead. The solution is to eliminate the need to do so by focusing on ranking clauses with low frequency terms first and caching the results for the low frequency terms. You can eliminate a lot of documents by doing that. This gets rid of most of the overhead for high frequency terms without resorting to simply filtering them out.

    The key test here is queries that consist of stop words, like "to be or not to be" to find documents about Hamlet. If you filter out all the stop words, you are not going to get the right results on top.

    Just an example of where Seekstorm can probably do better. I have no direct experience with it though. So, maybe they do have a solution for that.

    But you should treat the need for stop word lists as a red flag for probably fairly immature takes on this problem space. Elasticsearch doesn't need those anymore. Also, what do stop word lists look like if you need multi lingual support? Who maintains these lists for different languages? Do you have language experts on your search team doing that for all the languages you need to support? People always forget about operational overhead like that. Stop word lists are fairly ineffective if you don't have people curating them and it creates obvious issues with certain queries.
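
A toy sketch of the two points above: stop word filtering erases the Hamlet query entirely, while intersecting the shortest posting lists first keeps the common terms cheap without dropping them. The stop word list, the documents, and the AND-only matching are assumptions for illustration; this is not a description of what Lucene does internally.

```python
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "is"}  # assumed list

docs = {
    1: "to be or not to be that is the question",
    2: "how to be a better cook",
    3: "the history of the roman empire",
}


def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index


def strip_stop_words(query):
    return [t for t in query.split() if t not in STOP_WORDS]


def match_all(query_terms, index):
    """AND-match the terms, intersecting the rarest terms first."""
    postings = sorted((index.get(t, set()) for t in set(query_terms)), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # nothing left to match, skip the long lists entirely
    return result


index = build_index(docs)
query = "to be or not to be"

filtered_terms = strip_stop_words(query)          # every term is a stop word
filtered_hits = match_all(filtered_terms, index)  # nothing to search for
full_hits = match_all(query.split(), index)       # finds the Hamlet document
```

With stop words removed there is no query left, so there is nothing to rank. Keeping the terms and starting from the rare ones ("or", "not") narrows the candidates to a single document before the long lists for "to" and "be" are touched.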

> "I don't need all that fancy stuff". Yes you do.

> let's turn off dynamic mapping and not index all those text fields you never query on

What do you think about ManticoreSearch? It has been around longer than Lucene.

  • I have no experience with ManticoreSearch but they've been around for a while. I think it might be a Sphinx spinoff; Sphinx was a long abandoned Solr-like search engine written in C++ that they seem to have forked (correct me if I'm wrong). Mainly popular for some ecommerce use cases (as is the case with Solr). Looking at their front page, I don't see any compelling reason to switch, and there are a couple of things that I don't like:

    - GPLv3: better than AGPLv3 in Elasticsearch but less permissive than Apache 2.0 in Opensearch.

    - They seem to emphasize being a drop in replacement a lot. Which raises the question: why not just stick with Opensearch.

    - I'm very skeptical of benchmarks in this space. Mostly they are apples and oranges comparisons. As I argued earlier, it mainly raises the question of what they are not doing or skipping. Barring major algorithmic improvements, which Lucene developers could just copy if they're valid, I don't see how they could be better/faster. And Lucene is of course well known to be heavily optimized and still squeezing out a lot of performance from release to release. Progress has been pretty substantial in v8 and v9 in recent years.

    Other than that, the best I can say about it is that they seem to know what they are doing.

    • Yeah, it forked from sphinx when sphinx died or rugpulled or something like that. Seems to be very actively maintained.

      I think one of the main draws is that it is a single binary rather than the complexity of ES/OS, JVM etc... And if you have a mysql/mariadb database, it just connects and automatically ingests extremely quickly. They also use Galera for replication, but I also think it's not as explicitly sharded, which simplifies things.

      Yeah, their benchmarks are astounding, so much so that they are hard to believe. Yet I have seen them be quite open to feedback, collaboration, etc., so...

      Anyway, thanks for your thoughts and insights!