Comment by jillesvangurp

8 hours ago

A red flag for me is that it lists stopword lists as a feature. Those went out of fashion in Lucene/Elasticsearch because of some non-trivial but very effective caching and other optimizations introduced around version 5.

Stopwords are an old-school optimization for dealing with high-frequency tokens when calculating rankings. Basically, that means dealing with long lists of document IDs, e.g. for every document that contains the word "to". This potentially has a lot of overhead. The solution is to avoid that work by evaluating the clauses with low-frequency terms first and caching the results for those terms. You can eliminate a lot of documents by doing that. This gets rid of most of the overhead for high-frequency terms without resorting to simply filtering them out.
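To make the idea concrete, here is a toy sketch of intersecting posting lists rarest-first for a query where every term must match. It is not how Lucene or Elasticsearch actually implement it; the names `build_index` and `search_all_terms` and the sample documents are made up for illustration, and caching is left out.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return doc ids containing every query term, intersecting rarest terms first."""
    terms = set(query.lower().split())
    # Sort by document frequency so the shortest posting list is processed first.
    ordered = sorted(terms, key=lambda t: len(index.get(t, ())))
    candidates = None
    for term in ordered:
        postings = index.get(term, set())
        candidates = set(postings) if candidates is None else candidates & postings
        if not candidates:
            break  # nothing can match anymore; the longer lists are never touched
    return candidates or set()

# Hypothetical sample corpus: "to" appears everywhere, "be" and "or" are rare.
docs = {
    1: "to be or not to be that is the question",
    2: "how to cook rice",
    3: "a guide to the city",
    4: "not to worry",
}
index = build_index(docs)
result = search_all_terms(index, "to be or not to be")
```

By the time the long posting list for "to" is reached, the candidate set has already shrunk to a single document, so the frequent term costs almost nothing and still takes part in matching.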

The key test here is queries that consist of stop words, like "to be or not to be" to find documents about Hamlet. If you filter out all the stop words, you are not going to get the right results on top.
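A minimal sketch of why that query is the key test, assuming a small hypothetical English stopword list (real lists are longer, and `filter_stopwords` is an illustrative helper, not any engine's API):

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "is"}

def filter_stopwords(query, stopwords=STOPWORDS):
    """Drop every query token that appears in the stopword list."""
    return [t for t in query.lower().split() if t not in stopwords]

remaining = filter_stopwords("to be or not to be")
hamlet = filter_stopwords("hamlet to be or not to be")
```

With this list, `remaining` is empty, so there is nothing left to rank on, and the second query collapses to just `hamlet`, losing the phrase entirely.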

Just an example of where SeekStorm can probably do better. I have no direct experience with it, though, so maybe they do have a solution for that.

But you should treat the need for stop word lists as a red flag for probably fairly immature takes on this problem space. Elasticsearch doesn't need those anymore. Also, what do stop word lists look like if you need multilingual support? Who maintains these lists for different languages? Do you have language experts on your search team doing that for all the languages you need to support? People always forget about operational overhead like that. Stop word lists are fairly ineffective if you don't have people curating them, and they create obvious issues with certain queries.

The stopword list in SeekStorm is purely optional; by default it is empty.

The query "to be or not to be" that you mentioned, consisting solely of stopwords, returns complete results and performs quite well in the benchmark: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#be...

Both Lucene and Elastic still offer stopword filters: https://lucene.apache.org/core/10_3_2/analysis/common/org/ap... https://www.elastic.co/docs/reference/text-analysis/analysis...