Comment by maleldil

7 months ago

Isn't Marginalia playing a completely different game from Kagi? AFAIK, Marginalia isn't trying to be a general-purpose search engine.

PS: Lovely username =)

(1) Marginalia can get away with it because it is searching a smaller collection over which it is easier to manage spam. On the other hand, Matt Cutts became a hero at Google not because he built models for filtering unwanted content but because he figured out how to motivate people to make the labels to train that sort of model.

(2) One of the most depressing experiences of my life was reading through the first ten years or so of TREC conference proceedings, looking for something useful to improve the search engines I was building. Eventually I found a volume that summarizes the handful of useful results they got in those first ten years: https://mitpress.mit.edu/9780262220736/trec/

Advances in search quality are rare and come along about once a decade; BM25 was such an advance, on paper. Even though BM25 is in Elasticsearch and a lot of other products, very few people take advantage of it because they don't want to do the parameter tuning it requires to get superior results.
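To make the tuning point concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. The function name `bm25_scores`, the toy documents, and the IDF variant are my own choices for illustration, not taken from any particular product; `k1` (term-frequency saturation) and `b` (length normalization) are the two parameters people usually leave at their defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 controls how quickly repeated terms stop adding score;
    b controls how strongly long documents are penalized (0 = not at all).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive (one common variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection: a short doc, a long repetitive doc, an unrelated doc.
docs = [
    "small web search engine".split(),
    "search search search engine spam spam spam spam filtering at scale".split(),
    "punch cards and early information retrieval".split(),
]
query = "search engine".split()

for k1, b in [(1.2, 0.75), (1.2, 0.0), (0.0, 0.75)]:
    print(f"k1={k1} b={b}", [round(s, 3) for s in bm25_scores(query, docs, k1, b)])
```

With the defaults the short document outranks the long repetitive one; with `b=0` the order flips; with `k1=0` term counts stop mattering and the two tie. Which of those is "right" depends on the collection, which is why the tuning has to be done against your own relevance judgments.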

SBERT (https://sbert.net/) is a similar once-in-a-decade advance that actually works out of the box with relatively little tuning. It doesn't address all the issues of search and should be integrated with more traditional search, but if you are building a search engine in 2024 you can expect to wait another 10 years for another advance like that.
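One simple way to combine an embedding-based ranking with a traditional keyword ranking is reciprocal rank fusion (RRF). That is my pick for illustration, not something the tools above prescribe, and the document ids and the two input rankings below are made up. The sketch assumes you already have one ranked list from a keyword engine (e.g. BM25) and one from an embedding model.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into a single ranking.

    Each document gets 1 / (k + rank) from every list it appears in;
    k=60 is a commonly used default, not a tuned value.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by id so the output is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical outputs of the two retrievers for the same query.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers like float to the top, and a document found by only one of them (here `doc_d`, which the keyword search missed) still makes it into the results.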

(3) Marginalia particularly interests me because it is a small collection, and the problems of search over a small collection are very different from those over a large collection. Gerard Salton started IR research with a deck of punch cards, and he thought 80 documents was a lot. With 80 documents you are going to be very concerned about missing relevant documents because you didn't pick the right word. If you have 80,000,000,000 documents you have a very different problem. My take is that SBERT and related techniques are particularly effective against small-collection problems.