
Comment by marginalia_nu

9 hours ago

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. Probably perfectly fine if you're indexing your own website, but will likely choke on something the size of English wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/
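
To make the "very simple idea" concrete, here is a toy inverted index in Python. It is a sketch of the textbook technique, not anyone's production code; the documents and tokenisation are made up:

    # Toy inverted index: map each term to the set of document ids containing it.
    from collections import defaultdict

    docs = {
        1: "the quick brown fox",
        2: "the lazy dog",
        3: "quick brown dogs are rare",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def search(query):
        # AND semantics: keep only documents containing every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        hits = set(index.get(terms[0], set()))
        for term in terms[1:]:
            hits &= index.get(term, set())
        return hits

    print(search("quick brown"))  # {1, 3}

Everything hard about a real engine is what happens when that dictionary has a few billion entries and the query is two vague words.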

I think in today's world the harder problem is evading SEO spam. A search engine is in a constant war with adversarial players who need you to see their content for revenue, rather than the answer you were actually looking for.

This necessitates a constant game of cat and mouse, where you adjust your quality metrics so that SEO shops can't figure them out and capitalise on them.

  • There are more kinds of search engines than just internet search engines. At this point I'm almost certain that the non-internet search engines of the world are much larger than internet search engines.

    Edit: And I'm getting downvoted for this. If it's because my comment is tangential to the original one, then that's fair. If it's because you think I'm wrong: I have worked on the two largest internet search engines in the world, and on one non-internet search engine that dwarfed both in size (though it differed in complexity).

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

Large amounts of data seem obviously difficult.

For your second difficulty, "handling underspecified queries": it seems to me that's a subset of the problem of, "given a query, what are the most relevant results?" That problem seems very tricky, partially because there is no exact true answer.

Marginalia Search is great as a contrast to engines like Google, in part because Google chooses to display advertisements as the most relevant results.

Have you found any of the TREC papers helpful?

https://trec.nist.gov/

What is the order of magnitude of the largest document store that you could practically serve from SQLite on a single thousand-dollar server, for some text-heavy business process? For text search, roughly how big a corpus can we practically search if we're allowing... let's say five seconds per query, twelve queries per minute?

  • If you held a gun to my head and forced me to make a guess, I'd say you could push that approach (something like the FTS5 sketch below) to the order of 100K, maybe 1M documents.

    If SQLite had a generic "strictly ascending sequence of integers" type[1] and optimised around it, you could probably push it further by implementing efficient inverted indexes.

    [1] Primary-key tables aren't really useful here.
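
    For reference, one concrete version of that DBMS-backed baseline is SQLite's built-in FTS5 module. A minimal sketch, with a made-up schema and sample rows:

        # Full-text search in SQLite via the FTS5 extension (compiled into most builds).
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
        conn.executemany(
            "INSERT INTO docs (title, body) VALUES (?, ?)",
            [
                ("Inverted index", "A mapping from terms to the documents containing them."),
                ("SQLite", "A small, fast, self-contained SQL database engine."),
            ],
        )
        # bm25() is FTS5's built-in relevance function; smaller values rank higher.
        rows = conn.execute(
            "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
            ("index",),
        ).fetchall()
        print(rows)

    That gets you a long way at the scale above; the point is that it stops being enough somewhere around a Wikipedia-sized corpus.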

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

I would expect the difficulty to be deciding which item to return when there are multiple that contain the search term. Is Wikipedia's article on Gilligan's Island better than some guy's blog post? Or is that guy a fanatic who has spent his entire life pondering whether Wrongway Feldman was malicious or how Irving met Bingo Bango and Bongo?

Add in rank hacking, keyword stuffing, etc. and it seems like a very hard problem, while scaling... is scaling? ¯\_(ツ)_/¯
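
One classic answer to the "Wikipedia article vs. some guy's blog" question is link-based authority, PageRank being the textbook example (and link farms being the corresponding rank hack). A toy power-iteration sketch over a made-up link graph, not any engine's actual ranking:

    # Toy PageRank: repeatedly redistribute rank along outgoing links.
    links = {
        "wikipedia": ["blog", "fansite"],
        "blog": ["wikipedia"],
        "fansite": ["wikipedia", "blog"],
    }

    damping = 0.85
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))

Keyword stuffing and link farms both attack whatever signal you use, which is the cat-and-mouse game described further up the thread.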

  • Elastic and many others fail to solve this problem too. There are many different strategies and many of them require ingenuity and development.

    • It's not like Elasticsearch lacks ranking algorithms and control thereof. But they can require tuning and adjustment for various domains (see the BM25 sketch below). Relevancy is, after all, subjective.
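
      For context, BM25 is the default similarity in Lucene and Elasticsearch, and much of that tuning comes down to its k1 and b parameters (plus per-field boosts). A self-contained toy scorer, not Elasticsearch's actual implementation:

          # Toy BM25: k1 caps how much repeated terms help, b controls length normalisation.
          import math

          def bm25(query, doc, corpus, k1=1.2, b=0.75):
              avg_len = sum(len(d) for d in corpus) / len(corpus)
              score = 0.0
              for term in query:
                  df = sum(1 for d in corpus if term in d)   # documents containing the term
                  if df == 0:
                      continue
                  idf = math.log(1 + (len(corpus) - df + 0.5) / (df + 0.5))
                  tf = doc.count(term)                       # occurrences in this document
                  score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
              return score

          corpus = [
              "gilligan island sitcom about castaways".split(),
              "gilligan island gilligan island gilligan island keyword stuffing".split(),
          ]
          for doc in corpus:
              print(doc[:2], round(bm25("gilligan island".split(), doc, corpus), 3))

      Adjusting k1, b, and field boosts per domain is exactly the kind of subjective tuning meant above.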

Thank you very much for the recommendation. I am in the process of building knowledge-base bots, and am confronted with the task of creating various crawlers for the different sources the company has, so this book comes in very handy.
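
Since the thread ends on crawlers: a bare-bones, same-host crawler sketch using only the Python standard library. The start URL is a placeholder, there is no robots.txt handling or rate limiting, and none of this comes from the book; it is just the shape of the task:

    # Minimal same-host crawler: fetch pages breadth-first and collect their text.
    import urllib.request
    import urllib.parse
    from collections import deque
    from html.parser import HTMLParser

    class LinkAndTextParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links, self.text = [], []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)
        def handle_data(self, data):
            self.text.append(data)

    def crawl(start_url, max_pages=10):
        host = urllib.parse.urlparse(start_url).netloc
        seen, queue, pages = set(), deque([start_url]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue
            parser = LinkAndTextParser()
            parser.feed(html)
            pages[url] = " ".join(parser.text)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href)
                if urllib.parse.urlparse(absolute).netloc == host:
                    queue.append(absolute)
        return pages

    # pages = crawl("https://example.com")  # placeholder URL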