
Comment by marginalia_nu

9 hours ago

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. Probably perfectly fine if you're indexing your own website, but will likely choke on something the size of English wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/
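
To make the "very simple idea" concrete, here is a toy inverted index in Python. It is a sketch of the textbook technique, not anyone's production code; the documents and tokenisation are made up:

    # Toy inverted index: map each term to the set of document ids containing it.
    from collections import defaultdict

    docs = {
        1: "the quick brown fox",
        2: "the lazy dog",
        3: "quick brown dogs are rare",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def search(query):
        # AND semantics: keep only documents containing every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        hits = set(index.get(terms[0], set()))
        for term in terms[1:]:
            hits &= index.get(term, set())
        return hits

    print(search("quick brown"))  # {1, 3}

Everything hard about a real engine is what happens when that dictionary has a few billion entries and the query is two vague words.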

I think in today's world the harder problem is evading SEO spam. A search engine is in a constant war with adversarial players who need you to see their content for revenue, rather than the answer you were actually looking for.

This necessitates a constant game of cat and mouse, where you adjust your quality metrics so that SEO shops can't figure them out and capitalise on them.

  • There are more kinds of search engines than just internet search engines. At this point I'm almost certain that the non-internet search engines of the world are much larger than internet search engines.

    Edit: And I'm getting downvoted for this. If it's because my comment is tangential to the original one, then that's fair. If it's because you think I'm wrong: I have worked on the two largest internet search engines in the world, and on one non-internet search engine that dwarfed both in size (though it differed in complexity).

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

Large amounts of data seem obviously difficult.

For your second difficulty, "handling underspecified queries": it seems to me that's a subset of the problem of, "given a query, what are the most relevant results?" That problem seems very tricky, partially because there is no exact true answer.

Marginalia Search is great as a contrast to engines like Google, in part because Google chooses to display advertisements as the most relevant results.

Have you found any of the TREC papers helpful?

https://trec.nist.gov/

What is the order of magnitude of the largest document store that you could practically serve from SQLite on a single thousand-dollar server, for some text-heavy business process? For text search, roughly how big a corpus can we practically search if we're allowing... let's say five seconds per query, twelve queries per minute?

  • If you held a gun to my head and forced me to make a guess, I'd say you could push that approach (something like the FTS5 sketch below) to the order of 100K, maybe 1M documents.

    If SQLite had a generic "strictly ascending sequence of integers" type[1] and optimised around it, you could probably push it further by implementing efficient inverted indexes.

    [1] Primary-key tables aren't really useful here.
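
    For reference, one concrete version of that DBMS-backed baseline is SQLite's built-in FTS5 module. A minimal sketch, with a made-up schema and sample rows:

        # Full-text search in SQLite via the FTS5 extension (compiled into most builds).
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
        conn.executemany(
            "INSERT INTO docs (title, body) VALUES (?, ?)",
            [
                ("Inverted index", "A mapping from terms to the documents containing them."),
                ("SQLite", "A small, fast, self-contained SQL database engine."),
            ],
        )
        # bm25() is FTS5's built-in relevance function; smaller values rank higher.
        rows = conn.execute(
            "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
            ("index",),
        ).fetchall()
        print(rows)

    That gets you a long way at the scale above; the point is that it stops being enough somewhere around a Wikipedia-sized corpus.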

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

I would expect the difficulty to be deciding which item to return when there are multiple that contain the search term. Is Wikipedia's article on Gilligan's Island better than some guy's blog post? Or is that guy a fanatic who has spent his entire life pondering whether Wrongway Feldman was malicious or how Irving met Bingo Bango and Bongo?

Add in rank hacking, keyword stuffing, etc. and it seems like a very hard problem, while scaling... is scaling? ¯\_(ツ)_/¯
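
One classic answer to the "Wikipedia article vs. some guy's blog" question is link-based authority, PageRank being the textbook example (and link farms being the corresponding rank hack). A toy power-iteration sketch over a made-up link graph, not any engine's actual ranking:

    # Toy PageRank: repeatedly redistribute rank along outgoing links.
    links = {
        "wikipedia": ["blog", "fansite"],
        "blog": ["wikipedia"],
        "fansite": ["wikipedia", "blog"],
    }

    damping = 0.85
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))

Keyword stuffing and link farms both attack whatever signal you use, which is the cat-and-mouse game described further up the thread.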

  • Elastic and many others fail to solve this problem too. There are many different strategies and many of them require ingenuity and development.

    • It's not like Elasticsearch lacks ranking algorithms and control thereof. But they can require tuning and adjustment for various domains (see the BM25 sketch below). Relevancy is, after all, subjective.
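
      For context, BM25 is the default similarity in Lucene and Elasticsearch, and much of that tuning comes down to its k1 and b parameters (plus per-field boosts). A self-contained toy scorer, not Elasticsearch's actual implementation:

          # Toy BM25: k1 caps how much repeated terms help, b controls length normalisation.
          import math

          def bm25(query, doc, corpus, k1=1.2, b=0.75):
              avg_len = sum(len(d) for d in corpus) / len(corpus)
              score = 0.0
              for term in query:
                  df = sum(1 for d in corpus if term in d)   # documents containing the term
                  if df == 0:
                      continue
                  idf = math.log(1 + (len(corpus) - df + 0.5) / (df + 0.5))
                  tf = doc.count(term)                       # occurrences in this document
                  score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
              return score

          corpus = [
              "gilligan island sitcom about castaways".split(),
              "gilligan island gilligan island gilligan island keyword stuffing".split(),
          ]
          for doc in corpus:
              print(doc[:2], round(bm25("gilligan island".split(), doc, corpus), 3))

      Adjusting k1, b, and field boosts per domain is exactly the kind of subjective tuning meant above.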

Thank you very much for the recommendation. I am in the process of building knowledge-base bots, and am confronted with the task of creating various crawlers for the different sources the company has, so this book comes in very handy.
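
Since the thread ends on crawlers: a bare-bones, same-host crawler sketch using only the Python standard library. The start URL is a placeholder, there is no robots.txt handling or rate limiting, and none of this comes from the book; it is just the shape of the task:

    # Minimal same-host crawler: fetch pages breadth-first and collect their text.
    import urllib.request
    import urllib.parse
    from collections import deque
    from html.parser import HTMLParser

    class LinkAndTextParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links, self.text = [], []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)
        def handle_data(self, data):
            self.text.append(data)

    def crawl(start_url, max_pages=10):
        host = urllib.parse.urlparse(start_url).netloc
        seen, queue, pages = set(), deque([start_url]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue
            parser = LinkAndTextParser()
            parser.feed(html)
            pages[url] = " ".join(parser.text)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href)
                if urllib.parse.urlparse(absolute).netloc == host:
                    queue.append(absolute)
        return pages

    # pages = crawl("https://example.com")  # placeholder URL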