Comment by azornathogron

1 month ago

Is crawling really solved?

Any naive crawler is going to run into the problem that servers can give different responses to different clients, which means a site can show the crawler something different from what it shows real users. That turns crawling into an adversarial problem: crawler developers need to be continually on the lookout for new ways that servers try to poison or mislead the index.

Otherwise you'll return junk results from spammers who lied to the crawler.
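
To make the "different responses to different clients" problem concrete, here is a minimal sketch of one possible check: fetch the same URL twice with different User-Agent headers and compare the bodies. The User-Agent strings, the function names, the 3-word shingle size and the 0.5 threshold are all assumptions picked for illustration, not something a real search engine is known to use.

```python
import urllib.request

# Hypothetical User-Agent strings, chosen for illustration only.
CRAWLER_UA = "ExampleBot/1.0"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def fetch(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Fetch a page body while presenting the given User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word windows in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts collapse to a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (1.0 means identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def looks_cloaked(crawler_body: str, browser_body: str,
                  threshold: float = 0.5) -> bool:
    """Flag a page when the two views overlap less than the threshold."""
    return similarity(crawler_body, browser_body) < threshold
```

Usage would be something like `looks_cloaked(fetch(url, CRAWLER_UA), fetch(url, BROWSER_UA))`. This is only one weak signal: a server can also key on IP ranges or on whether JavaScript runs, and legitimately dynamic pages will differ between fetches, so a real system would presumably need much more than this.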

I've never done it, so maybe it's easier than I imagine, but I wouldn't be quick to assume that crawling is solved.

2 comments

senko 1 month ago

I don't mean to say it's trivial. I'm sure there are many hard problems, such as the one you mention, though that particular one is more part of "cleaning the index", which might work on top of the open common corpus.

But my impression is that it's more a question of scale and engineering time than having to invent something new.

(Disclaimer: I've also never worked on an internet-scale search system, so maybe I'm way off base here as well.)

azornathogron 1 month ago

Oh, ok. I misunderstood - I think we agree.