Comment by miki123211
15 hours ago
You conflate web crawling for inference with web crawling for training.
Web crawling for training is when you ingest content at massive scale, usually indiscriminately and usually with a dumb crawler for scale's sake, in order to train an LLM. You don't really care whether one particular website is in the dataset (unless it's the size of Reddit); you just want a large, diverse, high-quality data mix.
Web crawling for inference is when a user asks a targeted question, you do a web search, and you fetch exactly those resources that are likely to be relevant to that search. Nothing ends up in the training data; it's just context enrichment.
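To make the difference concrete, here's a rough toy sketch, not how any real crawler or assistant is built. Everything in it is made up for illustration: `FAKE_WEB` is an in-memory stand-in for the web, `search` is a naive keyword match standing in for a search engine, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example/home": ("welcome page about gardening",
                       ["a.example/roses", "b.example/rust"]),
    "a.example/roses": ("how to prune roses in spring", []),
    "b.example/rust": ("rust borrow checker tutorial", ["a.example/home"]),
}

def crawl_for_training(seeds):
    """Bulk crawl: follow every link and keep every page for the corpus."""
    corpus, seen, queue = [], set(), deque(seeds)
    while queue:
        url = queue.popleft()
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        text, links = FAKE_WEB[url]
        corpus.append(text)   # everything reachable ends up in the dataset
        queue.extend(links)   # no relevance check, just breadth-first expansion
    return corpus

def search(query):
    """Stand-in for a search engine: pages sharing at least one word with the query."""
    words = set(query.lower().split())
    return [url for url, (text, _) in FAKE_WEB.items() if words & set(text.split())]

def fetch_for_inference(query, limit=2):
    """Targeted fetch: only pages matching the query, used once as prompt context."""
    pages = [FAKE_WEB[url][0] for url in search(query)[:limit]]
    return "Context:\n" + "\n".join(pages) + "\n\nQuestion: " + query

training_corpus = crawl_for_training(["a.example/home"])
prompt = fetch_for_inference("how do I prune roses")
```

In the toy version, the bulk crawl pulls in all three pages regardless of topic, while the inference path touches only the one page that matches the question and keeps nothing afterwards.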
People have a much bigger issue with crawling for training than with crawling for inference (though I personally think both are equally OK).