Comment by k__
2 days ago
"respecting web crawling opt-outs during data acquisition produces virtually no performance degradation"
Great to read that!
I wonder if the reason for these results is that any data on the internet is already copied to other locations by actors who ignore crawling opt-outs. So, even if a crawler respects all web crawling opt-outs, it still effectively gets the data, because someone who did not respect the opt-out has already re-hosted it on a page that carries no opt-out of its own.
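(For context, "respecting a crawling opt-out" at acquisition time usually boils down to checking robots.txt before fetching a page. A minimal sketch with Python's standard urllib.robotparser; the bot name and URLs here are just placeholders:)

    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleResearchBot"  # placeholder bot name

    def allowed_to_fetch(page_url: str, robots_url: str) -> bool:
        # Parse the site's robots.txt and ask whether our user agent
        # is permitted to fetch the given page.
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(USER_AGENT, page_url)

    if allowed_to_fetch("https://example.com/article", "https://example.com/robots.txt"):
        print("fetch it")
    else:
        print("skip it: the site has opted out")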
Yes, this is an interesting question. In our arXiv paper [1] we studied this for news articles, and we also removed duplicate articles (decontamination). We did not observe an impact on the downstream accuracy of the LLM in the case of news data.
[1] https://arxiv.org/abs/2504.06219
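(To make the duplicate-removal step concrete: the sketch below is only an illustrative exact-duplicate filter over article text, not the actual decontamination pipeline from the paper.)

    import hashlib

    def drop_exact_duplicates(articles):
        # Hash whitespace-normalized, lowercased text so trivially
        # re-hosted copies of the same article collapse to one entry.
        seen = set()
        unique = []
        for text in articles:
            key = hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append(text)
        return unique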
My guess is that it doesn't remove that much of the data, and that the post-training data (which is not just randomly scraped from the web) probably matters more.
Is there not already a source where the web has been scraped and boiled down to just the text? It would seem someone would have created such a thing to save LLM training from having to reinvent the wheel.
I understand the web is a dynamic thing, but it would still seem useful on some level.
Common Crawl, maybe?
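Its WET files are pretty much that: the crawl with the markup stripped down to extracted plain text. A rough sketch of reading one with the warcio library (the file path is a placeholder; real listings are on the Common Crawl site):

    from warcio.archiveiterator import ArchiveIterator

    # WET files store one "conversion" record per page: the URL plus
    # the plain text extracted from the original HTML response.
    with open("example.warc.wet.gz", "rb") as stream:  # placeholder file
        for record in ArchiveIterator(stream):
            if record.rec_type == "conversion":
                url = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                print(url, len(text))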
No performance degradation on the training metrics, maybe, but there is for the end user. At the end of the day, users and website owners have completely opposed interests. Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master.
> Users want answers and content, website owners want attention so they can upsell/push ads. You can only serve one master
How are you going to serve users if website owners decide to wall off their content? You can't ignore one side of the market.
You don't. You bypass them with crawlers and don't reveal your training data. And this is exactly why open source models can't surpass open weight models.