← Back to context

Comment by pm215

2 days ago

I'm curious about whether there are well coded AI scrapers that have logic for "aha, this is a git forge, git clone it instead of scraping, and git fetch on a rescrape". Why are there apparently so many naive (but still coded to be massively parallel and botnet like, which is not naive in that aspect) crawlers out there?

I'm not an industry insider and not the source of this fact, but it's been previously stated that traffic costs to fetch the current data for each training run is cheaper then caching it in any way locally - wherever it's a git repo, static sites or any other content available through http

  • This seems nuts and suggests maybe the people selling AI scrapers their bandwidth could get away with charging rather more than they do :)

I'd see this as coming down to incentive. If you can scrape naively and it's cheap, what's the benefit to you in doing something more efficient for git forge? How many other edge cases are there where you could potentially save a little compute/bandwidth, but need to implement a whole other set of logic?

Unfortunately, this kind of scraping seems to inconvenience the host way more than the scraper.

Another tangent: there probably are better behaved scrapers, we just don't notice them as much.

If they're handling it as “website, don't care” (because they're training on everything online) they won't know.

If they're treating it specifically on “code forge” (because they're after coding use cases), there's lots of interesting information that you won't get by just cloning a repo.

It's not just the current state of the repo, or all commits (and their messages). It's the initial issue (and discussion) that lead to a pull request (and review comments) that eventually gets squashed into a single commit.

The way you code with an agent is a lot more similar to the: issue, comments, change, review, refinement sequence; that you get by slurping the website.

True, and it doesn't get mentioned enough. These supposedly world-changing advanced tech companies sure look sloppy as hell from here. There is no need for any of this scraping.