Comment by TonyStr

7 hours ago

Interestingly, I looked at GitHub Insights and found that this repo had 49 clones from 28 unique cloners before I published this article. I definitely did not clone it 49 times, and certainly not as 28 unique users. It's unlikely that the handful of friends who follow me on GitHub all cloned the repo. So I can only speculate that there are bots scraping new public GitHub repos and training on everything.
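
If you want to check your own repo, here's a minimal sketch using GitHub's REST traffic endpoint; GITHUB_TOKEN and the owner/repo names are placeholders you'd substitute:

    import json
    import os
    import urllib.request

    # Minimal sketch: read a repo's clone stats from the GitHub REST API.
    # Needs a personal access token with push access in $GITHUB_TOKEN;
    # "owner" and "repo" are placeholders for your own repository.
    owner, repo = "owner", "repo"
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/traffic/clones",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    print(f"clones: {data['count']}, unique cloners: {data['uniques']}")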

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in Rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by ChatGPT, along with comparisons to similar algorithms).

Particularly on GitHub, it might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.).
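
As a rough illustration of what those scanners grep for (the AKIA/ASIA prefix is the documented shape of AWS access key IDs; everything else here is just illustrative):

    import re
    import sys
    from pathlib import Path

    # Illustrative sketch of a secret scanner: walk a checkout and flag
    # well-known token shapes. AWS access key IDs start with "AKIA"
    # (or "ASIA" for temporary credentials) followed by 16 uppercase
    # alphanumeric characters.
    PATTERNS = {
        "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
        "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    }

    for path in Path(sys.argv[1]).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                print(f"{path}: possible {name}")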

I self-host Gitea. The instance is crawled by AI crawlers (I checked the IPs). They never clone; they just browse and take the code directly from there.

  • For reference, this is how I do it in my Caddyfile:

       (block_ai) {
           @ai_bots {
               header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
           }
    
           abort @ai_bots
       }
    

    Then, in a specific app block, include it via:

       import block_ai

    • I have almost exactly this in my own Caddyfile :-D The order of the items in the regex is a little different, but it's mostly the same list. I just pulled them from my web access logs over time and update it every once in a while, something like the sketch below.
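
      A minimal sketch of that log mining, assuming a combined-format access log where the User-Agent is the last quoted field (Caddy's default JSON logs would need json.loads per line instead, and the log path here is a placeholder):

          import re
          from collections import Counter

          # Count bot-looking User-Agent strings in an access log.
          UA_FIELD = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')
          BOT_HINT = re.compile(r"(?i)bot|crawl|spider|gpt|claude|anthropic")

          counts = Counter()
          with open("/var/log/webserver/access.log") as log:  # placeholder path
              for line in log:
                  m = UA_FIELD.search(line)
                  if m and BOT_HINT.search(m.group("ua")):
                      counts[m.group("ua")] += 1

          for ua, n in counts.most_common(20):
              print(f"{n:6d}  {ua}")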

  • i run a cgit server on an r720 in my apartment with my code on it and that puppy screams whenever sam wants his code

    blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're the only ones i had to block to stay sane

Time to start including deliberate bugs. The correct version is in a private repository.

  • while I think this is a fun idea -- we are in such a dystopian timeline that I fear you will end up being prosecuted under a digital equivalent of various laws like "why did you attack the intruder instead of fleeing" or "you can't simply remove a squatter because it's your house, therefore you get an assault charge."

    A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(

    • I think if we're at the point where posting deliberate mistakes to poison training data is considered a crime, we would be far far far down the path of authoritarian corporate regulatory capture, much farther than we are now (fortunately).

    • Look, I get the fantasy of someday pulling out my musket^W ar15 and rushing downstairs to blow away my wife^W an evil intruder, but, like, we live in a society. And it has a lot of benefits, but it does mean you don't get to be "king of your castle" any more.

      Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.

      There's a reason we have attractive nuisance laws and you aren't allowed to put a slide in your yard that electrocutes anyone who touches it.

      None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.

I don't really get why they need to clone in order to scrape ...?

> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

That's very much expected. That's why the quality of LLM coding agents is the way it is. (No offense.)

The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.

  • Cloning gets you the raw text objects directly. If you scrape the web UI, you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data, you usually want the text to be as clean as possible from the start.
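
    To put a rough number on that overhead, a quick sketch comparing the rendered page with the raw blob for the same file (any public file works; the Linux kernel README is just a handy example):

        import urllib.request

        # Fetch the same file as the rendered GitHub page and as the
        # raw blob, then compare sizes to see the markup overhead.
        def size(url: str) -> int:
            with urllib.request.urlopen(url) as resp:
                return len(resp.read())

        html = size("https://github.com/torvalds/linux/blob/master/README")
        raw = size("https://raw.githubusercontent.com/torvalds/linux/master/README")
        print(f"HTML page: {html} bytes, raw file: {raw} bytes, "
              f"ratio: {html / raw:.0f}x")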