Comment by TonyStr

24 days ago

Interestingly, I looked at github insights and found that this repo had 49 clones, and 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not with 28 unique users. It's unlikely that the handful of friends who follow me on github all cloned the repo. So I can only speculate that there are bots scraping new public github repos and training on everything.

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).

23 comments

TonyStr

tonnydourado 24 days ago

Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)

Phelinofist 24 days ago

I selfhost Gitea. The instance is crawled by AI crawlers (checked the IPs). They never cloned, they just browse and take it directly from there.

Phelinofist 23 days ago
For reference, this is how I do it in my Caddyfile:
(block_ai) { @ai_bots { header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot) } abort @ai_bots }
Then, in a specific app block include it via
import block_ai
- seba_dos1 23 days ago
  
  Most of then pretend to be real users though and don't identify themselves with their user agent strings.
- zaphar 23 days ago
  
  I have almost exactly this in my own caddyfile :-D The order of the items in the regex is a little different but mostly the same items. I just pulled them from my web access logs over time and update it every once in a while.
Zambyte 23 days ago
i run a cgit server on an r720 in my apartment with my code on it and that puppy screams whenever sam wants his code
blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're they only ones i had to block to stay sane
- MarsIronPI 23 days ago
  
  Have you considered putting it behind Anubis or an equivalent?
  
  2 replies →

nerdponx 24 days ago

Time to start including deliberate bugs. The correct version is in a private repository.

teiferer 23 days ago
And what purpose would this serve, exactly?
- adastra22 23 days ago
  
  Spite.
below43 23 days ago

They used to do this with maps - eg. fake islands - to pick up when they were copied.
program_whiz 23 days ago
while I think this is a fun idea -- we are in such a dystopian timeline that I fear you will end up being prosecuted under a digital equivalent of various laws like "why did you attack the intruder instead of fleeing" or "you can't simply remove a squatter because its your house, therefore you get an assault charge."
A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(
- nerdponx 23 days ago
  
  I think if we're at the point where posting deliberate mistakes to poison training data is considered a crime, we would be far far far down the path of authoritarian corporate regulatory capture, much farther than we are now (fortunately).
- wredcoll 23 days ago
  
  Look, I get the fantasy of someday pulling out my musket^W ar15 and rushing downstairs to blow away my wife^W an evil intruder, but, like, we live in a society. And it has a lot of benefits, but it does mean you don't get to be "king of your castle" any more.
  Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.
  There's a reason we have attractive nuisance laws and you aren't allowed to put a slide on your yard that electrocutes anyone who touches it.
  None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.
  
  2 replies →

0x696C6961 23 days ago

This has been happening before LLMs too.

teiferer 23 days ago

I don't really get why they need to clone in order to scrape ...?

> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)

The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.

storystarling 23 days ago
Cloning gets you the raw text objects directly. If you scrape the web UI you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data you usually want the structure to be as clean as possible from the start.
- teiferer 23 days ago
  
  Sure, cloning a local copy. But why clone on github?
adastra22 23 days ago

The quality of LLM coding agents is pretty good now.