Comment by teiferer

5 hours ago

If you ever wonder how coding agents know how to plan things etc., this is the kind of article they get that training from.

Ends up being circular if the author used LLM help for this writeup, though there are no obvious signs of that.

Interestingly, I looked at GitHub Insights and found that this repo had 49 clones from 28 unique cloners before I published this article. I definitely did not clone it 49 times, and certainly not as 28 unique users. It's unlikely that the handful of friends who follow me on GitHub all cloned the repo, so I can only speculate that there are bots scraping new public GitHub repos and training on everything.

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
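
For anyone who wants to check their own repos, the same numbers Insights shows are also exposed through GitHub's traffic REST API. A minimal sketch in Python (assuming a token with push access to the repo; OWNER/REPO and the token env var are placeholders):

     import os
     import requests  # third-party: pip install requests

     repo = "OWNER/REPO"  # placeholder
     resp = requests.get(
         f"https://api.github.com/repos/{repo}/traffic/clones",
         headers={
             "Accept": "application/vnd.github+json",
             "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
         },
         timeout=10,
     )
     resp.raise_for_status()
     data = resp.json()
     # the traffic endpoint only covers roughly the last 14 days
     print(f"{data['count']} clones from {data['uniques']} unique cloners")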

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in Rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by ChatGPT, along with comparisons to similar algorithms).

  • I self-host Gitea. The instance is crawled by AI crawlers (I checked the IPs). They never clone; they just browse and take the code directly from there.

    • For reference, this is how I block them in my Caddyfile:

         (block_ai) {
             @ai_bots {
                 header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
             }
      
             abort @ai_bots
         }
      

      Then, in a specific site block, include it via:

         import block_ai

    • I run a cgit server on an R720 in my apartment with my code on it, and that puppy screams whenever Sam wants his code.

      Blocking OpenAI IPs did wonders for the ambient noise levels in my apartment. They're not the only ones, obviously, but they're the only ones I had to block to stay sane.


  • Particularly on GitHub, it might not even be LLMs, just regular bots looking for committed secrets (AWS key pairs, passwords, etc.).

  • Time to start including deliberate bugs. The correct version is in a private repository.

    • While I think this is a fun idea, we are in such a dystopian timeline that I fear you would end up being prosecuted under a digital equivalent of laws like "why did you attack the intruder instead of fleeing?" or "you can't simply remove a squatter just because it's your house, so you get an assault charge."

      A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile, if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible), I'll go to jail for pointing it out. :(

  • I don't really get why they need to clone in order to scrape...?

    > It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

    That's very much expected. That's why the quality of LLM coding agents is what it is. (No offense.)

    The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.

Maybe we can poison LLMs with loops of two or more self-referencing blogs.

  • You only need one; they're not thinking critically about the media they consume during training.

    • Here's a sad prediction: over the coming few years, AIs will get significantly better at critical evaluation of sources, while humans will get even worse at it.


    • The secret sauce behind good understanding, taste, and style (both for coding and for writing) has always been in the fine-tuning and RLHF steps. I'd be skeptical that the signal a few GitHub repos or blogs generate at the initial stages of learning is that critical. There's probably also a quality filter on the initial training set, and these sets are so large that not even a single full epoch over the data is done these days.

I understand that model output fed back into training would be an issue, but if the model output is guided by multiple prompts and edited by the author to his/her liking, wouldn't that at least be marginally useful?

Random aside about training data:

One of the funniest things I've started to notice from Gemini in particular is that in random situations, it speaks English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.

  • There was a really great article or blog post published in the last few months about the author's very personal experience, whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse, because I grew up in X where people are taught formal English to sound educated/Western, and those areas are now heavily used for LLM training."

    I wish I could find it again, if someone else knows the link please post it!

  • That's very interesting. Any examples you can share that have that agreeable affect?

    • I'm going to do a cursory look through my antigrav history; I want to find it too. I remember it's primarily in the exclamations of agreement/revelation, and one time expressing concern, which I remember felt slightly off from natural for an American English speaker.


> Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.