Comment by teiferer
7 hours ago
I don't really get why they need to clone in order to scrape ...?
> It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)
The "asking LLMs for advice" part is where it starts to get circular. That's no worse than looking at StackOverflow, though, where answers link to other people who in turn turned to StackOverflow for advice.
Cloning gets you the raw file contents directly. If you scrape the web UI, you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data you usually want the input to be as clean as possible from the start.
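To make that concrete, here's a rough sketch in Python of what the clone route could look like. This is only my guess at the general shape, not how any actual crawler works. The names clone_shallow and iter_source_files are made up for illustration, and treating anything that doesn't decode as UTF-8 as binary is a simplifying assumption.

```python
import subprocess
from pathlib import Path


def clone_shallow(repo_url: str, dest: Path) -> None:
    """Fetch only the latest snapshot; full history isn't needed for file contents."""
    subprocess.run(["git", "clone", "--depth", "1", repo_url, str(dest)], check=True)


def iter_source_files(root: Path):
    """Yield (relative_path, text) for each UTF-8 decodable file outside .git."""
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or ".git" in rel.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # assumption: undecodable files are binary, so skip them
        yield rel.as_posix(), text
```

After a clone, every file is already plain text on disk, so there's no HTML to parse or strip. With a scrape you'd have to fetch one page per file and then dig the code back out of the markup.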
The quality of LLM coding agents is pretty good now.