Comment by CashWasabi

2 months ago

I always wonder what happens when LLMs have finally destroyed every source of information they crawl: after Stack Overflow and the forums are gone, and when there's no open source code left to improve upon. Won't they just cannibalize themselves and slowly degrade?

27 comments

sosodev 2 months ago

That idea is called model collapse https://en.wikipedia.org/wiki/Model_collapse

Some studies have shown that direct feedback loops do cause collapse, but many researchers argue that it's not a risk at real-world data scales.

In fact, a lot of recent advancements in the open-weight model space have been due to training on synthetic data. At least 33% of the data used to train NVIDIA's recent Nemotron 3 Nano model was synthetic. They use it as a way to get high-quality agent capabilities without doing tons of manual work.
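
A toy sketch of the feedback loop described above, not a claim about any real training pipeline: a Gaussian "model" is refit each generation on a sample that mixes fresh draws from the true distribution with draws from the previous fit. The function name `simulate`, the sample size, and the mixing fractions are all assumptions chosen for illustration. With no fresh data the fitted variance tends to shrink toward zero over many generations; with a share of real data mixed in, it tends to stay on the order of the true variance.

```python
import random
import statistics


def simulate(generations, sample_size, real_fraction, seed=0):
    """Refit a Gaussian on a mix of real and self-generated samples.

    Hypothetical toy setup: the "real world" is N(0, 1), and the "model"
    is just a (mean, standard deviation) pair estimated from the data.
    Returns the fitted variance after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # start from a perfect fit of the real distribution
    n_real = round(sample_size * real_fraction)
    for _ in range(generations):
        # Fresh human-like data drawn from the true distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # Synthetic data drawn from the previous generation's fit.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        data = real + synthetic
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma ** 2


if __name__ == "__main__":
    print("100% synthetic:", simulate(1000, 50, real_fraction=0.0))
    print("50% real data: ", simulate(1000, 50, real_fraction=0.5))
```

The only point of the sketch is the contrast between the two settings. A one-dimensional Gaussian says nothing about how large language models behave at real data scales.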

ehnto 2 months ago
That's not quite the same thing, I think. The risk here is that the sources of training information vanish as well, not necessarily the feedback loop aspect.
For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.
If no one asks humans anymore and everyone just asks LLMs instead, then no new discussions between humans occur online, and that experience doesn't get syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty strongly predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.
- pixl97 2 months ago
  
  Then companies will just stick sensors on humans/cars/whatevers to gather information from the real world.
  At the end of the day there is still a huge problem space of reality outside of humans that can be explored and distilled.
  
bandrami 2 months ago

The Habsburgs thought it wouldn't be a problem either
sethops1 2 months ago

Can't help but wonder if that's a strategy that works until it doesn't.

extesy 2 months ago

Synthetic data. Like AlphaZero playing randomized games against itself, a future coding LLM would come up with new projects, or feature requests for existing projects, or common maintenance tasks for itself to execute. Its value function might include ease of maintainability, and it could run e2e project simulations to make sure it actually works.

rmunn 2 months ago

AlphaZero playing games against itself was useful because there's an objective measure of success in a game of Go: at the end of the game, did I have more points than my opponent? So you can "reward" the moves that do well, and "punish" the moves that do poorly. And that objective measure of success can be programmed into the self-training algorithm, so that it doesn't need human input in order to tell (correctly!) whether its model is improving or getting worse. Which means you can let it run in a self-feedback loop for long enough and it will get very good at winning.
What's the objective measure of success that can be programmed into the LLM so it can self-train without human input? (Narrowing our focus to code only for this question.) Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
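
A minimal sketch of the contrast drawn above, under stated assumptions: `game_reward` stands in for a game's final score check, and `code_reward` is a hypothetical proxy that only asks "does it run, and does it pass these tests?". The names `solve`, `TESTS`, `HONEST`, and `GAMED` are invented for illustration. The proxy is easy to automate, but a candidate that hardcodes the tested cases gets the same score as a correct one, which is the evaluation gap in question.

```python
def game_reward(my_points, opponent_points):
    """Objective reward for a game: +1 for more points than the opponent, else -1."""
    return 1 if my_points > opponent_points else -1


def code_reward(source, tests, func_name="solve"):
    """Proxy reward for code: fraction of the given tests that pass.

    Returns 0.0 if the source does not even load. Assumes `tests` is a
    non-empty list of (args, expected) pairs. Toy only: exec() on
    untrusted code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing call counts as a failed test
    return passed / len(tests)


TESTS = [((2, 3), 5), ((0, 0), 0)]

# A correct implementation of addition.
HONEST = "def solve(a, b):\n    return a + b\n"

# Hardcodes the tested inputs and is wrong elsewhere, e.g. solve(1, 1) gives 0.
GAMED = "def solve(a, b):\n    return 5 if (a, b) == (2, 3) else 0\n"

if __name__ == "__main__":
    print(code_reward(HONEST, TESTS), code_reward(GAMED, TESTS))
```

Adding more tests narrows the gap but does not close it, since someone still has to decide what the tests should check.
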
falloutx 2 months ago

You don't need synthetic data. People are posting vibe-coded projects on GitHub every day, and they are being added to the next model's training set. I expect that in 4-5 years, humans just won't be able to do things that aren't in the training set. Anything novel or fun will be locked down to creative agencies and the few holdouts who managed to survive.
chneu 2 months ago

Or it'll create an alternative reality where that AI iterates itself into delusion.

eager_learner 2 months ago

That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the next cycle of training will probably use the old + the_new_AI_slop data, and as a result degrade the final result.

Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.

sosodev 2 months ago
Hallucinations generally don't matter at scale. Unless you're feeding back 100% synthetic data into your training loop it's just noise like everything else.
Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
- imiric 2 months ago
  
  > The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
  Say what? LLMs absolutely cannot do that.
  They rely on armies of humans to tirelessly filter, clean, and label data that is used for training. The entire "AI" industry relies on companies and outsourced sweatshops to do this work. It is humans that extract the signal from the noise. The machine simply outputs the most probable chain of tokens.
  So hallucinations definitely matter, especially at scale. They make the job of humans much, much harder, which in turn will inevitably produce lower-quality models. Garbage in, garbage out.
  
- phyzome 2 months ago
  
  It's only "noise" if it's uncorrelated. I don't see any reason to believe it wouldn't be correlated, though.
  
- intended 2 months ago
  
  LLM content generation is divorced from human limitations and human scale.
  Using human foibles when discussing LLM scale issues is apples and oranges.
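
A small sketch of the "noise only if uncorrelated" point from this subthread, under toy assumptions: many sources report a true value of 10.0, each with independent Gaussian error, optionally plus an error shared by all of them. The function name `mean_error` and all the numbers are invented for illustration. Averaging shrinks the independent part of the error but leaves the shared part untouched.

```python
import random
import statistics


def mean_error(n_sources, shared_bias, seed=0):
    """Absolute error of the average report across n_sources.

    Each source reports truth + shared_bias + its own independent noise.
    Hypothetical setup: truth is 10.0 and the noise is N(0, 1).
    """
    rng = random.Random(seed)
    truth = 10.0
    reports = [
        truth + shared_bias + rng.gauss(0.0, 1.0) for _ in range(n_sources)
    ]
    return abs(statistics.fmean(reports) - truth)


if __name__ == "__main__":
    print("independent errors only:", mean_error(10_000, shared_bias=0.0))
    print("errors share a bias:    ", mean_error(10_000, shared_bias=0.5))
```

Whether LLM-generated errors behave more like the first case or the second is exactly what the commenters disagree about, and the sketch does not settle it.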

grugagag 2 months ago

I guess there'll be less collaboration and less sharing with the outside world. People will still collaborate and share, but within smaller circles. It'll bring an end to the era of the "sharing is caring" internet, as it doesn't benefit anyone but a few big players.

sejje 2 months ago

I bet they'll only train on the internet snapshot from now, before LLMs.

Additional non-internet training material will probably be human created, or curated at least.

pc86 2 months ago

This only makes sense if the percentage of LLM hallucinations is much higher than the percentage of things written online being flat wrong (it's definitely not).
sosodev 2 months ago
Nope. Pretraining runs have been moving forward with internet snapshots that include plenty of LLM content.
- sejje 2 months ago
  
  Sure, but not all of them are stupid enough to keep doing that while watching the model degrade, if it indeed does.

theptip 2 months ago

Does it matter? Hypothetically, if these pre-training datasets disappeared, you could distill from the smartest current model, or have it write textbooks.

layer8 2 months ago

If LLMs had happened 15 years ago, I guess we wouldn't have had the JS framework churn we had.