
Comment by sosodev

2 days ago

That idea is called model collapse: https://en.wikipedia.org/wiki/Model_collapse

Some studies have shown that direct feedback loops do cause collapse, but many researchers argue that it's not a risk at real-world data scales.
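
A toy illustration of that first claim (my own sketch in Python, not taken from any of those studies): repeatedly refit a Gaussian to samples drawn only from the previous fit, and the variance drifts toward zero because the tails are undersampled every round.

    import numpy as np

    rng = np.random.default_rng(0)

    # Start from the "real" distribution: a standard normal.
    mu, sigma = 0.0, 1.0
    n = 10  # deliberately tiny samples so the effect shows up quickly

    for gen in range(1, 101):
        # Each generation "trains" only on the previous generation's outputs.
        samples = rng.normal(mu, sigma, size=n)
        mu, sigma = samples.mean(), samples.std()
        if gen % 10 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")

    # sigma decays toward zero: each refit undersamples the tails, so
    # variance is lost and never recovered -- a toy "model collapse".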

In fact, a lot of recent advancements in the open-weight model space have been due to training on synthetic data. At least 33% of the data used to train Nvidia's recent Nemotron 3 Nano model was synthetic. They use it as a way to get high-quality agent capabilities without doing tons of manual work.
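
For contrast, a sketch of why mixing might help (again my own toy, nothing to do with Nvidia's actual pipeline): fold fresh real-world samples into every generation and the collapse above disappears, which is roughly the intuition behind "it's not a risk at real-world data scales."

    import numpy as np

    rng = np.random.default_rng(0)

    mu, sigma = 0.0, 1.0
    n = 10

    for gen in range(1, 101):
        synthetic = rng.normal(mu, sigma, size=n)    # the model's own outputs
        real = rng.normal(0.0, 1.0, size=n)          # fresh ground-truth data
        samples = np.concatenate([synthetic, real])  # 50/50 synthetic/real corpus
        mu, sigma = samples.mean(), samples.std()
        if gen % 20 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")

    # sigma now hovers near 1 instead of collapsing: in expectation the
    # update is sigma^2 -> (sigma^2 + 1) / 2, whose fixed point is 1,
    # so the real data keeps anchoring the model.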

That's not quite the same thing, I think. The risk here is that the sources of training information vanish as well; it's not necessarily the feedback-loop aspect.

For example, all the information on the web could be said to be a distillation of human experience, and much of it ended up online through discussions that happened during problem solving. Questions were asked of humans, and they answered with knowledge drawn from the real world and years of experience.

If no one asks humans anymore, and they just ask LLMs instead, then no new discussions between humans occur online, and that experience doesn't get captured in a way models can train on.

That is essentially the entirety of Stack Overflow's existence until now. You can pretty confidently predict that no new software experience will be added to Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack that nuanced information.

  • Then companies will just stick sensors on humans/cars/whatevers to gather information from the real world.

    At the end of the day, there is still a huge problem space of reality outside of humans that can be explored and distilled.