Comment by eager_learner

2 days ago

That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the next training cycle will probably use the old data plus the new AI slop, and as a result the final model will degrade.

Unless the AIs can figure out where mistakes occur, including in the code they themselves generate, your conclusion seems logically valid.
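
Here's a toy sketch of that feedback loop, purely for intuition (plain Python, made-up numbers, and nothing like a real training pipeline): fit a trivial "model" to some data, sample from the fit, mix the samples back into the training set, and repeat.

    import random
    import statistics

    random.seed(0)
    N = 500
    true_data = [random.gauss(0.0, 1.0) for _ in range(N)]  # the "human-written" corpus: N(0, 1)

    def fit(samples):
        # Stand-in for "training": just estimate the mean and std of the data.
        return statistics.mean(samples), statistics.stdev(samples)

    def run(synthetic_fraction, generations=50):
        mu, sigma = fit(true_data)
        for _ in range(generations):
            n_syn = int(N * synthetic_fraction)
            synthetic = [random.gauss(mu, sigma) for _ in range(n_syn)]  # model output fed back in
            mixed = synthetic + random.sample(true_data, N - n_syn)      # plus whatever original data remains
            mu, sigma = fit(mixed)
        return mu, sigma

    for frac in (0.0, 0.5, 1.0):
        mu, sigma = run(frac)
        print(f"synthetic fraction {frac:.0%}: fitted mean {mu:+.2f}, std {sigma:.2f}")

With a 0% synthetic fraction the estimate stays put; at 100% it compounds its own sampling error every generation and drifts away from the original distribution, with the mixed case somewhere in between.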

Hallucinations generally don't matter at scale. Unless you're feeding 100% synthetic data back into your training loop, it's just noise like everything else.

Is the average human 100% correct about everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.

  • It's only "noise" if it's uncorrelated. I don't see any reason to believe it wouldn't be correlated, though.
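
    A quick toy sketch of why that distinction matters (plain Python, made-up numbers): errors that are independent wash out when you average over enough data, but a shared systematic bias survives no matter how much data you pile on.

        import random
        import statistics

        random.seed(1)
        truth = 10.0
        n = 10_000

        # Uncorrelated errors: each "document" is wrong in its own random direction.
        independent = [truth + random.gauss(0.0, 2.0) for _ in range(n)]

        # Correlated errors: every document shares the same systematic bias (the +1.5 is
        # an arbitrary illustrative number), plus its own random wobble.
        correlated = [truth + 1.5 + random.gauss(0.0, 2.0) for _ in range(n)]

        print("independent errors, averaged:", round(statistics.mean(independent), 2))  # ≈ 10.0
        print("correlated errors, averaged: ", round(statistics.mean(correlated), 2))   # ≈ 11.5

    If models tend to make the same kinds of mistakes, and they plausibly do given how much training data they share, their output looks a lot more like the second case than the first.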

    • Are you sure about that? There's a lot of slop on the internet. Imagine I ask you to predict the next token after reading an excerpt from a blog on tortoises. Would you have predicted that it's part of an ad for boner pills? Probably not.

      That's not even the worst scenario. There are plenty of websites that are nearly meaningless. Could you predict the next token on a website whose server is returning information that has been encoded incorrectly?

  • > The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.

    Say what? LLMs absolutely cannot do that.

    They rely on armies of humans to tirelessly filter, clean, and label the data used for training. The entire "AI" industry relies on companies and outsourced sweatshops to do this work. It is humans who extract the signal from the noise. The machine simply outputs the most probable chain of tokens.

    So hallucinations definitely matter, especially at scale. They make the job of those humans much, much harder, which in turn will inevitably produce lower-quality models. Garbage in, garbage out.

    • I think you're confused about the training steps for LLMs. What the industry generally calls pre-training is when the LLM learns the job of predicting the most probable next token given a huge volume of data. A large percentage of that data has not been cleaned at all, because it comes directly from web crawling. It's not uncommon to open up a web crawl dataset used for pre-training and immediately read something sexual, nonsensical, or both.

      LLMs really do find the signal in this noise: pre-training alone already yields incredible language capabilities, but that's about it. Pre-trained models don't have any of the other skills you would expect, and they most certainly aren't "safe". You can't even really talk to one, because it hasn't been refined into the chat-like interface we're so used to.

      The hard part after that for AI labs was assembling high-quality data that transforms them from raw language machines into conversational agents. That's post-training, and it's where the armies of humans have worked tirelessly to refine the model. That's still valuable signal, sure, but it's not the signal found in the pre-training noise. The model doesn't learn much, if any, of its knowledge during post-training; it just learns how to wield it.

      To be fair, some of the pre-training data is more curated, like collections of math or code.
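
      Schematically, the two stages consume very different things (toy Python with hypothetical data; real pipelines are orders of magnitude larger and messier):

          # Pre-training data: raw crawl text, noise and all. Every document just becomes
          # (context -> next token) prediction examples.
          raw_crawl = [
              "tortoises can live for well over a hundred years",
              "BUY CHEAP PILLS NOW!!! cl1ck h3re >>>",
              "<?xml version='1.0'?><rss><channel><title>...",
          ]

          def next_token_pairs(text):
              tokens = text.split()  # toy "tokenizer"
              return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

          pretraining_examples = [p for doc in raw_crawl for p in next_token_pairs(doc)]

          # Post-training data (e.g. supervised fine-tuning): a much smaller, human-curated
          # set of prompt/response pairs that shapes behaviour rather than adding knowledge.
          sft_examples = [
              {"prompt": "How long do tortoises live?",
               "response": "Some species can live well over a hundred years."},
          ]

          print(len(pretraining_examples), "noisy next-token examples vs",
                len(sft_examples), "curated conversational example")

      The indiscriminate pile is where the model picks up the language and the facts; the curated pile mostly teaches it what to do with them.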

  • LLM content generation is divorced from human limitations and human scale.

    Pointing to human foibles when discussing LLM-scale issues is comparing apples and oranges.