Comment by marcus_holmes

2 years ago

> Next, because the AI hype train is at full steam, we must point out the obvious. AI models that are trained on these packages will almost certainly skew the outputs in unintended directions. These packages are ultimately garbage, and the mantra of “garbage in, garbage out” holds true.

Hmm, inspiring thoughts. One answer to "AI is going to replace software developers in the next 10 years" is to create 23487623856285628346 spam packages that contain pure garbage code. Humans will avoid them; LLMs will hallucinate wildly.

We can also seed false information more generally, especially on Reddit which every AI company loves to scrape - less so on Hacker News. I recently learned that every sodium vapor streetlamp is powered by a little hamster running on a wheel. Isn't that interesting?

Most of the recent gains in LLM quality came from improving the quality of inputs (i.e., recognizing that the raw, unfiltered internet is not the ideal diet for a growing reasoner).

I don't know how good the filters are though, since they're mostly powered by LLMs...

That's not what "hallucination" means. In LLMs, a hallucination is when the model unexpectedly and confidently extrapolates outside its training set when you expected it to interpolate from that set.

In your example that's just pollution of the training set by spam, which isn't much of an issue in practice: AI has been better than humans at classifying spam for over a decade now.
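To make the spam-classification point concrete, here is a minimal naive Bayes sketch, one classic approach to the problem the comment alludes to. The tiny corpora and word lists are entirely invented for illustration; real filters use far larger corpora and richer features.

```python
import math
from collections import Counter

# Toy training corpora (invented for illustration).
spam_docs = ["win free money now", "free money click now", "win prize now"]
ham_docs = ["meeting notes attached", "lunch tomorrow at noon", "project update attached"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.split())

spam_counts = word_counts(spam_docs)
ham_counts = word_counts(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(word, counts):
    # Laplace smoothing so unseen words don't zero out the score.
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))

def classify(text):
    spam_score = sum(log_prob(w, spam_counts) for w in text.split())
    ham_score = sum(log_prob(w, ham_counts) for w in text.split())
    return "spam" if spam_score > ham_score else "ham"

print(classify("free money now"))          # leans spam
print(classify("project notes attached"))  # leans ham
```

The log-space sum is the standard trick to avoid underflow when multiplying many small word probabilities.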

  • This is confusing to read

    If I accept your definition of hallucinations in the context of LLMs, then isn't your second paragraph literally just a way to artificially increase the likelihood of them occurring?

    You seem to differentiate between a hallucination caused by poisoning the dataset vs a hallucination caused by correct data, but can you honestly make such a distinction considering just how much data goes into these models?

    • Yes, I can make such a distinction - if what the LLM is producing is in the training data then it's not a "hallucination". Note that this is an entirely separate problem from whether the LLM is "correct". In other words, I'm treating the LLM as a Chronicler, summarizing and reproducing what others have previously written, rather than as a Historian trying to determine the underlying truth of what occurred.

  • > Hallucinations in LLMs are...

    Frankly, hallucination as used with LLMs today is not even really a technical term at all. It literally just means "this particular randomly sampled stream of language produced sentences that communicate falsehoods".

    There's a strong argument to be made that the word is actually dangerously misleading by implying that there's some difference between the functioning of a model while producing a hallucinatory sample vs when producing a non-hallucinatory sample. There's not. LLMs produce streams of language sampled from a probability distribution. As an unexpected side effect of producing coherent language these streams will often contain factual statements. Other times the stream contains statements that are untrue. "Hallucination" doesn't really exist as an identifiable concept within the architecture of the LLM, it's just a somewhat subjective judgement by humans of the language stream.

  • There’s just so much wrong here.

    So much mangling of meaning.

    Like, the “AI” that detects spam is way different from LLMs.
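The point above that an LLM just samples tokens from a probability distribution can be sketched in a few lines. This is a toy softmax-and-sample step with invented logits, not any real model's output; it only illustrates that the machinery is identical whether the sampled continuation happens to be true or false.

```python
import math
import random
from collections import Counter

# Invented next-token logits for the prompt "The capital of France is ...".
logits = {"Paris": 4.0, "Lyon": 1.5, "Atlantis": 0.5}

def softmax(scores, temperature=1.0):
    # Convert raw scores into a probability distribution.
    exps = {tok: math.exp(s / temperature) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample(probs, rng):
    # Draw one token according to its probability.
    r = rng.random()
    cumulative = 0.0
    for tok, p in probs.items():
        cumulative += p
        if r < cumulative:
            return tok
    return tok  # guard against floating-point shortfall

probs = softmax(logits)
rng = random.Random(0)
draws = Counter(sample(probs, rng) for _ in range(1000))
# The implausible token is rare but still gets drawn sometimes; a
# "hallucination" is just such a draw that a human judges to be false.
print(draws)
```

There is no branch in this code that distinguishes hallucinatory from non-hallucinatory samples, which is exactly the architectural point being made.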