Comment by russdill
19 hours ago
It's not an error though. From its training, it's outputting things most likely to come next. Saying it's an error means that being accurate is a feature and that inaccuracy is a bug that can be fixed.
It's of course not actually hallucinating. That's just the term that's been chosen to describe what's going on.
Like cubic splines, the data will be on the line. Everything in between the points may or may not be true. But it definitely conforms to the formula.
I wonder if it would be possible to quantify a margin of error between different nodes in these models. Even what's 'in between' still conforms to the formula, just not necessarily to what it should be. A simple two-node model should be 'easy' to quantify, but with models that have thousands of nodes, what does it even mean to be +/- x percent from the norm? Is it a simple sum, or something else?
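To make the spline analogy concrete, here's a minimal sketch (assuming numpy/scipy are available; `true_fn` is a made-up stand-in for ground truth). The spline reproduces the known points exactly, and because we happen to know the true curve in this toy case, the deviation in between can actually be measured:

```python
# Illustrative sketch: a cubic spline fitted to samples of a "true" function
# hits the samples exactly, but values between the knots can drift from truth.
import numpy as np
from scipy.interpolate import CubicSpline

def true_fn(x):
    return np.sin(3 * x) * np.exp(-0.3 * x)   # made-up stand-in for ground truth

x_known = np.linspace(0, 5, 8)                 # the known data points
spline = CubicSpline(x_known, true_fn(x_known))

x_dense = np.linspace(0, 5, 500)               # everything "in between"
err = np.abs(spline(x_dense) - true_fn(x_dense))

print("error at the known points:", np.max(np.abs(spline(x_known) - true_fn(x_known))))
print("max error between points: ", err.max())
print("mean error between points:", err.mean())
```

Of course, the whole problem with a model is that you usually don't have the true curve in between, so the quantification question stands.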
Being accurate is a feature, though, and inaccuracy is a bug that can be fixed.
Given two models, one that always produces false statements and another that only sometimes does, the latter is preferable and is the one most people intend to use. Hence the degree to which a model produces correct statements is absolutely a feature.
And yes, it's absolutely possible to systematically produce models that make fewer and fewer incorrect statements.
It's nice that you feel that way, but reality is at odds with your sentiment. Even if the LLM is trained on completely 100% factual, human-checked data, its mechanism is still predicting the next word; it is not a mechanism designed to return only factual data. There is no such thing as an infallible LLM, no matter the model or how it was trained.
Sure, some may return results that are more often true than others, but a broken clock is also right twice a day. The more broken clocks you have, the better the chance that one of them is correct.
No, the user you replied to is correct. Accuracy is indeed a feature, and can be incrementally improved. "Predicting the next word" is indeed a mechanism that can be improved to return increasingly accurate results.
Infallibility is not a feature of any system that operates in the real world. You're arguing against a strawman.
It's nice that you feel that an LLM which generates entirely incorrect statements is just as functional as one that does not, but reality, in terms of which LLMs people will actually use in real life rather than defend for the sake of being pedantic in an Internet argument, is very much at odds with your sentiment.
How a product happens to be implemented using current machine learning techniques is not the same as the set of features that product offers. Actual researchers in this field, the ones not quibbling on the Internet, take this issue very seriously and devote a great deal of effort towards improving it, because they actually care to implement possible solutions.
The feature set, what the product is intended to do based on the motivations of both those who created it and those who consume it, is a broader design/specification goal, independent of how it's technically built.
> It's not an error though
!define error
> 5. Mathematics The difference between a computed or measured value and a true or theoretically correct value.
^ this is the definition that applies. There is a ground truth (the output the user expects to receive) and model output. The difference between model output and ground truth ==> error.
--
> From its training, it's outputting things most likely to come next
Just because a model has gone through training does not mean it won't produce erroneous/undesirable/incorrect outputs at test time.
--
> Saying it's an error means that being accurate is a feature and that inaccuracy is a bug that can be fixed
Machine learning doesn't revolve around boolean "bug" / "not bug". It is a different ballgame. The types of test-time errors are sometimes just as important as the quantity of errors. Two of the simpler metrics for test-time evaluation of natural language models (note: not specifically LLMs) are WER (Word Error Rate) and CER (Character Error Rate). A model with a 3% CER isn't particularly helpful when the WER is 89%. There are still "errors". They're just not something that can be fixed like normal software "errors".
It is generally accepted that some errors will occur in the world of machine learning.
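For what it's worth, here's a minimal sketch of how those two metrics are usually computed, as plain Levenshtein edit distance over words versus characters (the reference/hypothesis pair below is made up for illustration):

```python
# WER/CER as edit distance over words vs. characters.

def edit_distance(ref, hyp):
    # Classic Levenshtein distance: substitutions, insertions, deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

ref = "the model sometimes makes things up"
hyp = "the model sometimes makes thing up"
print(f"WER: {wer(ref, hyp):.2%}, CER: {cer(ref, hyp):.2%}")
```

In this toy pair a single character is missing, so the CER is tiny while the WER takes a full hit for the mangled word, which is exactly why the two numbers have to be read together.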
- edit to add first response and formatting
I don't agree that that's the right definition to use though. LLMs do not output computed or measured values.
If I expect Windows to add $5 to my bank account every time I click the Start button, that's not an error in Windows, it's a problem with my expectations. Windows isn't actually made to do that. The Start button does what it's supposed to do (perhaps a bad example, because the Windows 11 Start menu is rubbish), not my imagined desired behavior.
> LLMs do not output computed or measured values.
LLMs output a vector of softmax probabilities for each step in the output sequence (the probability distribution). Each element in the vector maps to a specific word for that sequence step. What you see as a "word" in LLM output is "the vector position with the 'best' probability in the softmax probability distribution".
And that is most definitely a computed value. Just because you don't see it, doesn't mean it's not there.
https://medium.com/@22.gautam/softmax-function-the-unsung-he...
https://www.researchgate.net/publication/349823091/figure/fi...
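A toy sketch of that last step (numpy; the five-word vocabulary and the logits are invented for illustration, and real models work over tens of thousands of tokens):

```python
# Toy illustration: the "word" you see is just the vocabulary index
# with the highest softmax probability for this step.
import numpy as np

vocab = ["cat", "dog", "the", "sat", "mat"]
logits = np.array([1.2, 0.3, 2.9, 0.1, 1.7])   # raw scores for one output step

probs = np.exp(logits - logits.max())           # softmax, shifted for numerical stability
probs /= probs.sum()

best = int(np.argmax(probs))
print(dict(zip(vocab, probs.round(3))))         # the computed distribution
print("emitted token:", vocab[best])
```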