Comment by stalfie

12 hours ago

One thing I wonder about hallucinations, is that it seems on the surface that it is an easy problem for RLVR to target. Since you're already generating enormous amounts of reasoning traces which are verified by correct answers, just have "don't know" as an option as a valid answer, and on problems where none of the thousands of reasoning traces led to a correct answer, just promote the traces that led to the "don't know" answer as training data. Essentially teaching the model that "I don't know" is a valid answer.

Sam Altman himself had a blog post about this a while ago that seemed to suggest this thought, so I guess it's obvious to everyone. But if that is so I assume it's just not as easy in practice.
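A minimal sketch of the selection rule proposed above, assuming each sampled trace ends in a final answer string that can be compared with the verified answer. The `Trace` class, the `ABSTAIN` sentinel, and `select_training_traces` are hypothetical names for illustration, not part of any RLVR framework.

```python
from dataclasses import dataclass

ABSTAIN = "I don't know"  # hypothetical sentinel for the abstention answer


@dataclass
class Trace:
    reasoning: str
    answer: str


def select_training_traces(traces, verified_answer):
    """Pick which sampled traces to promote as training data for one problem."""
    correct = [t for t in traces if t.answer == verified_answer]
    if correct:
        # At least one trace reached the verified answer: promote those as usual.
        return correct
    # No trace was correct: promote the traces that abstained instead.
    return [t for t in traces if t.answer == ABSTAIN]
```

Under this rule, abstaining traces are only promoted on problems where every sampled trace failed, so a problem the model can sometimes solve never rewards "I don't know".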

Because nearly all benchmarks measure "accuracy" by giving you a point for a correct answer, and 0 points for everything else. If you have 100 questions you are 10% certain on, answering "I don't know" to all of those leads to 0 points, while answering all of them as if you are confident leads to an expected value of 10 points. So that's what most AIs are trained to do.
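The arithmetic above, written out as a small sketch. The function name is hypothetical, and the grading rule is the accuracy-only one described in the comment (1 point for a correct answer, 0 for anything else).

```python
def expected_points_accuracy_only(n_questions, p_correct, abstain):
    """Expected total under accuracy-only grading: 1 point per correct answer, 0 otherwise."""
    if abstain:
        return 0.0  # "I don't know" never earns a point
    return n_questions * p_correct


# The scenario from the comment: 100 questions, 10% chance of being right on each.
guess_all = expected_points_accuracy_only(100, 0.10, abstain=False)
abstain_all = expected_points_accuracy_only(100, 0.10, abstain=True)
print(guess_all, abstain_all)
```

Because a wrong answer costs nothing here, guessing is never worse than abstaining, however low the chance of being right.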

AA-Omniscience is the only AI benchmark I know of where randomly guessing gets you a lower average score than answering all questions with "I don't know"
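A rough sketch of an index of this kind, assuming the +100 / 0 / -100 scoring described elsewhere in this thread. This is not the benchmark's official implementation; the function name and the 25% guess accuracy are assumptions for illustration.

```python
def omniscience_style_index(correct, incorrect, abstained):
    """Average of +100 per correct, -100 per incorrect, 0 per abstention."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total


# Hypothetical guesser that is right 25% of the time and never abstains.
guessing = omniscience_style_index(correct=25, incorrect=75, abstained=0)
# Answering "I don't know" to everything.
abstaining = omniscience_style_index(correct=0, incorrect=0, abstained=100)
print(guessing, abstaining)
```

With symmetric reward and penalty, answering only beats abstaining when the chance of being right exceeds 50%, which is why low-accuracy guessing lands below zero.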

  • AA-Omniscience Index gives +100 for correct, 0 for "I don't know" and -100 for incorrect.

    For your scenario the always-confident strategy will give an average of -80 (10 correct at +100 and 90 incorrect at -100, averaged over 100 questions). Saying "I don't know" to all will give 0.

    A lot of models have negative AA-Omniscience Index.

    They also do have AA-Omniscience Accuracy and AA-Omniscience Hallucination Rate that handle "I don't knows" differently.

    https://artificialanalysis.ai/evaluations/omniscience

  • It should be 1 for correct, 0 for don't know and -1 for wrong.

    They are much better incentives. In real life a wrong answer is much more damaging than a don't know.

    • "AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct."

      https://artificialanalysis.ai/evaluations/omniscience

    • See, this, to me, seems obvious, but I’m sure it’s more challenging/complex than I can imagine (I am NOT an expert on AI in any way imaginable). But there has to be a solution. Just yesterday I was asking Gemini to tell me about a certain college professor, and it gave me a list of facts about them. And it was perfect. Then, out of curiosity, I followed up with “tell me more about him!” and it spit out several more bits of information about this person that were entirely hallucinated (e.g., gave them credit for writing papers they didn’t write, said they won awards that actually someone else won). I know this is all complex and certainly beyond my limited skill set, but goodness, we’ve got to get this figured out with so many people depending on and trusting these things nowadays. It’s quite scary.

    • Maybe some extra buckets could be added like depending on whether the answer ought to be known. Or, quality of the justification. “I don’t know and here’s a good reason why” is much better than “idk.” Correctly identifying that something is fundamentally unknown/unknowable is probably better than a simply-correct answer, even, right?

    • And also because it creates "one neat trick" where it can answer "I don't know" for many/most things and still get credit.

    • > In real life a wrong answer is much more damaging than a don't know.

      I don't know. Is it?

The main problem here is that hallucination suppression doesn’t generalise. We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations. With current architectures, hallucinations will likely persist on open-domain tasks forever.

  • > We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations

    I don't think anyone is trying to add "a coherent worldview" by reducing hallucinations; I'm not sure how that could even realistically be the aim.

    What people want, is for the models to stop giving confident answers that are clearly incorrect. Yes, it won't lead to "a coherent worldview", but it'll at least stop wasting people's time if the model said "You know what, what you said doesn't make sense / isn't clear, is what you mean .... ?" or even "I'm not sure" or "I don't know".

    Currently, if you have the wrong starting point and ask the model to do something, it more often than not just goes ahead and does that, misunderstandings or not. Models seem optimized to never push back unless you prompt for that, and most seem to favor "I'm just gonna assume X" rather than taking a step back and figuring out how to not assume. Again, that is unless you prompt against that behaviour or steer it into a different workflow.

I think the trouble is in the outputs of the LLM and how it's interpreted by the tooling. The output is a distribution of probabilities of all possible next tokens. Even if the probability of every token is very low, the output gets normalized so that the sum of all probabilities is 1. So after that step, it's hard to see if the model was strongly preferring certain tokens or if you're just looking at amplified noise.
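A small sketch of the normalization effect described above, reading "probability" in the comment as the raw pre-normalization score (logit). The logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Same gaps between tokens, very different absolute scores.
weak = [-10.0, -10.5, -9.5]   # every raw score is low
strong = [5.0, 4.5, 5.5]      # every raw score is high

print(softmax(weak))
print(softmax(strong))
```

Softmax only depends on the differences between scores, so both inputs give the same distribution. Once the output is normalized, the absolute level of the raw scores is no longer visible.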

Training an extra "don't know" token means you have to build a moat between every other token. Between "yes" and "no", instead of a muddled, noisy area where both "yes" and "no" have relatively high probabilities, you need a new peak where "don't know" is higher. Then you just have new muddled areas between "yes" and "don't know", and between "don't know" and "no". That requires even more finesse to train another answer in between.

Instead, you could check whether multiple options are about equally likely. But then you have to check if they are actually synonyms, like are the top two choices "Genève" and "Geneva", which is a good sign that the model knows the answer? Or are the top two "yes" and "no"?
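One way the check above might look, as a sketch: merge the probability mass of known synonyms first, then look at the gap between the top two remaining options. The alias table and all probabilities are hypothetical; building a reliable alias table is the hard part and is not solved here.

```python
def merge_aliases(probs, aliases):
    """Sum the probability of tokens that map to the same canonical answer."""
    merged = {}
    for token, p in probs.items():
        canonical = aliases.get(token, token)
        merged[canonical] = merged.get(canonical, 0.0) + p
    return merged


def top2_margin(probs):
    """Gap between the most likely and second most likely option."""
    ranked = sorted(probs.values(), reverse=True)
    if len(ranked) < 2:
        return ranked[0] if ranked else 0.0
    return ranked[0] - ranked[1]


ALIASES = {"Genève": "Geneva"}  # hypothetical synonym table

city = {"Genève": 0.45, "Geneva": 0.44, "Zurich": 0.11}
yes_no = {"yes": 0.45, "no": 0.44, "maybe": 0.11}

city_margin = top2_margin(merge_aliases(city, ALIASES))      # large gap after merging
yes_no_margin = top2_margin(merge_aliases(yes_no, ALIASES))  # tiny gap, real uncertainty
print(city_margin, yes_no_margin)
```

Both raw distributions look identical before merging. Only the synonym step separates "the model knows the answer" from "the model is torn between opposites".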

It’s not as simple. I trained an LLM before on exactly this, to scratch the itch of this question.

The task was simple. Using the MS-MARCO[0] dataset, which contains queries, search results, and answers, I made a training set that has:

1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer

2. Questions paired only with irrelevant results, with the answer “No answer present”
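The two kinds of training example above could be assembled roughly like this. This is a sketch under assumptions, not the commenter's actual pipeline: the function, its arguments, and the output field names are hypothetical, and the inputs are plain Python lists rather than the real dataset schema.

```python
import random

NO_ANSWER = "No answer present"


def build_example(query, answer, supporting, irrelevant, answerable,
                  n_distractors=2, rng=None):
    """Build one training example of type 1 (answerable) or type 2 (unanswerable)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    distractors = rng.sample(irrelevant, min(n_distractors, len(irrelevant)))
    if answerable:
        # Type 1: real supporting results mixed with some irrelevant ones.
        results = list(supporting) + distractors
        target = answer
    else:
        # Type 2: only irrelevant results, so the target is the abstention string.
        results = distractors
        target = NO_ANSWER
    rng.shuffle(results)
    return {"query": query, "results": results, "answer": target}
```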

The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (good answer contrasted with a bad answer for the same user query) to GRPO (verifier that checks my annotations whether an answer was present or not)
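For the GRPO step, a verifier of the kind described could be as simple as the sketch below. It is an assumption about the reward shape, not the commenter's code: it only checks whether the model's decision to abstain matches the annotation, and the function name is hypothetical.

```python
NO_ANSWER = "No answer present"


def abstention_reward(completion, answer_present):
    """Return 1.0 when the abstain/answer decision matches the annotation, else 0.0."""
    said_no_answer = completion.strip().lower().startswith(NO_ANSWER.lower())
    if answer_present:
        # An answer exists in the results: abstaining is the wrong call.
        return 0.0 if said_no_answer else 1.0
    # No answer exists in the results: abstaining is the right call.
    return 1.0 if said_no_answer else 0.0
```

A reward like this does not check whether a non-abstaining answer is actually correct, so it would need to be combined with an answer check in practice.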

Lo and behold, this didn’t reduce hallucination; rather, it made it much worse. The model started claiming “No answer present” even when an answer was present, or even when the question didn’t need search results in the first place (simple stuff like what is X+Y).

Now you could argue that my training was basic compared to what frontier labs could do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (list the search results, check the relevance of each result to the user query, and if results are below a certain threshold of relevance then don’t use them to answer …).

tl;dr: not as simple as one might think, perhaps not attainable at all.

0: https://huggingface.co/datasets/microsoft/ms_marco

  • Thank you for sharing! Based on your experience, do you think a two-model system might fare better? For example, two models in serial where the second model is trained to "sniff out" potential hallucinations and fact check them (and possibly iterate with the first model)?

    • I do think it might improve but only marginally.

      You are however likely to observe better results in smaller models since they're usually more strapped for "cognitive capacity", so two separate calls reduce the load in each request, and hallucination in my experience is a common side effect of overloading an LLM cognitively.

If you could write that reward function you wouldn't need an LLM, you'd just query the reward function to answer any question. You can create a benchmark and check that automatically, but you can't solve this in the general case. The model can do well on the benchmark but still give overconfident answers in areas the benchmark doesn't cover.

You can definitely tune a model to say "I don't know" more often but it will cost you performance, the model will reject some questions that it could answer meaningfully. In the degenerate case the model could collapse predicting that sequence always or almost always.
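The trade-off above can be illustrated with a confidence threshold. This is a sketch with made-up data: it assumes each prediction comes with some confidence score, which is itself a hypothetical input, and the function name is illustrative.

```python
def coverage_and_accuracy(predictions, threshold):
    """predictions: list of (confidence, is_correct). Abstain below the threshold."""
    answered = [ok for conf, ok in predictions if conf >= threshold]
    coverage = len(answered) / len(predictions)
    accuracy = sum(answered) / len(answered) if answered else None
    # Correct answers the model would have given but now refuses to give.
    lost_correct = sum(1 for conf, ok in predictions if conf < threshold and ok)
    return coverage, accuracy, lost_correct


preds = [(0.95, True), (0.90, True), (0.70, True),
         (0.60, False), (0.40, True), (0.30, False)]

for t in (0.0, 0.65, 1.01):
    print(t, coverage_and_accuracy(preds, t))
```

Raising the threshold improves accuracy on what is answered but throws away a correct low-confidence answer, and a threshold above every confidence score is the degenerate case where the model always says "I don't know".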

  • I guess so. Just to be clear, I was talking about post-training methods for reasoning models here, not pre-training. I think "model as a judge" should actually do okay as a "sentiment analysis" style reward for expressing uncertainty. So if none of the thousands of reasoning traces you generate reach the validated answer, you run a judge to rate uncertainty and put those reasoning traces back into the training pool.

    But I guess my logic breaks down here a bit, because if there is such a thing as a validated answer, then the correct answer is in fact never uncertainty. The correct answer is to continue post training until the model gets it right. So perhaps the real answer is to create RLVR tasks where the valid answer is "I don't know" and nothing else, like this benchmark does. Or maybe that doesn't work either, no matter how many you create.

    I feel as though there is some kind of philosophical lesson to be had from how hard hallucinations are to get rid of. Maybe, similarly to humans, successful models are often "arrogant" in a sense. Perhaps you just never solve an Erdős problem without some degree of self-deception that it's possible for you to do so. In this line of thinking, greatness in humans is actually not related to humility, but just being so good that you actually get things right when you try. Expressing humility is of course something great people tend to do, but I'm referring to what happens under the hood.

    If you squint a bit, that's kinda the trend with models. The useful ones are not that much less likely to hallucinate, they are just good enough that they tend to get it right. This comparison is of course probably not even remotely correct, but at least it's fun to anthropomorphize a bit.

If we had a theoretical technique to identify the true and objective reality we'd use it in the courts and laboratories. There is no such technique, but what we do have is 2 techniques that seem to work:

1) Has a certain standard of evidence been met?

2) Are the related arguments free of logical inconsistencies?

We can train the LLMs to do 2, and maybe even 1 to some extent (exactly what quality of evidence a computer can practically gather is limited). But that isn't going to get rid of hallucinations, for the same reason courts are hit-and-miss or the conclusions of studies often aren't very reliable. These techniques help, but sometimes they still get people to say things that, on close inspection, turn out to be nonsense. And those best-effort approaches are too much to expect for most questions an LLM will be handed which are informal, low stakes and don't need strong supporting evidence or logical rigour.

I think it is underestimated how many LLM-style hallucinations people themselves have. It just isn't obvious because most humans have a strategy of only repeating what the herd says after it has been socially vetted, which makes their individual eccentricities less obvious.

TLDR; I don't think it looks like an easy problem for RLVR, it looks technically unsolvable. Even making progress requires a philosophical breakthrough on the nature of truth so that the objective function can be established.

  • Well, I'd argue that this depends on the field you're investigating. Sometimes you have a way to identify objective reality and sometimes you don't. In mathematics the majority of the field is verifiable in this way. Coding a bit less, as it's intersubjective and the ideal methodology is subject to taste.

    But even in muddy fields of reality like medicine, there are objective facts to be found. When someone comes into an ER with chest pain, you often find a true, undeniable reason for why that is happening. If their lung has collapsed, a coronary artery is clogged or the aortic artery is dissecting, even if you don't find that out it tends to be clear in retrospect. The area of reality that becomes muddy is when we use proxy signals to try to figure out who gets promoted to expensive/harmful examinations we can make final conclusions from, or the cases that don't fit cleanly into one bucket or the other. But very often, the gold standard truly is golden.

    Of course, many realms of reality cannot be verified in this way. But I'd argue that there are quite a few that can.

    • > In mathematics the majority of the field is verifiable in this way.

      Does mathematics count as not a hallucination though? Particularly in pure mathematics they take a certain pride coming up with wild concepts as unrooted as possible in anything relevant to human existence. The name of the game is purely about maintaining internal logical consistency - which is something an AI can do while hallucinating.

      AI hallucinations in maths might be logically consistent or not be. But in that particular case it starts to get a bit iffy what we call it when someone imagines something that doesn't exist. This gets back to the thing where we can train AIs to be logically consistent, but we can't force that consistency to be grounded in any particular universe. Ie, it'll hallucinate but in a very well rationalised way - coincidentally mimicking how a number of mathematicians seem to approach life.

      This is the central issue; there is a very real trade-off between facts and verifiability. Mathematics is perfectly verifiable because it is fact free. We don't have a reliable general system to verify facts. We do have reliable systems for checking arguments (logic).

But if an LLM says "I don't know" should you pay for the tokens?

  • Why not? It did the work. Why should you expect it to be omniscient?

    We can rank them based on how much they know and people will gravitate towards those that do know more.

    It's a market after all.

  • "I don't know" has positive value, presumably you could prompt further to learn more about where it got stuck. It also increases the value of correct answers, by improving confidence that answers are actually correct.

    "Confidently incorrect" has negative value. At best, a human realizes the answer is wrong; at worst, the incorrect information is not identified and can cause untold damage. By having the potential to be so severely wrong, it lessens the value of correct answers because there is a lower confidence value on their output.

  • Depends on what your understanding of the product is.

    If someone sold you a "Solved all your problems" machine, and it suddenly doesn't solve all your problems, then probably no, you shouldn't pay.

    But the way I'm being sold LLMs is basically "A text generator that gives you plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input", so regardless of what the outcome is, I still made use of the "Input > Output" part, which is what I bought into, so I should still pay for that.

    Now of course a bunch of people will say they've been sold the former, but the companies themselves seem to be selling the latter. That's my perspective as a person who doesn't follow "influencers" and whatnot though, who seem to be selling the public on the former rather than the latter.

    • Let's pretend I am someone who has heard people talk about ChatGPT, but has no idea what it actually is. I go to the website and am not presented with any information, just a prompt. So I ask it what it is and what it can do for me.

      My ask:

      > In a couple sentences, explain to me the product I'm being sold with ChatGPT. What does it do for me?

      The Reply from ChatGPT:

      > ChatGPT is a conversational AI that helps you think, create, learn, analyze, and get things done faster. You can use it to answer questions, draft and edit writing, summarize information, brainstorm ideas, learn new topics, write code, plan projects, and increasingly act as an assistant that can search for information, work with documents, generate images, and help complete tasks.

      > In simple terms: you're buying access to an AI that turns natural language into useful work—saving time, expanding your capabilities, and giving you an always-available collaborator for both everyday tasks and specialized knowledge work.

      This sounds much more like the former, a "solve all your problems" machine.... not a plausible-sounding text generation machine.

      Only two weeks ago Sam Altman said their new data center "could" be where cancer gets cured[0]. It is only the people who deeply understand AI who see it as a text generator of plausible-sounding text. That isn't what the marketing department, the CEO, or the product itself seem to be saying. I'm using OpenAI as the example here, but the others don't seem much different.

      [0] https://www.youtube.com/watch?v=9-tOtbDDrJA

    • The marketing materials are very much the former though. From claude.com:

      > If you can dream it, Claude can help you do it. Claude can process large amounts of information, brainstorm ideas, generate text and code, help you understand subjects, coach you through difficult situations, simplify your busywork so you can focus on what matters most, and so much more.

      What marketing copy have you read for LLMs that is like you mentioned?

      > But the way I'm being sold LLMs, is basically "A text generator that gives you plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input"

  • I would be very willing to pay more! The choice between “you may get a correct answer, or you may get lied to, without a clear way to distinguish between the two” and “you may get a correct answer, or a clear indication that the answer was not found” is pretty clear. One is a much more useful tool than the other. I don’t see any real incentives for companies making LLMs to keep their AI factually unreliable. (Full disclosure: I work for one, but I’m definitely not in the rooms where such decisions would be made.)

  • 'I don't know' is the correct answer for infinitely more questions than those that can be answered.

the problem is the null answer will stop the "markov" chain.

so, that's all.

  • You don't have to literally send a null token. Train it to generate text that summarizes the evidence that is there while conveying the uncertainty of the final answer to the prompt.

  • Transformers are not Markovian. Their whole point is arguably to be the reverse of Markovian: to efficiently make the new tokens a function of all previous tokens.