
Comment by D-Machine

12 hours ago

Common misconception. As far as we know, LLMs are not calibrated, i.e. their output "probabilities" are not necessarily correlated with actual error rates, so you can't use e.g. the softmax values to estimate confidence. That is why it is more accurate to talk about the model's "logits", "softmax values", "simplex mapping", "pseudo-probabilities", or, even more agnostically, just "output scores", unless you actually have strong evidence of calibration.
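For concreteness, here is a minimal sketch of what those "output scores" look like in practice, assuming a Hugging Face causal LM (the model name and prompt are just placeholders):

```python
# Minimal sketch: reading next-token "output scores" from a causal LM.
# "gpt2" and the prompt are placeholders, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]                # raw scores for the next token
next_token_scores = torch.softmax(next_token_logits, dim=-1)  # simplex mapping

top = torch.topk(next_token_scores, k=5)
for score, idx in zip(top.values, top.indices):
    # These are softmax values, not calibrated probabilities of being correct.
    print(f"{tokenizer.decode(idx.item())!r}: {score.item():.3f}")
```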

To get calibrated probabilities, you actually need to use calibration techniques, and it is extremely unclear whether any frontier models are doing this (or even how calibration can be done effectively in fancy chain-of-thought + MoE models, and/or in RLVR- and RLHF-based training regimes). I suppose if you get into things like conformal prediction you could ensure some calibration, but that is likely too computationally expensive and/or has other undesirable side-effects.
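To make "calibrated" concrete: one standard check is to bin answers by their reported confidence and compare each bin's average confidence against its actual accuracy (expected calibration error). A minimal sketch, where the confidence and correctness arrays are stand-ins for data you would have to collect against ground truth:

```python
# Minimal sketch: expected calibration error (ECE) over binned confidences.
# `confidences` and `correct` are placeholders for real evaluation data.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # what the model claimed
        accuracy = correct[mask].mean()       # what actually happened
        ece += mask.mean() * abs(avg_conf - accuracy)
    return ece

# A calibrated model's 0.9-confidence answers should be right ~90% of the time.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 0, 1, 1]))
```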

EDIT: Oh, and there are also anomaly detection approaches, which attempt to identify when we are in outlier space using various metrics (e.g. distances) computed on the embeddings, but even getting actual probabilities out of these is tricky. This is why it is so hard to get models to say they "don't know" with any kind of statistical certainty: that information isn't generally "there" in the model in any clean sense.
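A sketch of the kind of distance-based outlier scoring being described, assuming you already have embeddings for a reference set and for the query; note the result is an outlier score, not a probability:

```python
# Minimal sketch: k-nearest-neighbour distance in embedding space as an
# outlier score. The embeddings are random placeholders standing in for
# whatever encoder you actually have access to.
import numpy as np

def knn_outlier_score(query_emb, reference_embs, k=10):
    # Euclidean distance from the query to every reference embedding.
    dists = np.linalg.norm(reference_embs - query_emb, axis=1)
    # Mean distance to the k nearest neighbours: larger = more outlier-ish.
    return np.sort(dists)[:k].mean()

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 768))   # placeholder reference embeddings
query = rng.normal(size=768)               # placeholder query embedding
print(knn_outlier_score(query, reference))
# Converting this score into an actual probability of being out of
# distribution is exactly the hard part described above.
```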

I don't know if we are talking past each other, but I don't think this conversation is about absolute probabilities? The question is about relative uncertainty, and the softmax values are just fine for that.

It is too computationally expensive, which is why nobody does this for production inference. But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs.

  • > The question is about relative uncertainty, and the softmax values are just fine for that.

    They really aren't, especially if you consider the chain-of-thought / recursive application case, and also that you can't even assume e.g. a difference of 0.1 in softmax values means the same relative difference from input to input, or that a 0.9 is always "extremely confident", etc. You really have no idea unless you test the calibration explicitly on calibration data.

    > But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs

    You can get embeddings; if you can get calibrated probabilities, you'll need to provide a citation, because that would be a huge deal for all sorts of applications.

    • Relative probabilities. That means comparing 2+ alternatives, and we're only talking about the model's worldview, not objective reality. The math for that is relatively straightforward. "Yes" could be 0.9, and OK, that means nothing. But if we artificially constrain outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed, but you can derive an equation to convert it into normalized probabilities; see the sketch at the end of this subthread.]

      And now I'm certain we're talking past each other. I'm not talking about calibrated probabilities at all, just the notion of "how confident do I feel about this?", which is what I interpreted the question above to be about. You can get that out of an LLM, with some work.

      4 replies →
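A minimal sketch of the two-way renormalization described above, assuming access to the full next-token logits and that "Yes" and "No" each map to a single token id (both simplifications):

```python
# Minimal sketch: restrict the next-token distribution to "Yes"/"No" and
# renormalize. Because the softmax partition function cancels, this is the
# same as taking a softmax over just the two logits.
import numpy as np

def yes_no_scores(logits, yes_id, no_id):
    two = np.array([logits[yes_id], logits[no_id]])
    two = two - two.max()                    # numerical stability
    p = np.exp(two) / np.exp(two).sum()      # softmax over the two options only
    return {"Yes": float(p[0]), "No": float(p[1])}

# Placeholder logits and token ids, just to show the shape of the computation.
rng = np.random.default_rng(0)
logits = rng.normal(size=50_000)
print(yes_no_scores(logits, yes_id=123, no_id=456))
# The result is relative confidence between the two options under the model's
# own distribution, not a calibrated probability of being right.
```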

I don't think it's that hard to get them to say "I don't know"

I'm pretty sure they are actively trained to avoid it.

Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?

  • > I'm pretty sure they are actively trained to avoid it.

    I'm not sure who is doing what training exactly, but I can say that (inconsistently!) some of my attempts to get it to solve problems that haven't actually been solved yet, e.g. the Collatz conjecture, end with it saying it doesn't know how to solve the problem.

    Other times it absolutely makes stuff up; fortunately for me, my personality includes actually testing what it says, so I didn't fall into the sycophantic honey trap and take it seriously when it agreed with my shower thoughts, and definitely didn't listen when it identified a close-up photo of some Solanum nigrum growing next to my tomatoes as also being tomatoes.

    > Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?

    I'd rather it said "IDK" than made some stuff up. Them making stuff up is, as we have seen from various news stories about AI, dangerous.

    • "Well-unknown" questions are maybe the one situation where LLMs will say "I don't know", simply because of all the overwhelming statements in its training data referring to the question as unknown. It'd be interesting to see how LLMs would adapt to changing facts. Suppose the Collatz conjecture was proven this year, and the next the major models got retrained. Would they be able to reconcile all the new discussion with the previous data?

  • It's not hard to get them to say "I don't know", and they will do so regularly. It's hard to get them to say "I don't know" reliably (i.e. to say it when they don't actually know and not say it when they do know). And in general, even statements or tasks they do 'know' (i.e. normally get right), they will occasionally get wrong.