Comment by adastra22
17 hours ago
They do know what they don't know. There's a probability distribution for outputs that they are sampling from. That just isn't being used for that purpose.
Common misconception. As far as we know, LLMs are not calibrated, i.e. their output "probabilities" are not necessarily correlated with actual error rates, so you can't use e.g. the softmax values to estimate confidence. This is why it is more accurate to talk about the model's "logits", "softmax values", "simplex mapping", "pseudo-probabilities", or, even more agnostically, just "output scores", unless you actually have strong evidence of calibration.
To get calibrated probabilities you actually need to use calibration techniques, and it is very unclear whether any frontier models are doing this (or even how calibration can be done effectively in chain-of-thought + MoE models, or in RLVR- and RLHF-based training regimes). I suppose if you get into things like conformal prediction you could ensure some calibration, but that is likely too computationally expensive and/or has other undesirable side effects.
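To make "calibration technique" concrete, here is a minimal sketch of the classic one, temperature scaling on a held-out labelled set; the function names are mine, not from any particular library:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(temperature, logits, labels):
    # Negative log-likelihood of the true labels under a temperature-scaled softmax.
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), np.asarray(labels)].mean()

def fit_temperature(logits, labels):
    # Fit the single scalar T on a held-out calibration set; dividing logits
    # by T before the softmax is the entire "calibration" step.
    res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                          method="bounded")
    return res.x
```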
EDIT: There are also anomaly-detection approaches, which attempt to identify when we are in outlier space using various metrics (e.g. distances) over the embeddings, but even getting actual probabilities out of these is tricky. This is why it is so hard to get models to say they "don't know" with any kind of statistical certainty: that information generally isn't actually "there" in the model in any clean sense.
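The anomaly-detection idea is, very roughly, a distance score over embeddings, something like the sketch below; k and the cosine metric are arbitrary choices, and it yields a score, not a probability:

```python
import numpy as np

def knn_outlier_score(query_emb, reference_embs, k=10):
    # Mean cosine distance from the query to its k nearest in-distribution
    # embeddings; higher means "further from anything seen before".
    q = query_emb / np.linalg.norm(query_emb)
    refs = reference_embs / np.linalg.norm(reference_embs, axis=1, keepdims=True)
    dists = 1.0 - refs @ q
    return float(np.sort(dists)[:k].mean())
```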
I don't know if we are talking past each other, but I don't think this conversation is about absolute probabilities? The question is about relative uncertainty, and the softmax values are just fine for that.
It is too computationally expensive, which is why nobody does this for production inference. But there are alignment tools that extract these latent-space probabilities for researchers at the frontier labs.
> The question is about relative uncertainty, and the softmax values are just fine for that.
They really aren't, especially if you consider the chain-of-thought / recursive-application case, and also that you can't assume e.g. a 0.1 difference in softmax values means the same relative difference from input to input, or that a 0.9 is always "extremely confident", etc. You really have no idea unless you are testing the calibration explicitly on calibration data.
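By "testing the calibration explicitly" I mean something like computing expected calibration error on labelled evaluation data; a rough sketch (the bin count and the eval data are whatever you choose):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: max softmax value per example; correct: 1 if the prediction was right.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between average stated confidence and actual accuracy in this bin.
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return float(ece)
```

If a 0.9 softmax really meant 90% accuracy, this would come out near zero; nothing about training guarantees that.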
> But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs
You can get embeddings; if you can get calibrated probabilities, you'll need to provide a citation, because that would be a huge deal for all sorts of applications.
I don't think it's that hard to get them to say "I don't know"
I'm pretty sure they are actively trained to avoid it.
Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
> I'm pretty sure they are actively trained to avoid it.
I'm not sure who is doing what training exactly, but I can say that (inconsistently!) some of my attempts to get it to solve problems that haven't actually been solved yet, e.g. the Collatz conjecture, end with it saying it doesn't know how to solve the problem.
Other times it absolutely makes stuff up; fortunately for me, my personality includes actually testing what it says, so I didn't fall into the sycophantic honey trap and take it seriously when it agreed with my shower thoughts, and I definitely didn't listen when it identified a close-up photo of some Solanum nigrum growing next to my tomatoes as also being tomatoes.
> Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
I'd rather it said "IDK" than made some stuff up. Them making stuff up is, as we have seen from various news stories about AI, dangerous.
It's not hard to get them to say "I don't know", and they will do so regularly. It's hard to get them to say "I don't know" reliably (i.e. to say it when they don't actually know and to not say it when they do know). And in general even for statements or tasks they do 'know' (i.e. normally get right), they will occasionally get wrong.
I’m not clear what you mean by “know.” If you mean “the information is in the model” then I mostly agree, distributional information is represented somewhere. But if you mean that a model can actually access this information in a meaningful and accurate way—say, to state its confidence level—I don’t think that’s true. There is a stochastic process sampling from those distributions, but can the process introspect? That would be a very surprising capability.
yes:
> In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally.
https://www.anthropic.com/research/introspection
Having a probability distribution to sample from is not the same thing as knowing, because they don't know anything about the provenance of the data that was used to build the distribution. They trust their training set implicitly, by construction. They have no means to detect systematic errors in their training set.
You are talking about something different. If I ask you a yes/no question, and then ask you how certain you are, the answer you give is not an objective measurement of how likely you are to be right. You don't have access to that either. If you say "I'm very confident" or "Maybe 50/50" -- that is an assessment of your own internal weighted evidence, which is the equivalent of an LLM's softmax distribution.
The difference is: I know the provenance of my evidence. Some of it I may have read in a book, some online, some I may have heard from a teacher or a professor, but some evidence I may have gathered directly from experiments I performed myself.
If you ask me “how certain are you that the standard model of particle physics is true?” I’ll answer “I don’t know” because I don’t have any subject matter expertise, and philosophically I tend to hedge on questions like this anyway (“all models are wrong, some are useful”).
However, if you ask me “how certain are you that food is bland with no salt added, tastes better with some salt, but tastes bad with too much salt?” I would answer “very certain”, because I have loads of direct experiments on this question in the kitchen. Furthermore, between these two extremes there is a whole gradient of confidence, depending on how much of my evidence is firsthand versus secondhand.
To an LLM these are identical kinds of questions. All evidence has the same provenance: the training set. As of yet, we don’t have embodied AIs (robots) with multi-modal sensory inputs and online training. Until then, what we have remains a “brain in a vat fed on tokens” which, to me, is extremely weak from an epistemic perspective.
Well, with thinking models it's not that simple. The probability distribution is over the next token. But if a model thinks before producing an answer, you can get a high-confidence next token even when repeatedly sampling the model's thinking chain would reveal that the real distribution over answers had low confidence.
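A crude way to surface that hidden uncertainty is to sample several independent reasoning chains and check how often the final answers agree; a sketch, where generate_with_cot stands in for whatever sampling call you actually have:

```python
from collections import Counter

def answer_agreement(prompt, generate_with_cot, n_samples=20):
    # Sample independent reasoning chains and keep only the final answers.
    answers = [generate_with_cot(prompt) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    # Agreement rate is a crude uncertainty proxy: 0.95 = stable, 0.4 = shaky.
    return top_answer, top_count / n_samples
```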
Oh, you mean somewhere it is tracking the statistical likelihood of the output. Yeah I buy that, although I think it just tends towards the most likely output given the context that it is dragging along. I mean it wouldn’t deliberately choose something really statistically unlikely, that’s like a non sequitur.
Well, it's not tracking. As it predicts each token it is sampling from a probability distribution -- that's what the matrix multiplies are for. It gets a distribution over all tokens and then picks randomly according to that distribution. How flat or how spiky that distribution is tells you how confident it is in its answer.
But it then throws that distribution away / consumes it in the next token calculation. So it's not really tracking it per se.
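If you did want a per-step signal before it gets thrown away, the entropy of that per-token distribution is the obvious thing to log; a sketch, with logits as an illustrative array and the usual caveat that none of this is calibrated:

```python
import numpy as np

def token_entropy(logits):
    # Softmax over the vocabulary, then Shannon entropy in bits:
    # near zero for a spiky (confident) distribution, large for a flat one.
    z = logits - np.max(logits)
    probs = np.exp(z) / np.exp(z).sum()
    return float(-(probs * np.log2(probs + 1e-12)).sum())
```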
From its point of view, what does it mean "to know"?
Is it the token (or set of tokens) that is strictly >50% probable, or just whichever token has the highest probability in the distribution?
While generating bullshit is not ideal for a lot of use cases, you don't want your premier chatbot to say "I don't know" to the general public half the time. The investment in these things requires wide adoption, so they are always going to favour the "guesses".