Comment by robrenaud

8 hours ago

LLMs do have some internal representations that predict pretty well when they are making stuff up.

https://arxiv.org/abs/2509.03531v1 - "We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets *entity-level hallucinations* -- e.g., fabricated names, dates, citations -- rather than claim-level, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B)."
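
To make the "linear probe" idea concrete, here is a minimal sketch of a token-level probe on frozen hidden states. The synthetic features, labels, layer choice, and classifier settings are my assumptions for illustration, not the paper's exact pipeline (which annotates real model outputs via web search):

```python
# Minimal sketch of a token-level linear probe for hallucination detection.
# Features and labels below are random stand-ins, so the AUC will be ~0.5;
# in the paper, X would be per-token hidden states from the LLM and y would
# mark tokens belonging to fabricated entities (names, dates, citations).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

hidden_dim = 256
n_train, n_test = 4000, 1000
X_train = rng.normal(size=(n_train, hidden_dim))   # per-token activations
y_train = rng.integers(0, 2, size=n_train)          # 1 = fabricated-entity token
X_test = rng.normal(size=(n_test, hidden_dim))
y_test = rng.integers(0, 2, size=n_test)

# The probe is just a linear classifier trained on frozen activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# At inference time the probe scores each token as it is generated,
# which is what makes streaming, real-time detection cheap.
scores = probe.predict_proba(X_test)[:, 1]
print("token-level AUC:", roc_auc_score(y_test, scores))
```

The appeal is that nothing about the base model changes: the probe reads activations the model already computes, so per-token scoring adds almost no overhead compared to methods like semantic entropy that need multiple samples.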