Comment by staticassertion

2 days ago

This assumes that the tokens it outputs are a good description of the tool's behavior. That's not necessarily true, though. For example, the LLM's training data may contain a lot of statements like "LLMs often hallucinate", so the model may be biased toward saying "I hallucinated that" even when the actual cause is some more structural issue.

I think there's something here worth considering, but it's sort of like assuming the LLM has reasons for doing things, when all it really has is weights that determine which tokens are produced - that's the sum of its reasoning.

Maybe it's the case that LLM tokens do correlate with truth values, or that this approach actually provides value, but there's probably good reason to be skeptical, given that we'd need to posit some sort of causal link between the model's token outputs and accurate reasoning about its own prior behavior.