Comment by jstummbillig

3 days ago

Simple: you can ask an LLM and get a good explanation for why it did something, which helps you avoid the bad behavior next time.

Is that reasoning? Does it know? I might care about those questions in another context, but here I don't have to. It simply works (not all the time, but increasingly so with better models, in my experience).
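Concretely, the loop I have in mind is something like this - just a minimal sketch using the Anthropic Python SDK; the model name, prompts, and the way the explanation gets carried forward are placeholders for illustration, not a recommendation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # placeholder; any recent model

def ask(messages):
    resp = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)
    return resp.content[0].text

# First attempt at some task.
history = [{"role": "user", "content": "Rename the config field without breaking callers."}]
attempt = ask(history)
history.append({"role": "assistant", "content": attempt})

# Ask it why it did the thing I didn't want.
history.append({"role": "user",
                "content": "You also rewrote unrelated tests. Why? One or two sentences."})
explanation = ask(history)

# Feed the explanation back as a standing guideline for the next run.
guideline = f"Avoid this failure mode from last time: {explanation}"
next_run = ask([{"role": "user", "content": guideline + "\n\nNow do the next task: ..."}])
```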

This assumes that the tokens it outputs are a good description of the tool's behavior. That's not necessarily true, though. For example, if a lot of the LLM's training data says "LLMs often hallucinate", it may be biased to answer "I hallucinated that" even when the real cause is some more structural issue.

I think there's something here to consider, but it's sort of like assuming that the LLM has reasons for doing things when all it has are weights determining which tokens are produced - that's the sum of its reasoning.

Maybe LLM tokens do correlate with truth values, or maybe this approach actually provides value, but there's good reason to be skeptical, given that we'd need to posit some causal link between the tokens it outputs and genuine reasoning about its prior behavior.

Nah, many times I've asked Claude about its behavior, features, etc., and it either tells me to check the Anthropic website or goes to look for it on the website itself (useless most of the time).

  • It can be damn near impossible to break them out of some loops once they've committed. Gotta trim the context back to before the behaviour started.
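By "trim the context" I mean literally cutting the message list back to just before the loop began - a rough sketch (the toy transcript and the index where the loop starts are made up; in practice you find it by eyeballing the transcript or with whatever heuristic you like):

```python
def trim_context(messages, bad_turn_index):
    """Drop everything from the turn where the looping behaviour started."""
    return messages[:bad_turn_index]

# Toy transcript: the assistant starts repeating itself at index 3.
history = [
    {"role": "user", "content": "Fix the failing test."},
    {"role": "assistant", "content": "I'll update the fixture."},
    {"role": "user", "content": "That didn't work, try again."},
    {"role": "assistant", "content": "I'll update the fixture."},  # loop begins here
    {"role": "user", "content": "Stop repeating yourself."},
    {"role": "assistant", "content": "I'll update the fixture."},
]

# Cut back to before the loop and steer in a new direction.
history = trim_context(history, bad_turn_index=3)
history.append({"role": "user", "content": "Ignore the fixture; look at the assertion instead."})
```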

I have never found an explanation of an LLM's behavior by that LLM to be reliable. Why does anyone bother? They are guessing. It's like asking Manson why he kills.