Comment by alwa

4 months ago

Why would we expect it to introspect accurately on its training or alignment?

It can articulate a plausible guess, sure; but this seems to me to demonstrate the very “word model vs world model” distinction TFA is drawing. When the model says something that sounds like alignment techniques somebody might choose, it’s playing dress-up, no? It’s mimicking the artifact of a policy, not the judgments or the policymaking context or the game-theoretical situation that actually led to one set of policies over another.

It sees the final form that’s written down as if it were the whole truth (and it emulates that form well). In doing so it misses the “why” and the “how,” and the “what was actually going on but wasn’t written about,” the “why this is what we did instead of that.”

Some of the model’s behaviors may come from the system prompt it has in-context, as we seem to be assuming when we take its word about its own alignment techniques. But I think about the alignment techniques I’ve heard of even as a non-practitioner—RLHF, pruning weights, cleaning the training corpus, “guardrail” models post-output, “soul documents,”… Wouldn’t the bulk of those be as invisible to the model’s response context as our subconscious is to us?

Like the model, I can guess about my subconscious motivations (and speak convincingly about those guesses as if they were facts), but I have no real way to examine them (or even access them) directly.

0 comments

alwa

No comments yet

Contribute on Hacker News ↗