← Back to context

Comment by NitpickLawyer

14 hours ago

What's your intuition on other "directions"? Have you tried it on something other than "refusals"? Say "correctness" in math or something like that. I have some datasets prepared for DPO on "thinking" traces that are correct / incorrect, wondering if it'd be something that could work, or if it's out of scope (i.e. correctness is not a single direction, like refusal training)

The problem is that in order to do optimization, you need a classifier that can distinguish the two types of responses (like refusal/compliance). In case of refusals, that's relatively easy to do using trigger words like "disallowed" or "I can't". I imagine this would be much, much harder to do automatically for classes like correctness.

And I also suspect, as you hint at, that "correctness" isn't just a direction in residual space, but a concept so broad that no simple mechanistic description can capture it.