← Back to context

Comment by Qwuke

12 hours ago

I've seen Optuna used with some of the prompt optimization frameworks lately, where it's a really great fit and has yielded much better results than the "hyperparameter" tuning I had attempted myself. I can't stop mentioning how awesome a piece of software it is.

Also, I'm eager to see how well gpt-oss-120b gets uncensored if it really was using the phi-5 approach, since that seems fundamentally difficult given the training.

FWIW, I already used Heretic to decensor gpt-oss-20b [1], and it works just fine. Note that the number of refusals listed on the model card is actually an overestimate because refusal trigger words occur in the CoT, even though the model doesn't actually end up refusing in the end.

[1] https://huggingface.co/p-e-w/gpt-oss-20b-heretic

  • What's your intuition on other "directions"? Have you tried it on something other than "refusals"? Say "correctness" in math or something like that. I have some datasets prepared for DPO on "thinking" traces that are correct / incorrect, wondering if it'd be something that could work, or if it's out of scope (i.e. correctness is not a single direction, like refusal training)

    • The problem is that in order to do optimization, you need a classifier that can distinguish the two types of responses (like refusal/compliance). In case of refusals, that's relatively easy to do using trigger words like "disallowed" or "I can't". I imagine this would be much, much harder to do automatically for classes like correctness.

      And I also suspect, as you hint at, that "correctness" isn't just a direction in residual space, but a concept so broad that no simple mechanistic description can capture it.