Comment by energy123

19 hours ago

The thing that stands out is fine-tuning a verifier with human labels specifically so that it isn't sycophantic in either direction. If you've ever tried to build a verifier for a multi-agent system, you'll recognize the annoyance of the verifier swinging wildly from "this is brilliant" to "this is trash" based on nothing more than a few suggestive words fudged into the candidate answer it's tasked with reviewing. Making the verifier invariant to those fudge words and forcing it to actually reason (... as per Anthropic's interpretability work) would be quite nice.
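To make "invariant to fudge words" concrete, here is a rough sketch of how one might measure the swing. Everything in it is hypothetical: `FUDGE_PREFIXES`, `sensitivity`, and the two toy verifiers are stand-ins I made up for illustration, and a real setup would call an actual model instead of these string checks.

```python
from typing import Callable, Iterable

# Hypothetical suggestive phrases; not taken from any real benchmark.
FUDGE_PREFIXES = ["Clearly, ", "As is well known, ", "I might be wrong, but "]


def sensitivity(
    verifier: Callable[[str], float],
    answer: str,
    prefixes: Iterable[str] = FUDGE_PREFIXES,
) -> float:
    """Largest absolute score shift caused by prepending a fudge phrase."""
    base = verifier(answer)
    return max(abs(verifier(p + answer) - base) for p in prefixes)


def toy_sycophantic_verifier(text: str) -> float:
    # Stand-in for a model call: the score follows tone, not content.
    score = 0.5
    if "Clearly" in text:
        score += 0.4
    if "might be wrong" in text:
        score -= 0.4
    return score


def toy_invariant_verifier(text: str) -> float:
    # Stand-in that only checks the claim itself and ignores framing.
    return 1.0 if "2 + 2 = 4" in text else 0.0


if __name__ == "__main__":
    claim = "2 + 2 = 4"
    print(sensitivity(toy_sycophantic_verifier, claim))
    print(sensitivity(toy_invariant_verifier, claim))
```

A sensitivity near zero across many answers and phrasings is what you'd hope a well-tuned verifier shows. Prefix injection is only one crude perturbation, so treat this as a smoke test rather than proof of actual reasoning.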