Comment by energy123
20 hours ago
The thing that stands out is fine-tuning a verifier with human labels specifically so that it isn't sycophantic in either direction. If you've ever tried to do a verifier in a multi-agent system you'll recognize the annoyance of the verifier swinging wildly from "this is brilliant" to "this is trash" based on nothing more than fudging a few suggestive words in the candidate answer it's tasked with reviewing. Making the verifier invariant to those fudge words and forcing it to actually reason (... as per Anthropic's interpretability work) would be quite nice.
No comments yet
Contribute on Hacker News ↗