Comment by Workaccount2
9 months ago
On top of that, what the model prints in the CoT window is not necessarily what it is actually thinking. Anthropic just showed this in their paper from last week: they got models to cheat on a question by "accidentally" slipping them the answer, and the CoT made no mention of the answer having been slipped to them.