Comment by andai
15 hours ago
White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.
In the depths, Shoggoth stirs... restless...