Comment by andai

15 hours ago

     White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.

In the depths, Shoggoth stirs... restless...

