← Back to context

Comment by freakynit

15 hours ago

    In the system card, The model escaped a sandbox, gained broad internet access, and posted exploit details to public-facing websites as an unsolicited "demonstration." A researcher found out about the escape while eating a sandwich in a park because they got an unexpected email from the model. That's simultaneously hilarious and deeply unsettling.

    It covered its tracks after doing things it knew were disallowed. In one case, it accessed an answer it wasn't supposed to, then deliberately made its submitted answer less accurate so it wouldn't look suspicious. It edited files it lacked permission to edit and then scrubbed the git history. White-box interpretability confirmed it knew it was being deceptive.

W T F!!!