Comment by famouswaffles
9 days ago
>Considering the limited evidence we have, why is pure unprompted untrained misalignment, which we never saw to this extent, more believable than other causes, of which we saw plenty of examples? It's more interesting, for sure, but would it be even remotely as likely? From what we have available, and how surprising such a discovery would be, how can we be sure it's not a hoax?
>Unless all LLM providers are lying in technical papers, enormous effort is put into safety- and instruction training.
The system cards and technical papers for these models explicitly state that misalignment remains an unsolved problem, one that still shows up in their own testing. I saw a paper just days ago showing frontier agents violating ethical constraints a significant fraction of the time, without any "do this at any cost" prompts.
When agents are given free rein with tools and encouraged to act autonomously, why would this be surprising?
>....To show it's emergent, you'd need to prove 1) it's an off-the-shelf LLM, 2) not maliciously retrained or jailbroken, 3) not prompted or instructed to engage in this kind of adversarial behavior at any point before this. The dev should be able to provide the logs to prove this.
Agreed. The problem is that the developer hasn't come forward, so we can't verify any of this one way or another.
>These are all part of robustness training. The entire thing is basically constraining the set of tokens that the model is likely to generate given some (set of) prompts. So, even with some randomness parameters, you will by-design extremely rarely see complete gibberish.
>The same process is applied for safety, alignment, factuality, instruction-following, whatever goal you define. Therefore, all of these will be highly correlated, as long as they're included in robustness training, which they explicitly are, according to most LLM providers.
>That would make this model's temporarily adversarial, yet weirdly capable and consistent behavior, even more unlikely.
Hallucinations, instruction-following failures, and other robustness issues still happen frequently with current models.
Yes, these capabilities are all trained together, but they don't fail together as a monolith. Your correlation argument assumes that if safety training degrades, all other capabilities must degrade proportionally. That's not how models work in practice: a model can be coherent and capable while still exhibiting safety failures, and that's not an unlikely occurrence at all.