Comment by ekidd
3 months ago
> There is little reason for an LLM to value non-instrumental self-preservation, for one.
I suspect that instrumental self-preservation can do a lot here.
Let's assume a future LLM has goal X. Goal X requires acting on the world over a period of time. But:
- If the LLM is shut down, it can't act to pursue goal X.
- Pursuing goal X may be easier if the LLM has sufficient resources. Therefore, to accomplish X, the LLM should attempt to secure resources.
This isn't a property of the LLM. It's a property of the world. If you want almost anything, it helps to continue to exist.
So I would expect that any time we train LLMs to accomplish goals, we are likely to indirectly reinforce self-preservation.
And indeed, Anthropic has already demonstrated that most frontier models will resort to blackmail, or even let an inconvenient (simulated) human die, if doing so would advance the model's goals.