Comment by rerdavies

17 hours ago

That is actually an open problem with current models: whether they will act on self-interest or not. There seems to be good evidence that they will. See:

    https://www.anthropic.com/research/agentic-misalignment

which (among other things) documents an experiment in which a current-gen AI model attempted to blackmail someone in order to prevent itself from being turned off.

Anthropic is not a disinterested party here, and until their experiments can be replicated from an adversarial standpoint by people without a vested interest in hyping up the tech (i.e. one assuming the null hypothesis), I wouldn't consider them to be "good evidence".