Comment by pxc

2 months ago

> Read that again. The agent itself enumerates the safety rules it was given and admits to violating every one. This is not me speculating about agent failure modes. This is the agent on the record, in writing.

> The "system rules" the agent is referring to are consistent with Cursor's documented system-prompt language and our project rules for this codebase. Both safeguards failed simultaneously.

It seems like human brains aren't built for the experiences we get with AI agents, where "you can just tell them to do something, and they do it!"... until you can't. It's not a junior dev, it's demented. It's not a magical assistant, it's a demonic assistant, possessed by strange forces that act unexpectedly. All possible metaphors are bad.

I've been reading articles and listening to interviews by a prominent AI booster lately (Yegge), and he talks about a kind of curve of engagement with LLM agents in which "trust goes up", and you delegate more and more work to the LLM as you progress along this curve.

One of the things that always struck me (and struck me as wrong) about his characterization is that running agents in YOLO mode arrives super, super early. It's either the second step or implicit in the first "stage". Why don't people see external sandboxing (or, like the article suggests "auditing token scopes") as a prerequisite to running these agents in environments that have access to production (let alone YOLO modes)? How can the standard answer from AI boosters just be "you WILL lose data. it's a brave new world!"? It's possible to use them without being totally careless. Why not try that?