
Comment by existencebox

13 hours ago

I'd argue you are still using a sandbox, just at a higher ring (outside the machine/VM), and relying on app/resource-level permissions on each of your external resources to enforce it, which requires _all_ of those external systems to be hardened, versus just the agent host itself. The capabilities a full machine has for exploring and exploiting external, ostensibly secured systems have already been touched on via incidents like the Anthropic internal model jailbreak. [0]
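To make that concrete, here's a minimal sketch of what "each resource enforces its own permissions" means once the agent holds a whole machine. All names here are invented for illustration, not anything from the post:

```python
# Hypothetical sketch: with no sandbox on the host, every external system
# must make its own allow/deny decision for the agent's principal.
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    # perms maps a principal (e.g. the agent's identity) to allowed actions
    perms: dict = field(default_factory=dict)

    def invoke(self, principal: str, action: str) -> str:
        allowed = self.perms.get(principal, set())
        if action not in allowed:
            # Denial happens at the resource, not at the agent host
            raise PermissionError(f"{principal} may not {action} on {self.name}")
        return f"{action} on {self.name} ok"

# The agent can do anything on its machine, but each resource checks independently:
repo = Resource("git-repo", {"agent": {"read"}})
print(repo.invoke("agent", "read"))      # allowed
try:
    repo.invoke("agent", "push")         # denied by the resource's own policy
except PermissionError as e:
    print(e)
```

The point of the sketch is the failure mode: any one resource that skips this check is an open door, which is why "harden _all_ of them" is the real requirement.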

Giving the agent the whole machine also doesn't answer the question of how it can hook into actions that eventually require more perms, and even if you "airgap" those via things like output queues that humans need to approve, that still feels "harnessey" to me.
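For the "output queue that humans need to approve" pattern, a rough sketch might look like the following. The class and method names are hypothetical, invented here purely to illustrate the shape of the airgap:

```python
# Hypothetical sketch: privileged actions are proposed, not executed.
# Nothing runs until a human-in-the-loop decision approves it.
import queue

class ApprovalQueue:
    def __init__(self):
        self._pending = queue.Queue()   # actions the agent wants to run
        self._approved = []             # actions a human has signed off on

    def propose(self, action: str) -> None:
        # The agent can only enqueue; it never executes directly.
        self._pending.put(action)

    def review(self, approve_fn) -> None:
        # approve_fn stands in for the human reviewer's decision.
        while not self._pending.empty():
            action = self._pending.get()
            if approve_fn(action):
                self._approved.append(action)

    def execute(self, runner) -> None:
        # Only approved actions ever reach the privileged runner.
        for action in self._approved:
            runner(action)
        self._approved.clear()

q = ApprovalQueue()
q.propose("deploy to prod")
q.propose("rm -rf /data")
q.review(lambda a: not a.startswith("rm"))  # human rejects the destructive op
q.execute(print)                            # only "deploy to prod" runs
```

Even in this toy form you can see why it reads as "harnessey": the queue, the reviewer, and the privileged runner are exactly the scaffolding components the in-sandbox pattern would also need.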

I feel a bit guilty about debating semantics here, especially as I can't/don't intend to convey any confidence in a "right answer", but my reason for being pedantic is that I do think there are interesting tradeoffs between "P(jailbreak or unexpected capability use | time)" and "increasing power/available capability set", as well as interesting primitives emerging in terms of the components you'd need regardless of where you drew that line (à la paragraph 2).

[0] - https://www-cdn.anthropic.com/3edfc1a7f947aa81841cf88305cb51... (specifically section 5.5.2.4)

The post is explicit about what they mean by sandboxing and what the tradeoffs are for the model they're discussing.

  • I've reread it and I stand by my statement that it's an isomorphism; simply replace "container" with "machine AAD/auth-system boundaries" in your example.

    The "your credentials stay out of the sandbox" problem, to quote them, is what I see your "require your perms system to enforce it" approach as implicitly solving for.

    (Their "sandbox as cattle" discussion had less bearing on the "which pattern" question for me, since I tend to treat most parts of my agent stack as cattle-like, potentially out of a bias towards that architecture broadly; I find it's much easier to reason about when as much as possible is disposable/idempotent/eventually consistent. The durable-execution point also assumed that aspects of the agent scaffold, à la prompts, don't have to be turned over on deploy, or conversely, that agents can't finish their tasks and then migrate incrementally. And while I might cynically raise an eyebrow at the focus on 25ms for sandbox calls, given the dev loops I currently experience, I'd argue there are other ways to solve that problem in both the in-container and outside-of-container sandbox patterns.)

    I'd even agree with their final point, "Consistency is the part we haven't answered", but from a different angle than they intended: it's why my focus was on "how do you _constrain_ agent behavior", since that has been, in my experience, the biggest bottleneck to letting agents do more.