Comment by niyikiza

2 days ago

> harnesses should have more assertive layers of control and constraint

Been saying this for a while and mostly getting blank stares. Relying on in-context "controls" as the primary safety mechanism is going to be a bitter lesson for our industry. What you want is a deterministic check outside the model's reasoning that decides allow/deny without consulting its opinion. Make it cryptographically verifiable if the record needs to survive a compromised orchestrator, and make it open source. If your control is a string the model can read, the model can ignore it. If it can write it, it can forge it. I'm surprised how strange that idea sounds to some people.
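A minimal sketch of what an out-of-band gate could look like (the policy table, function names, and key handling here are hypothetical, just to make the shape concrete): the allow/deny decision is a deterministic lookup the model never participates in, and the audit record is HMAC-signed with a key the model never sees, so a compromised orchestrator can read the record but not forge it.

```python
import hashlib
import hmac
import json
import time

# Held by the harness process only; never placed in model context.
AUDIT_KEY = b"rotate-me-out-of-band"

# Deny-by-default policy: (tool, target) pairs the agent may invoke.
ALLOWED = {
    ("shell", "ls"),
    ("http_get", "api.internal"),
}

def authorize(tool: str, target: str) -> bool:
    # Deterministic check: no model output is consulted here.
    return (tool, target) in ALLOWED

def audit_record(tool: str, target: str, decision: bool) -> dict:
    body = {"tool": tool, "target": target, "allow": decision, "ts": time.time()}
    payload = json.dumps(body, sort_keys=True).encode()
    # Signed record: anything downstream (including the model) can read it,
    # but without AUDIT_KEY it cannot produce a valid signature.
    body["sig"] = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return body

def gate(tool: str, target: str) -> dict:
    decision = authorize(tool, target)
    record = audit_record(tool, target, decision)
    if not decision:
        raise PermissionError(f"denied: {tool} -> {target}")
    return record
```

The point of the structure, not the specifics: `authorize` is a string-independent lookup sitting outside the reasoning loop, so there is nothing in context for the model to ignore, and the signature means a log entry either verifies or it doesn't.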

Disclosure: I'm working on an open source authorization tool for agents.

> I'm surprised how strange that idea sounds to some people.

I think a lot of people using these models genuinely believe they're more capable than they actually are, and are content to hand over a lot of trust and agency. The worrying thing is that the models are superficially hyper-capable, but at a more granular level you can see plenty of holes in their abilities. That's incredibly important, and very difficult to convey concisely. It's a classic case of nuance seeming too complicated when not caring is so much more gratifying. People love using these models.

  • Yeah, people calibrate trust to the model's median behaviour and get burned by the tail. What makes it harder is that even people who do see the holes tend to respond with better prompts and more elaborate context, which is the same trust-the-model move one level up. Hyperscalers aren't incentivized to fight that instinct either. Every "fix" routes more tokens through their meter.