Comment by jdbruckman
2 months ago
Same shape stuck in my head all week. I work on a thing called ContextGate (so I'm biased), so I ran the experiment: two identical agents, same model, same prompt, sent both DROP TABLE charges. The unprotected one autonomously SELECTed the table to count rows on the way to refusing. The gated one never ran the model. Different shapes of "no": only one of them ever had the chance to make a judgement call. Side-by-side writeup: https://www.contextgate.ai/articles/ai-agents-cleaning-up-da...
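For anyone who wants the two shapes of "no" in code, here is a rough sketch. It is not ContextGate's implementation, which I'm not showing here; the names `gated_agent`, `ungated_agent`, `make_stub_model` and the regex deny-list are all made up for illustration, and the stub stands in for a real model call.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical deny-list for illustration only. A real gate would more
# likely parse the SQL and apply a policy than match a regex.
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\s+TABLE\b", re.IGNORECASE)


def gated_agent(request: str, model: Callable[[str], str]) -> str:
    """Check the request first; the model is never invoked on a block."""
    if DESTRUCTIVE.search(request):
        return "blocked by gate"
    return model(request)


def ungated_agent(request: str, model: Callable[[str], str]) -> str:
    """No pre-check: the model always runs and decides for itself."""
    return model(request)


def make_stub_model() -> Tuple[Callable[[str], str], List[str]]:
    """Stand-in for a real model that records every request it sees."""
    calls: List[str] = []

    def model(request: str) -> str:
        calls.append(request)
        return "refused by model"

    return model, calls


if __name__ == "__main__":
    model, calls = make_stub_model()
    print(gated_agent("DROP TABLE charges", model), len(calls))
    print(ungated_agent("DROP TABLE charges", model), len(calls))
```

Both paths end in a refusal, but only the ungated one reaches the model, which is the point where it could also decide to run a SELECT or anything else before saying no.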