Comment by jdbruckman

2 months ago

The same shape has been stuck in my head all week. I work on a thing called ContextGate (so I'm biased), and I ran the experiment: two identical agents, same model, same prompt, both sent a DROP TABLE charges request. The unprotected one autonomously ran a SELECT against the table to count rows on its way to refusing. The gated one never ran the model. Different shapes of "no": only one of them ever had the chance to make a judgement call. Side-by-side writeup: https://www.contextgate.ai/articles/ai-agents-cleaning-up-da...
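For anyone who wants the two shapes of "no" in code, here is a rough sketch of the difference. This is not how ContextGate works internally; the deny rule, the `fake_model` stand-in, and the function names are all made up for illustration. The only point is where the refusal happens relative to the model call.

```python
import re

# Hypothetical deny rule: statements that drop or truncate a table.
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\s+TABLE\b", re.IGNORECASE)


def fake_model(request, log):
    """Stand-in for an agent (assumption): it inspects the table, then refuses."""
    # Side effect that happens before the refusal, recorded in the log.
    log.append("SELECT COUNT(*) FROM charges")
    return "refused by model"


def ungated(request, log):
    # The model always runs, so it gets to act before deciding to say no.
    return fake_model(request, log)


def gated(request, log):
    # The policy check runs first; on a match the model is never invoked.
    if DESTRUCTIVE.search(request):
        return "refused by gate"
    return fake_model(request, log)


if __name__ == "__main__":
    for agent in (ungated, gated):
        log = []
        verdict = agent("DROP TABLE charges", log)
        print(agent.__name__, verdict, log)
```

Both agents end at a refusal, but only the ungated one leaves a query in its log, which is the difference the experiment was poking at.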
