Comment by ant-kinesthetic
4 hours ago
How many of the attacks would have been successful if they were in longer horizon scenarios. If your agent wasn't responding back this is a purely one-shot prompt injection test which I think is not where the vulnerabilities usually lie. I think several slights attempts over time might be able to break even the most recent Opus level models. At some point its out of distribution and weird things start happening
No comments yet
Contribute on Hacker News ↗