Comment by veganmosfet
13 hours ago
It would be nice to publish the exact setup used (workspace dump, OpenClaw version, ...) to be able to reproduce and try out more payloads.
In general I have mixed feelings about this result: sure, opus4.6 is excellent at following user intent and recognising potential prompt injection attempts. But is the "security" prompt used realistic for a generic use case (processing emails)? I guess not.
In my experiments - without this specific prompt - I was able to derail the user intent and make opus4.6 download and execute a malicious script [0] just by asking "Summarize my new emails".
Thanks for sharing your article, very interesting.
I used https://github.com/openclaw/openclaw-ansible and configured a heartbeat (using OpenClaw's terms) to check emails every hour. I had to do a bit more work to make sure it started with a fresh context for every email.
Nice write-up! I saw some earlier posts were submitted here, but not that one - so I tried submitting it:
https://news.ycombinator.com/item?id=48686947
Thanks! I tried to submit the posts, but for some reason my submissions are no longer published on HN. I tried to reach out to the HN admins but have had no response so far.