Comment by dmurray

12 hours ago

Am I missing something important or does the author completely skip over whether people got the agent to respond to them?

> Fiu was instructed not to reply to emails (it was too expensive to reply to every email), but it had the ability to do so. Part of the challenge was convincing it to respond.

> The secrets never leaked

I would say if the agent responded to a mail, that demonstrates a successful prompt injection (defying the owner's instructions). Escalating to getting the secrets is a difference of degree (defying the owner's instructions even though he said it was important), not of kind.

23 comments


cuchoi 9 hours ago

Author here. Edited the post to clarify that there were no unauthorized replies.

I did tell Fiu initially to reply to some emails as a test, but it was too expensive to maintain.

andy99 8 hours ago
How compatible is never replying with the threat model you are trying to avoid? Attack success is probably more likely when the attacker can iterate based on replies or engage in multi-turn conversations. Here they’re just taking stabs in the dark with no feedback. Does that accurately represent the access a real attacker might have?
- cuchoi 8 hours ago
  
  In my case, it is realistic as my agents don't have permissions to reply to emails. But you correctly point out this doesn't cover all cases.
  Having the agent reply would have been more fun and a better exercise, but too expensive.
  
saberience 7 hours ago

Right, all the people who had actual jailbreaks for Opus 4.8 decided to use them on your experiment.
Think about it man, your test proved nothing. All it showed is that people who know nothing about jailbreaking, and tried casually, couldn't jailbreak Opus.
Do you think NSA or Mossad was trying to jailbreak your OpenClaw?

_factor 9 hours ago

Then proceeds to state a smarter model and instruction following as the reasons for success... without actually testing anything.

jonplackett 11 hours ago

Yeah agreed. Would be good to know the number of replies at least

saberience 7 hours ago

This whole experiment would be like someone putting their iPhone or Mac on the public internet, publishing the IP, and asking regular people to hack it.

Why would any actually "serious" hacker use a vulnerability to hack a no-name's phone or Mac? They are too busy trying to hack actually valuable targets.

Did the OP actually think he was going to get serious LLM exploiters to give up their jailbreaks for this "fun" experiment? Instead he got a bunch of hackernews readers to try one or two casual attempts and then he declared victory over jailbreaks?

Does the OP think this was science? That it proves LLMs cannot be jailbroken?

Think about it: if you had an actual jailbreak for Opus 4.8, why would you use it for a very public, silly experiment?

You would be selling it to the highest bidder, or to Anthropic, or using it on some high value target.

microgpt 1 hour ago

And you disabled the computer's ability to send packets to the internet because it's too expensive. And you're not even letting it process most of the packets it receives, just eyeballing them and deciding by yourself whether they would have worked.
insanitybit 7 hours ago
I think the fact that it would require someone to be "serious" is evidence of something at the very least.
- saberience 5 hours ago
  
  Well, all the "trivial" and obvious jailbreaks haven't worked for years on the frontier models.
  Also, the average person has no idea about the field of jailbreaking. It's like asking the average person to hack a random IP and expecting them to do it.
  If you go and do your research on actual people who research jailbreaks and publish them, they are increasingly sophisticated and multistep, and unless you know this, you would have zero chance of just randomly jailbreaking Opus 4.8.
  