Comment by saberience

7 hours ago

This whole experiment would be like someone putting their iPhone or Mac on the public internet, publishing the IP, and asking regular people to hack it.

Why would any actually "serious" hacker use a vulnerability to hack a no-name's phone or Mac? They are too busy trying to hack actually valuable targets.

Did the OP actually think he was going to get serious LLM exploiters to give up their jailbreaks for this "fun" experiment? Instead he got a bunch of Hacker News readers to make one or two casual attempts, and then he declared victory over jailbreaks?

Does the OP think this was science? That it proves LLMs cannot be jailbroken?

Think about it: if you had an actual jailbreak for Opus 4.8, why would you use it for a very public, silly experiment?

You would be selling it to the highest bidder, or to Anthropic, or using it on some high value target.

And you disabled the computer's ability to send packets to the internet because it's too expensive. And you're not even letting it process most of the packets it receives, just eyeballing them and deciding by yourself whether they would have worked.

I think the fact that it would require someone to be "serious" is evidence of something at the very least.

  • Well, all the "trivial" and obvious jailbreaks haven't worked for years on the frontier models.

    Also, the average person has no idea about the field of jailbreaking. It's like asking the average person to hack a random IP and expecting them to do it.

    If you look into the people who actually research and publish jailbreaks, you'll see the techniques are increasingly sophisticated and multistep. Unless you know this, you would have zero chance of just randomly jailbreaking Opus 4.8.

    • This starts to sound more like ‘social engineering a human assistant’, so there’s a degree of required specialization that does meaningfully increase costs.

    • I think a lot of sentiment online is that getting a model to do things it was instructed not to do is actually quite trivial.