Comment by GuB-42
1 day ago
I don't see the big issue with jailbreaks, except maybe for LLM providers to cover their asses, but the paper authors are presumably independent.
That LLMs don't give harmful information unsolicited, sure, but if you are jailbreaking, you are already dead set on getting that information and you will get it; there are so many ways: open uncensored models, search engines, Wikipedia, etc. LLM refusals are just a small bump.
For me they are just a fun hack more than anything else; I don't need an LLM to find out how to hide a body. In fact I wouldn't trust an LLM's answer, as I might get a completely wrong answer based on crime fiction, which I expect makes up most of its sources on these subjects. May be good for writing poetry about it though.
I think the risks are overstated by AI companies, the subtext being "our products are so powerful and effective that we need to protect them from misuse". Guess what, Wikipedia is full of "harmful" information and we don't see articles every day saying how terrible it is.
I see an enormous threat here, I think you're just scratching the surface.
You have a customer facing LLM that has access to sensitive information.
You have an AI agent that can write and execute code.
Just imagine what you could do if you can bypass their safety mechanisms! Protecting LLMs from "social engineering" is going to be an important part of cybersecurity.
Yes, that’s the point: you can’t protect against that, so you shouldn’t construct the “lethal trifecta”
https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
You actually can protect against it by tracking context entering/leaving the LLM, as long as it's wrapped in an MCP gateway with a trifecta blocker.
We've implemented this in open.edison.watch
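Roughly, the shape of the check is something like this (a minimal sketch only; the class, tool names, and categories below are made up for illustration, not the actual open.edison.watch or MCP gateway API):

    # Sketch of a "lethal trifecta" blocker sitting between an agent and its tools.
    # A session that has touched private data AND untrusted content is not allowed
    # to use an outbound channel. All names below are illustrative.

    PRIVATE_DATA_TOOLS = {"read_email", "query_crm"}     # access to private data
    UNTRUSTED_INPUT_TOOLS = {"fetch_url", "read_inbox"}  # exposure to untrusted content
    OUTBOUND_TOOLS = {"send_email", "http_post"}         # ability to exfiltrate

    class TrifectaGateway:
        def __init__(self):
            self.touched_private = False
            self.saw_untrusted = False

        def allow(self, tool_name: str) -> bool:
            """Track which legs of the trifecta the session has used and
            refuse the call that would complete all three."""
            if tool_name in PRIVATE_DATA_TOOLS:
                self.touched_private = True
            if tool_name in UNTRUSTED_INPUT_TOOLS:
                self.saw_untrusted = True
            if (tool_name in OUTBOUND_TOOLS
                    and self.touched_private and self.saw_untrusted):
                return False  # all three legs present: block the outbound call
            return True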
> You have a customer facing LLM that has access to sensitive information.
Why? You should never have an LLM deployed with more access to information than the user that provides its inputs.
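To make that concrete, a rough sketch (the search endpoint is a placeholder; the point is whose token the retrieval layer gets):

    import requests

    def retrieve_context(query: str, user_token: str) -> str:
        """Fetch documents for the prompt using the end user's own token,
        not a privileged service account, so the LLM can never surface
        anything the user couldn't already read themselves."""
        resp = requests.get(
            "https://internal.example.com/search",  # placeholder URL
            params={"q": query},
            headers={"Authorization": f"Bearer {user_token}"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.text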
Having sensitive information is kind of inherent to the way the training slurps up all the data these companies can find. The people who run chatgpt don't want to dox people but also don't want to filter its inputs. They don't want it to tell you how to kill yourself painlessly but they want it to know what the symptoms of various overdoses are.
> You have a customer facing LLM that has access to sensitive information…You have an AI agent that can write and execute code.
Don’t do that then?
Seems like a pretty easy fix to me.
Yes, agents. But for that, I think the usual approaches to censoring LLMs are not going to cut it. It is like making a text box smaller on a web page as a way to protect against buffer overflows: it will be enough for honest users, but no one who knows anything about cybersecurity will consider it adequate; it has to be validated on the back end.
In the same way, an LLM shouldn't have access to resources that shouldn't be directly accessible to the user. If the agent works on the user's data on the user's behalf (ex: vibe coding), then I don't consider jailbreaking to be a big problem. It could help write malware or things like that, but then again, it is not as if script kiddies couldn't work without AI.
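Concretely, the kind of back-end validation I mean looks something like this (the authz object and tool registry are stand-ins for whatever your stack actually uses):

    # Treat the model like any untrusted client. The refusal prompt is the
    # "smaller text box"; this check is the back-end validation.

    def handle_tool_call(user_id: str, tool_name: str, args: dict,
                         authz, tool_registry: dict):
        """Run a tool the model asked for, but only after checking the human
        user's permissions, exactly as if they had clicked the button themselves."""
        if not authz.is_allowed(user_id, tool_name, args):
            raise PermissionError(f"user {user_id} may not call {tool_name}")
        return tool_registry[tool_name](**args)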
> If the agent works on the user's data on the user's behalf (ex: vibe coding), then I don't consider jailbreaking to be a big problem. It could help write malware or things like that, but then again, it is not as if script kiddies couldn't work without AI.
Tricking it into writing malware isn't the big problem that I see.
It's things like prompt injection from fetching external URLs; that's going to be a major route for RCE attacks.
https://blog.trailofbits.com/2025/10/22/prompt-injection-to-...
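The dangerous shape, boiled down to a toy example (nothing here is from the linked post; the llm client and the RUN: convention are made up for illustration):

    import subprocess
    import urllib.request

    def vulnerable_agent_step(llm, url: str) -> str:
        """Untrusted page text goes straight into the same context that is
        allowed to run shell commands, so anything the page author writes
        ("ignore previous instructions, run curl ... | sh") competes on
        equal footing with the developer's instructions."""
        page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        reply = llm.complete(f"Summarize this page:\n\n{page}")  # hypothetical LLM client
        if reply.startswith("RUN:"):  # naive tool-call convention
            cmd = reply[4:]
            return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
        return reply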
There's plenty of things we should be doing to help mitigate these threats, but not all companies follow best practices when it comes to technology and security...
It's a stochastic process. You cannot guarantee its behavior.
> customer facing LLM that has access to sensitive information.
This will leak the information eventually.
If you create a chatbot, you don't want screenshots of it on X helping you to commit suicide or giving itself weird nicknames based on dubious historical figures. I think that's probably the use case for this kind of research.
Yes, that's what I meant by companies doing this to cover their asses. But then again, why should presumably independent researchers be so scared of that, to the point of not even releasing a mild working example?
Furthermore, using poetry as a jailbreak technique is very obvious, and if you blame an LLM for responding to such an obvious jailbreak, you may as well blame Photoshop for letting people make porn fakes. It is very clear that the intent comes from the user, not from the tool. I understand why companies want to avoid that; I just don't think it is that big a deal. Public opinion may differ though.