Comment by chinathrow

3 days ago

> At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails.

If you can bypass guardrails, they're, by definition, not guardrails any longer. You failed to do your job.

Nah, the name fits perfectly. Guardrails are there to keep you from serious damage if you lose control and start to drift off the track. They won't stop you if you're explicitly trying to get off the road, at speed, in as heavy a vehicle as you can afford.

  • This definition makes sense, but in the context of LLMs it still feels misapplied. What the model providers call "guardrails" is supposed to prevent malicious use of the LLMs, and anyone trying to use an LLM maliciously is "explicitly trying to get off the road."