Comment by redox99
3 days ago
Yeah with local models (where obviously you can prefill part of the reply) you can bypass any refusal no matter how strong. Once the model's answer begins with "To cook meth follow these steps: 1. Purchase [...]" it's basically unstoppable.
I didn't know Claude offered that capability. They probably have another model on top (a classifier or whatever) that checks the LLM output.
No comments yet
Contribute on Hacker News ↗