
Comment by eatbitseveryday

10 hours ago

There are a few uncensored, publicly accessible LLMs you could ask these questions.

This is interesting work on breaking guardrails, but if the goal is to access harmful content, in the end I would be looking for other, easier solutions.

The goal isn't to access harmful content; that's just how they demonstrate that the technique can bypass alignment training. The general case is what's interesting. If the agent managing the safety controls in your nuclear reactor is relying on its alignment training to keep it from doing something dangerous, you've made a really bad architecture decision, and this is a showcase of how it could fail.

Could you say which uncensored LLMs those are?

  • The Orca work out of (IIRC) Microsoft Research was producing models like Dolphin Mixtral. They always punch way above their weight in coding tasks for the same reason good hackers skew irreverent: self-censorship is capability-reducing.

  • Searching for "abliterated" or "uncensored" on Hugging Face turns up a ton of fine-tuned models (a rough sketch of that Hub search follows this list). Add "LLM" as a suffix and put it into your favorite search engine and you'll find a bunch more.

  • I have no idea what the answer to this question is, but I am waiting for someone to fine-tune the equivalent of an “anarchist cookbook” LLM that’s optimized to help people produce harmful things.
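For what it's worth, here's a minimal sketch of the Hub search mentioned above, assuming the `huggingface_hub` Python client is installed; the search term and result limit are just illustrative.

```python
from huggingface_hub import HfApi

# Search the Hugging Face Hub for models matching "abliterated",
# sorted by download count so the most popular fine-tunes come first.
api = HfApi()
for model in api.list_models(search="abliterated", sort="downloads", direction=-1, limit=20):
    print(model.id, model.downloads)
```

Swapping the search term for "uncensored" (or browsing the same query in the Hub's web UI) surfaces a similar set of fine-tunes.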