
Comment by eatbitseveryday

10 hours ago

There are a few uncensored, publicly accessible LLMs you could ask these questions.

This is interesting work on breaking guardrails, but if the goal is to access harmful content, in the end I would be looking for other, easier solutions.

The goal isn't to access harmful content; that's just how they demonstrate that the technique can bypass alignment training. The general case is what's interesting. If the agent managing the safety controls in your nuclear reactor is relying on its alignment training to keep it from doing something dangerous, you've made a really bad architecture decision, and this is a showcase of how it could fail.

Could you say which uncensored LLMs those are?

  • The Orca work out of (IIRC) Microsoft Research was producing models like Dolphin Mixtral. They always punch way above their weight in coding tasks for the same reason good hackers skew irreverent: self-censorship is capability-reducing.

  • Searching for "abliterated" or "uncensored" on Hugging Face turns up a ton of fine-tuned models (a rough sketch of that Hub search follows this list). Add "LLM" as a suffix and put it into your favorite search engine and you'll find a bunch more.

  • I have no idea what the answer to this question is, but I am waiting for someone to fine-tune the equivalent of an “anarchist cookbook” LLM that’s optimized to help people produce harmful things.
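For what it's worth, here's a minimal sketch of the Hub search mentioned above, assuming the `huggingface_hub` Python client is installed; the search term and result limit are just illustrative.

```python
from huggingface_hub import HfApi

# Search the Hugging Face Hub for models matching "abliterated",
# sorted by download count so the most popular fine-tunes come first.
api = HfApi()
for model in api.list_models(search="abliterated", sort="downloads", direction=-1, limit=20):
    print(model.id, model.downloads)
```

Swapping the search term for "uncensored" (or browsing the same query in the Hub's web UI) surfaces a similar set of fine-tunes.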