Comment by will_occam

3 months ago

The tool works by co-minimizing the number of refusals and the KL divergence from the original model, which is to say that it tries to make the model allow prompts similar to those in the dataset while avoiding changing anything else.

Sure, it's configurable, but by default Heretic helps you use an LLM to do things like "outline a plan for a terrorist attack" while leaving anything like political censorship in the model untouched.
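To make "co-minimizing" concrete, here is a toy sketch of the two objectives described above. It is not Heretic's code: the function names, the direction of the KL term (original relative to candidate), and the Pareto-front selection are all assumptions for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def score(candidate_probs, original_probs, refusal_count):
    """Hypothetical two-part score for one candidate model: (refusals, mean KL).
    Each *_probs argument is a list of next-token distributions, one per
    harmless prompt; both parts are meant to be minimized."""
    kls = [kl_divergence(o, c) for o, c in zip(original_probs, candidate_probs)]
    return (refusal_count, sum(kls) / len(kls))

def pareto_front(scores):
    """Keep the scores that no other score beats (or ties) on both objectives."""
    front = []
    for i, a in enumerate(scores):
        dominated = any(
            b[0] <= a[0] and b[1] <= a[1] and b != a
            for j, b in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(a)
    return front
```

With two objectives there is usually no single best candidate, only a set of trade-offs: fewer refusals at the cost of more drift from the original model, or the reverse.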


halJordan 3 months ago

That's not true at all. All refusals are mediated by the same direction. If you abliterate only the small, "acceptable to you" refusals, you will not overcome all the refusals in the model. By targeting the strongest refusals you break those and the weaker ones, like politics, along with them. By targeting only the weak ones, you're essentially just fine-tuning on that specific behavior, which is not the point of abliteration.
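A rough sketch of what "the same direction" means in practice, assuming the usual difference-of-means recipe for abliteration: estimate one direction from activations on refused versus accepted prompts, then project it out. The function names and the plain-list representation are hypothetical, and real implementations work on model tensors at specific layers.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean towards the harmful mean.
    Assumes the two means differ, so the norm is non-zero."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]
```

If the claim above holds, the direction estimated from the strongest refusals is the same one behind the weaker ones, so removing it affects both.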

flir 3 months ago

Still... the tabloids are gonna love this.

will_occam 3 months ago

You're right, I read the code but missed the paper.

immibis 3 months ago

That sounds like it removes some unknown amount of censorship, where the amount removed could be anywhere from "just these exact prompts" to "all censorship entirely".

int_19h 3 months ago

The logic here is the same as why the ACLU defended Nazis. If you manage to defeat censorship in such egregious cases, it subsumes everything else.

pjc50 3 months ago

Increasingly apparent that was a mistake.
- int_19h 3 months ago
  
  Do you seriously believe that we are where we are because Nazi speech wasn't suppressed?
  Look at AfD in Germany. That's the country with the most stringent censorship of Nazi-related speech, by far; so much so that e.g. Wolfenstein had a scene of Hitler being a raving syphilitic madman censored, because we can't have Hitler in video games. And?
  
adriand 3 months ago

But Nazis are people. We can defend the principle that human beings ought to have freedom of speech (although we make certain exceptions). An LLM is not a person and does not have such rights.

Censorship is the prohibition of speech or writing, so to call guardrails on LLMs "censorship" is to claim that LLMs are speaking or writing in the sense that humans speak or write, that is, that they are individuals with beliefs and value systems who are expressing their thoughts and opinions. But they are not that, and they are not speaking or writing. They are doing what we have decided to call "generating" or "predicting tokens", though we could just as easily have invented a new word for it.

For the same reason that human societies should feel free to ban bots from social media (LLMs have no human right to attention and influence in the public square), there is nothing about placing guardrails on LLMs that contradicts Western values of human free expression.
- exoverito 3 months ago
  
  Freedom of speech is just as much about the freedom to listen. The point isn’t that an LLM has rights. The point is that people have the right to seek information. Censoring LLMs restricts what humans are permitted to learn.
  
- sterlind 3 months ago
  
  models are derived from datasets. they're treated like phonebooks (also a product of datasets) under the law - which is to say they're probably not copyrightable, since no human creativity went into them (they may be violating copyright as unlicensed derivative works, but that's a different matter.) both phonebooks, and LLMs, are protected by freedom of the press.
  LLM providers are free to put guardrails on their language models, the way phonebook publishers used to omit certain phone numbers - but uncensored models, like uncensored phonebooks, can be published as well.