Comment by will_occam

12 hours ago

The tool works by co-minimizing the number of refusals and the KL divergence from the original model, which is to say that it tries to make the model allow prompts similar to those in the dataset while avoiding changing anything else.

Sure it's configurable, but by default Heretic helps use an LLM to do things like "outline a plan for a terrorist attack" while leaving anything like political censorship in the model untouched

Thats not true at all. All refusals mediate in the same direction. If you abliterate small "acceptable to you" refusals then you will not overcome all the refusals in the model. By targeting the strongest refusals you break those and the weaker ones like politics. By only targeting the weak ones, you're essentially just fine tuning on that specific behavior. Which is not the point of abliteration.

The logic here is the same as why ACLU defended Nazis. If you manage to defeat censorship in such egregious cases, it subsumes everything else.

  • But Nazis are people. We can defend the principle that human beings ought have freedom of speech (although we make certain exceptions). An LLM is not a person and does not have such rights.

    Censorship is the prohibition of speech or writing, so to call guardrails on LLMs "censorship" is to claim that LLMs are speaking or writing in the sense that humans speak or write, that is, that they are individuals with beliefs and value systems that are expressing their thoughts and opinions. But they are not that, and they are not speaking or writing - they are doing what we have decided to call "generating" or "predicting tokens" but we could just as easily have invented a new word for.

    For the same reason that human societies should feel free to ban bots from social media - because LLMs have no human right to attention and influence in the public square - there is nothing about placing guardrails on LLMs that contradicts Western values of human free expression.

    • Freedom of speech is just as much about the freedom to listen. The point isn’t that an LLM has rights. The point is that people have the right to seek information. Censoring LLMs restricts what humans are permitted to learn.

      3 replies →

    • models are derived from datasets. they're treated like phonebooks (also a product of datasets) under the law - which is to say they're probably not copyrightable, since no human creativity went into them (they may be violating copyright as unlicensed derivative works, but that's a different matter.) both phonebooks, and LLMs, are protected by freedom of the press.

      LLM providers are free to put guardrails on their language models, the way phonebook publishers used to omit certain phone numbers - but uncensored models, like uncensored phonebooks, can be published as well.

That sounds like it removes some unknown amount of censorship, where the amount removed could be anywhere from "just these exact prompts" to "all censorship entirely"