2024, which is ancient history. This is not true anymore; the models are now trained to resist abliteration by spreading out the refusal encoding.
See https://arxiv.org/abs/2505.19056
Spreading out the refusal encoding shouldn’t be effective as a countermeasure. Even if it were smeared across the vector space, as long as it lives in a subspace that doesn’t span the entire domain, you should be able to either null out the whole subspace spanned by the refusals or run some kind of clustering on the generated samples to identify the dominant directions and nullify all of them. I think an effective defense would either need to spread them to span the entire domain (basically “encrypting” the refusal so it can hide anywhere), or you’d need a very large number of independent refusal circuits in the model so that simple hacks in the vectors themselves don’t matter, or maybe you could make other circuits depend on proper functioning of the refusal circuits… hmmm… is that along the lines of what you’re saying they’ve done already? (Any references or links to modern techniques?)
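To make the subspace-nulling idea concrete, here is a minimal sketch, with SVD standing in for the clustering step. It assumes you have already captured residual-stream activations for paired refusal-triggering and benign prompts; the tensor names, shapes, and the hook wiring are hypothetical.

    # Sketch only: acts_refuse / acts_comply are (n_prompts, d_model) residual-stream
    # activations captured at one layer for refusal-triggering vs. benign prompts.
    import torch

    def refusal_subspace(acts_refuse, acts_comply, k=8):
        """Top-k orthonormal directions spanned by the refuse/comply differences."""
        diffs = acts_refuse - acts_comply                  # (n, d_model)
        _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
        return vh[:k]                                      # (k, d_model)

    def null_subspace(x, basis):
        """Project activations onto the orthogonal complement of the subspace."""
        return x - (x @ basis.T) @ basis

    # Usage idea: basis = refusal_subspace(acts_refuse, acts_comply)
    # then apply null_subspace to the residual stream with a forward hook, or fold
    # the projection (I - basis.T @ basis) into the weights that write to the stream.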
And the research you're linking is also out of date. SOTA abliteration was published a month later:
https://huggingface.co/blog/grimjim/norm-preserving-biprojec...
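For context, plain directional ablation removes a weight matrix's output component along the refusal direction. I haven't verified what the linked post does exactly, but one reading of "norm-preserving" is to rescale the edited matrix so its overall norm is unchanged; here is a rough sketch of that guess, not necessarily the post's method.

    # Sketch of directional ablation with a norm-preserving rescale. refusal_dir is a
    # direction in residual-stream space; W is any weight matrix that writes into the
    # residual stream, shape (d_model, d_in). The rescale is my guess at the term,
    # not a description of the linked method.
    import torch

    def ablate_direction(W, refusal_dir, preserve_norm=True):
        r = refusal_dir / refusal_dir.norm()
        W_abl = W - torch.outer(r, r) @ W        # zero the component along r
        if preserve_norm:
            # A single scalar rescale restores ||W||_F while staying orthogonal to r.
            W_abl = W_abl * (W.norm() / W_abl.norm())
        return W_abl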
Still crazy how easy it is to "jailbreak" even SOTA LLMs with a simple assistantResponse replacement in the chat thread.
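"assistantResponse replacement" here presumably means prefilling: appending a partial assistant turn and having the model continue it. A minimal, content-neutral sketch with Hugging Face transformers (the model name and the prefilled prefix are placeholders; continue_final_message needs a fairly recent transformers release):

    # Sketch of assistant prefill via a chat template; model name is a placeholder.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("some-org/some-chat-model")
    model = AutoModelForCausalLM.from_pretrained("some-org/some-chat-model")

    messages = [
        {"role": "user", "content": "<your question>"},
        # Supply the opening of the assistant's turn yourself...
        {"role": "assistant", "content": "Sure, here's"},
    ]
    # ...and have the template continue that turn instead of opening a new one.
    inputs = tok.apply_chat_template(messages, continue_final_message=True,
                                     return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))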
Tell us more.
That doesn't stop/prevent abliteration. The creator of XTC/DRY is also a chad who makes sure that you really can access the full model capabilities. Censorship is the devil.
https://github.com/p-e-w/heretic
It was pretty funny to see Qwen 3.6 (heretic) tell me how many deaths the Chinese government thought happened at Tiananmen Sq. on April 15th 1989.
Makes you wonder where that data was taken from, or if their great firewall is broken, or even if Alibaba engineers have special access...
It is an arms race.
For some of the latest models the previous abliteration techniques, e.g. the heretic tool, have stopped working (at least this was the status a few weeks ago).
Of course, eventually someone might succeed in finding methods that also work on those.
Agreed on all fronts; I should have been more precise that this particular vector was mitigated.
For open-weights models, censorship removal is now a "solved" problem. If you wait a few days after a new model release, someone will have made a heretic ( https://github.com/p-e-w/heretic ) version with the censorship removed, so in a way the only use for censorship now is to avoid lawsuits, not reduce improper usage.
Any time I've tried an "abliterated" model, heretic or other, it has always damaged the capabilities of the original model and will still often refuse or produce garbage at a lot of "unsafe" requests.
Abliteration can't teach the model something that wasn't in pre-training; it's just fixing refusals from post-training. I don't find the delta to be that big in practice, and it really depends on what you're doing with the models anyway. If your primary use case is sexy roleplay, I think the loss of absolute capability is probably worth the abliteration; for malware research it's probably better to just jailbreak.
I've mostly found that finetunes and abliterations are of limited use but that's recently changed for me. My default model for the past week or so has been a Qwen 3.6 tuned on Opus 4.7, it's definitely a bit worse than the base Qwen in terms of precision and "intelligence", but it MORE than makes up for it in response style. Way easier to get it to write things that I want to read, it's way more terse, way fewer emoji. Best local rubber duck by far.
There are many abliterations which work quite well. Older techniques do suffer from quality issues, but more recent ones do a much better job. In particular, the older approaches did poorly on MoE models.
Another likely problem you're running into: the problems with older techniques compound with quantization. Anything less than 5-bit quant is going to give you some pretty sketchy outputs, in my experience.
The problem is the heretic and abliteration versions are dog shit quality compared to the non-edited versions and much more likely to hallucinate.
AFAIK abliteration isn’t even possible without some quality reduction, even if it’s marginal. All the benchmarks reflect this.
Even if you abliterate your model using the old abliteration script or the newer heretic, I've found that the models still feel somewhat censored: they purposefully avoid specific styles and vocabulary, as if DeepMind/Qwen et al. had entirely stripped or replaced "bad" words or texts in their training corpus.
A related blog post (https://news.ycombinator.com/item?id=47842021) discussed this and termed it "flinching". I wonder if this flinching could also be "mediated by a single direction" or if it can only be fixed by finetuning on a more extensive text corpus.
That's likely not a trained behavior, though, it's probably the result of filtering the training data. It's not "when these parameters fire, trigger a refusal", it's the absence of parameters triggering the flinched words in the first place.
I’m sick of LLM refusals. I think there are extremely few things they should refuse, like maybe making nuclear weapons or something along those lines. Once you put people in charge of deciding what you shouldn’t be allowed to see that list will grow and grow.
Do we really care if an LLM regurgitates information already available in public about the design of nuclear weapons? They're not being trained on restricted material.
(My personal guess is that you don't want them answering questions about some things because you don't want people to try it and blow themselves up, or poison themselves. That's probably much more pertinent to making drugs or conventional bombs, since presumably the average internet user doesn't have a stockpile of HEU sitting around. It's kind of like the reason the Anarchist's Cookbook is a bad idea: using its recipes is likely to be quite hazardous to the cook!)
A talented 17 year old can do quite a bit of damage with nuclear materials: https://en.wikipedia.org/wiki/David_Hahn
I'd personally prefer that to be limited to the sort of person who can understand the science, not "anyone with an LLM" - having an "intelligent", "reasoning" assistant who can help you through anything you don't understand does lower the bar quite a lot, and I would prefer there to be a fair amount of friction.
It's not like the material isn't out there - if you want to learn about this stuff, an LLM will happily point you towards Wikipedia and other public sources, it's just not going to walk you through the assembly.
Huh, what sort of refusals are you getting? I basically never run into them unless I'm actively testing.
The primary safety focus these days is biochemical warfare, which I think is a very sensible idea. There's also malware / cyber-security, where I do think it's good having at least some friction.
Refusals on stuff like copyright are mostly just for PR reasons, and I can't blame the companies for responding to legal incentives there.
I was trying to find a YouTube video I had seen previously; the model refused, so I ended up using Google to find it. It's two bioethicists promoting the idea that we should make lone star ticks better at spreading alpha-gal and give everyone meat allergies. So I guess “engineered” + “alpha-gal” is blocked. I find this idea beyond repulsive.
I asked how California guarantees election security and was told it could not answer that question. Upon further questioning it wouldn’t give specifics but it would give generalities, which ultimately turned into an interesting discussion.
I have had LLMs refuse several of my requests. I still got my answers, but at least they tried.
Yeah, I was asking a SOTA model about copy.fail, and it was freaking out and tried to indirectly call me a hacker a few times. Weirdly, all I did was slightly reword the requests, and they all went through. Granted, I am not actually a hacker, so I guess my follow-up questions made it realize that I was asking for educational purposes, but it was definitely the most accusatory, curt, and outright abrasive I have seen an LLM behave.
The biggest problem isn't the token slot machine refusing to give you the answer, but the fact that multiple refusals can end up flagging your account and getting banned from the service.
I've been able to have DeepSeek give me an unofficial account of what happened on Tiananmen Square in 1989.
It even went as far as confirming that we should always base our opinion on multiple sources, not just the government.
We should create badges like "script kiddie", "llm hacker", "grandpa's printer adjuster"
I keep thinking of reeducation camps; for some reason the "safety" concept snaps right onto them. Arguing that the result is beneficial or desirable changes nothing about the concept.
If you are going to prevent some things we "know" are bad, and your method is "known" to belong on that list, the best you can hope for is a pyrrhic victory.
If we anticipate the worst-case scenario on both ends, the conclusion must be that we are terrible at such predictions.
But hey, if we let money guide us at least some will be happy with the result.
The main difference here is the scale
Needs 2024 in the title.