Comment by Aurornis
1 day ago
Abliteration is a brute force technique that removes or silences parts of the model. It reduces performance because the abliterated elements aren’t perfectly isolated to censorship, so other aspects suffer.
Many of the “uncensored” model providers also do some fine tuning on the models. Some of them target better benchmarks or other measures, but outside of the benchmarks and metrics they’re fine tuned for they are generally noticeably worse than the original model.
The kind of abliteration you are mentioning is no longer state of the art, nor is it the most common way of removing the refusal layer in most models. Your understanding was up to date about a year and a half ago, but it has been out of date since.
What OP is describing wasn't called abliteration at all.
Abliteration, whilst a neologism, implies a surgical ablation of refusal.
Earlier approaches post-trained the model to refuse less and, much like other kinds of fine-tuning, this degraded performance. Those models were called "uncensored".
Abliteration has improved since then, but even early on it stayed close to the original model's performance, unlike those earlier techniques.
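For anyone wondering what "surgical" means here, this is a rough sketch of directional ablation as I understand it: estimate a "refusal direction" as the difference between mean activations on prompts the model refuses and prompts it answers, then project that direction out of a weight matrix. Everything below is a toy assumption on my part (pure Python, random vectors standing in for activations, hypothetical function names), not the code of any particular tool.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    # Column-wise mean of a list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(refused_acts, answered_acts):
    # Difference of means, normalized to unit length.
    diff = [a - b for a, b in zip(mean_vector(refused_acts),
                                  mean_vector(answered_acts))]
    norm = dot(diff, diff) ** 0.5
    return [x / norm for x in diff]

def ablate(weight, r):
    # W' = W - r (r^T W): outputs of W' have no component along r.
    # weight is a list of d_out rows, each of length d_in; r has length d_out.
    d_out, d_in = len(weight), len(weight[0])
    r_t_w = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * r_t_w[j] for j in range(d_in)]
            for i in range(d_out)]

def matvec(weight, x):
    return [dot(row, x) for row in weight]

if __name__ == "__main__":
    random.seed(0)
    d = 4
    # Made-up "activations": the refused set is shifted along one axis.
    refused = [[random.gauss(0, 1) + (2.0 if j == 0 else 0.0) for j in range(d)]
               for _ in range(50)]
    answered = [[random.gauss(0, 1) for _ in range(d)] for _ in range(50)]
    r = refusal_direction(refused, answered)
    W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    x = [random.gauss(0, 1) for _ in range(d)]
    # Should be zero up to floating-point error.
    print(abs(dot(r, matvec(ablate(W, r), x))))
```

The point of the projection is that only one direction is removed and the rest of the weights are left alone, which is why it tends to cost less than retraining the whole model to refuse less.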
Unrelated but I’ve been putting off learning about post-abliteration technique and want to use it for an upcoming open source “retraining” project I have on my backlog. I’m not interested in the refusal layers though, more like deep fine tuning but in a way that might let me prune out or consolidate layers, if that makes sense? Do you have any pointers or links to the current SOTA in this area?
I guess I’m looking for a kind of bulk/sticky dropout (which was in fashion way back when I studied DNN in school).
Nowadays it is that Heretic tool, is it not? I’ve seen Gemma models uncensored with it.