Comment by hexaga
4 hours ago
The terminology comes from the post[0] which kicked off interest in orthogonalizing weights w.r.t. a refusal direction in the first place. That is, abliteration was not originally called abliteration, but refusal ablation.
Ultimately though, OP is just what you get if you take the idea of abliteration and tell an LLM to fix its core problems: refusal isn't always exactly a rank-1 subspace, isn't the same throughout the net, isn't nicely isolated to one layer/module, ablating it damages capabilities, and so on.
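(For anyone unfamiliar: the core abliteration operation is just projecting a single "refusal direction" out of the weights. A minimal numpy sketch, with illustrative names `W` and `r` that aren't from any real implementation, and treating refusal as exactly rank-1, which is the assumption being criticized above:)

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8
W = rng.normal(size=(d_model, d_model))  # stand-in for one layer's output-projection weight
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)                   # unit-norm "refusal direction"

# Orthogonalize the weights w.r.t. r:  W' = (I - r r^T) W
# removes the component of W's output along r (a rank-1 ablation).
W_abl = W - np.outer(r, r) @ W

# After ablation the layer can no longer write anything along r:
print(np.allclose(r @ W_abl, 0))  # → True
```

The criticisms in the comment above are exactly about where this picture breaks: one fixed `r` per model, applied the same way everywhere, isn't faithful to how refusal is actually represented.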
The model looks at that list and applies typical AI one-off 'workarounds' to each problem in turn while hyping up the prompter, and you get this slop pile.
[0]: https://www.lesswrong.com/posts/refusal-in-llms-is-mediated-...
No offense, but a LessWrong link is an immediate yellow flag, especially on the topic of AI. I can't say whether that article in particular is bad, but it keeps company with a whole lot of abject nonsense written by people who get high on their own farts.
Regardless, it is the origin of abliteration. Other extremely similar things have been done before, but the popularized idea/name is from that.