Comment by Retr0id
8 hours ago
I don't know if this particular tool/approach is legit, but LLM ablation is definitely a thing: https://arxiv.org/abs/2512.13655
Doesn't look legit to me. You are talking about abliteration, which is real. But the tool OP linked is doing a novel and very dumb kind of ablation: zeroing out huge components of the network, or zeroing out isolated components in a way that indicates extreme ignorance of the basic math involved.
Compared to abliteration, none of this tool's ablation approaches make even half a whit of sense if you understand even the most basic aspects of, e.g., a Transformer LLM architecture, so my guess is this is BS.
The terminology comes from the post[0] which kicked off interest in orthogonalizing weights w.r.t. a refusal direction in the first place. That is, abliteration was not originally called abliteration, but refusal ablation.
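For context, the orthogonalization idea mentioned above is mathematically simple. A minimal sketch, assuming a unit "refusal direction" `r` has already been extracted (e.g. as a difference of mean activations on harmful vs. harmless prompts, per the linked post) and a weight matrix `W` writes into the same residual-stream space — all names here are illustrative, not the tool's actual code:

```python
import numpy as np

# Hypothetical sketch of rank-1 refusal ablation via weight orthogonalization.
# W' = (I - r r^T) W, so the output of W' has no component along r.
rng = np.random.default_rng(0)
d_model, d_in = 64, 32

W = rng.standard_normal((d_model, d_in))  # e.g. an MLP output projection
r = rng.standard_normal(d_model)
r /= np.linalg.norm(r)                    # unit refusal direction (assumed given)

# Project the refusal direction out of every column of W.
W_ablated = W - np.outer(r, r @ W)

# Check: for any input x, the ablated output is orthogonal to r.
x = rng.standard_normal(d_in)
print(abs(r @ (W_ablated @ x)))  # ~0, up to floating point
```

This is the whole trick: a rank-1 projection applied to selected weight matrices, which is quite different from zeroing out entire components of the network.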
Ultimately, though, OP's tool is just what you get if you take the idea of abliteration and tell an LLM to fix its core problems: that refusal isn't actually always an exact rank-1 subspace, nor the same throughout the net, nor nicely isolated to one layer/module; that ablating it damages capabilities; and so on.
The model looks at that list and applies typical AI one-off 'workarounds' to each problem in turn while hyping up the prompter, and you get this slop pile.
[0]: https://www.lesswrong.com/posts/refusal-in-llms-is-mediated-...
No offense, but a Lesswrong link is an immediate yellow flag, especially on the topic of AI. I can't say whether that article in particular is bad, but it is associated with a whole lot of abject nonsense written by people who get high on their own farts.