Comment by hexaga
4 hours ago
What do you mean? It's a spin on abliteration / refusal ablation. Roughly, from what I remember abliteration is:
1. find a direction corresponding to refusal by analyzing activations at various parts of a model (iirc, via mass means seen earlier in Marks, Tegmark and shown to work well for similar tasks)
2. find the best part(s) of the model to orthogonalize w.r.t. that direction and do so (exhaustive search w/ some kind of benchmark)
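The two steps above can be sketched roughly as follows. This is a toy illustration on synthetic data, not the code from any abliteration repo: the function names `refusal_direction` and `orthogonalize`, the shapes, and the Gaussian "activations" are all assumptions made for the example, and the search over which layers to edit in step 2 is left out.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Step 1 (assumed form): difference of class means, scaled to unit length."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(W, r):
    """Step 2 (assumed form): remove the component along r from the output of W.

    W has shape (d_model, d_in) and writes into the residual stream;
    r is a unit vector of shape (d_model,).
    """
    return W - np.outer(r, r @ W)

# Synthetic stand-in for activations: "harmful" prompts are shifted along axis 0.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)

# A hypothetical weight matrix that writes into the residual stream.
W = rng.normal(size=(d_model, 8))
W_abl = orthogonalize(W, r)

print("alignment with planted direction:", float(r @ true_dir))
print("max output along r after ablation:", float(np.abs(r @ W_abl).max()))
```

After the edit, nothing `W_abl` writes has a component along `r`, which is all "orthogonalizing w.r.t. that direction" means here. Which matrices to apply this to is the part that needs a benchmark.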
OP is swapping in SVD for mass means (1), and the 'ablation study' for (2), and a bunch of extra LLM slop for... various reasons. The final model doesn't have zeroed chunks; that is the search for which parts to orthogonalize/refusal ablate/abliterate. I don't have confidence that it works very well either, but it isn't 'braindead' / obvious garbage in the way you're describing.
It's LLMified but standard abliteration. The idea has fundamental limitations and LLMs tend to work sideways at it -- there's not much progress to be made without rethinking it all -- but it's very conceptually and computationally simple and thus attractive to AIposters.
You can see how the LLMs all come up with the same repackaged ideas: SVD does something deeply similar to mass means (and yet isn't exactly equivalent, so an LLM will _always_ suggest it), the various heuristic search strategies are competing against plain exhaustive search (which is... exhaustive already), and any time you work with tensors the LLM will suggest clipping/norms/smoothing of N flavors "just to be safe". And each of those ends up listed as "Novel" when it's just defensive null checks translated to pytorch.
I mean, the whole 'distributed search' thing is just because of how many combinations of individual AI slops need to be tested to actually run an eval on this. But the idea is sound! It's just terrible.
I'm not defending the project itself -- I think it's a mess of AIisms of negligible value -- but please at least condemn it w.r.t. what is actually wrong and not 'on vibes'.
wait, SVD / zeroing out the first principal component is an unsupervised technique. The earlier difference-of-means technique relies on the knowledge of which outputs are refusals and which aren’t. How would SVD be able to accomplish this without labels?
edit: the reference is https://arxiv.org/pdf/2512.18901
they are randomly sampling two sets of refusal/nonrefusal activation vectors, stacking them, and taking the elementwise difference between these two matrices. Then they use SVD to get the top k principal components. These are the directions they zero out.
Seems to me that the top principal component should be roughly equivalent to the difference-of-means vector, but wouldn’t the other PCs just capture the variance among the distributions of points sampled? I don’t understand why that’s desirable
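
A quick way to check that intuition on synthetic data is sketched below. Everything here is an assumption made for the toy: isotropic Gaussian activations with a single planted mean shift, random pairing of rows, and an SVD of the uncentered difference matrix. None of it is taken from the paper, so it only shows what happens in this idealized setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 500, 16

# Planted mean shift between the two classes along axis 0 (assumption).
shift = np.zeros(d_model)
shift[0] = 4.0
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model)) + shift

# Difference-of-means direction, unit length.
dom = harmful.mean(axis=0) - harmless.mean(axis=0)
dom /= np.linalg.norm(dom)

# Randomly paired, elementwise difference of the stacked samples (not centered).
D = harmful - harmless
_, s, Vt = np.linalg.svd(D, full_matrices=False)

# Cosine similarity of each right singular vector with the difference of means.
cos_top = abs(Vt[0] @ dom)
cos_rest = np.abs(Vt[1:] @ dom)

print("top component vs difference of means:", float(cos_top))
print("largest overlap among the remaining components:", float(cos_rest.max()))
print("leading singular values:", s[:4])
```

In this toy the first right singular vector lines up almost exactly with the difference of means, and the rest are orthogonal to it by construction, so they can only pick up the within-sample spread. If the difference matrix were centered first, the mean shift would be subtracted out and even the top component would be describing variance rather than the class difference.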