Comment by gcr
3 hours ago
Wait, SVD (zeroing out the first principal component) is an unsupervised technique. The earlier difference-of-means technique relies on knowing which outputs are refusals and which aren’t. How would SVD accomplish this without labels?
edit: the reference is https://arxiv.org/pdf/2512.18901
They randomly sample two sets of activation vectors (refusal and non-refusal), stack each set into a matrix, and take the elementwise difference between the two matrices. Then they use SVD to get the top k principal components. These are the directions they zero out.
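A rough NumPy sketch of that procedure, as described above and not taken from the paper's code. The function names, the toy Gaussian "activations", and the choices of `k` and `n_pairs` are all made up for illustration. Note that the labels enter in the sampling step: the SVD itself is unsupervised, but the matrix it sees is built from labeled sets.

```python
import numpy as np

def top_k_difference_directions(refusal_acts, nonrefusal_acts, k, n_pairs, seed=0):
    """Top-k right singular vectors of a matrix of randomly paired differences.

    Hypothetical helper: pairs are drawn at random, one row from each labeled set.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(refusal_acts), size=n_pairs)
    j = rng.integers(0, len(nonrefusal_acts), size=n_pairs)
    diffs = refusal_acts[i] - nonrefusal_acts[j]           # shape (n_pairs, d)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)   # rows of vt are orthonormal
    return vt[:k]                                          # shape (k, d)

def ablate(acts, directions):
    """Zero out the component of each activation along every given direction."""
    return acts - (acts @ directions.T) @ directions

# Toy data (an assumption): isotropic noise, with the refusal set shifted along axis 0.
rng = np.random.default_rng(0)
d = 16
nonrefusal = rng.normal(size=(500, d))
refusal = rng.normal(size=(500, d))
refusal[:, 0] += 3.0

dirs = top_k_difference_directions(refusal, nonrefusal, k=3, n_pairs=200)
cleaned = ablate(refusal, dirs)

# Largest remaining projection onto the removed directions (should be ~0).
print(np.abs(cleaned @ dirs.T).max())
```

Since the rows of `dirs` are orthonormal, `ablate` is just an orthogonal projection onto the complement of their span.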
Seems to me that the top principal component should be roughly equivalent to the difference-of-means vector, but wouldn’t the other PCs just capture the variance within the distributions of points sampled? I don’t understand why that’s desirable.
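A quick toy check of that intuition, on synthetic Gaussian data only (it says nothing about real model activations or about what the paper measures). The setup is an assumption: both classes share the same noise, with extra variance along axis 1, and the refusal class is shifted along axis 0. If the intuition holds, the first direction should line up with the difference of means and the second with the high-variance noise axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 8

# Shared noise for both classes, with a larger standard deviation along axis 1.
noise_scale = np.ones(d)
noise_scale[1] = 2.0
nonrefusal = rng.normal(size=(n, d)) * noise_scale
refusal = rng.normal(size=(n, d)) * noise_scale
refusal[:, 0] += 5.0  # the class difference lives on axis 0

# Supervised baseline: unit-norm difference-of-means vector.
mean_diff = refusal.mean(axis=0) - nonrefusal.mean(axis=0)
mean_diff /= np.linalg.norm(mean_diff)

# Randomly paired differences, then SVD (no centering).
diffs = refusal - nonrefusal[rng.permutation(n)]
_, s, vt = np.linalg.svd(diffs, full_matrices=False)

cos_top = abs(vt[0] @ mean_diff)      # alignment of the first direction with the mean difference
second_on_noise_axis = abs(vt[1, 1])  # weight of the second direction on the noisy axis

print("cosine(top direction, difference of means):", cos_top)
print("weight of second direction on noise axis:", second_on_noise_axis)
```

In this toy setting the mean shift only wins the top slot because it is large relative to the noise; with a smaller shift or noisier axes the ordering could change.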