Comment by oli5679
6 days ago
This ties directly into the superposition hypothesis.
Dense models are believed to cram many features into shared weights, which makes circuits hard to interpret.
Sparsity reduces that pressure by giving each feature more isolated space, so individual neurons are more likely to represent a single, interpretable concept.
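As a rough illustration of that interference story, here is a toy numpy sketch (my own assumed setup and numbers, not taken from any particular paper): when many features share fewer dimensions, reading one feature back out picks up interference from whatever else is active, and that interference shrinks when only a few features are active at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 64, 16  # far more candidate features than dimensions

# Each feature gets a random unit direction. They all have to share the
# 16-dim space, so they can't be orthogonal -- that's the superposition picture.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def readout_error(active_idx):
    """Superimpose the active features, then read feature 0 back out with a
    dot product and measure how much the other active features interfere."""
    x = W[active_idx].sum(axis=0)             # superimposed activation vector
    target = 1.0 if 0 in active_idx else 0.0  # true value of feature 0
    return abs(x @ W[0] - target)

def mean_error(n_active, n_trials=500):
    errs = [readout_error(rng.choice(n_features, size=n_active, replace=False))
            for _ in range(n_trials)]
    return np.mean(errs)

print(f"dense  (32 features active at once): mean readout error {mean_error(32):.2f}")
print(f"sparse ( 3 features active at once): mean readout error {mean_error(3):.2f}")
```

The 32-active case should show noticeably larger readout error than the 3-active case, which is the basic reason models can get away with superposition when features fire sparsely.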
Yes, although the sparsity doesn't need to be inherent to the model - another approach is to decompose the learned representations (the activations) after the fact, using methods like sparse autoencoders or transcoders.
https://transformer-circuits.pub/2025/attribution-graphs/met...
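For readers who haven't seen one, here is a minimal sparse-autoencoder sketch in PyTorch: it learns to reconstruct a model's activations through a wider, mostly-zero feature layer. The dimensions, penalty weight, and training setup are illustrative assumptions, not the recipe from the linked write-up (a transcoder is similar, but maps a layer's inputs to its outputs through the sparse bottleneck instead of reconstructing the same activations).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a wider, mostly-zero feature basis."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input activations
        return recon, features

# Train to reconstruct activations while penalizing feature density (L1).
sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # assumed sparsity penalty weight

acts = torch.randn(64, 512)  # stand-in for activations captured from a model
opt.zero_grad()
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
opt.step()
```

The resulting feature directions (rows of the decoder) are what you then try to interpret, since each one tends to fire for a narrower concept than a raw neuron does.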
I'm also very excited about SAE/transcoder-based approaches! I think the big tradeoff is that our approach (circuit sparsity) aims for a complete understanding at any cost, whereas Anthropic's Attribution Graph approach is more immediately applicable to frontier models, but gives handwavier circuits. It turns out "any cost" is really quite a lot of cost - we think this cost can be reduced a lot with further research, but it means our main results are on very small models, and the path to applying any of this to frontier models involves a lot more research risk. So if accepting a bit of handwaviness lets us immediately do useful things on frontier models, that seems like a worthwhile direction to explore.
See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093