Comment by HarHarVeryFunny

6 days ago

Yes, although the sparsity doesn't need to be inherent to the model - another approach is to decode what the model has learned using techniques like sparse autoencoders or transcoders.

https://transformer-circuits.pub/2025/attribution-graphs/met...
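For concreteness, here's a minimal sketch of the basic SAE idea (not the exact recipe from either paper - the dimensions and L1 coefficient below are illustrative placeholders): learn an overcomplete dictionary of features such that each activation vector is reconstructed from only a few active features.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Decomposes model activations into an overcomplete set of
        sparsely-active, hopefully interpretable features."""
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, acts: torch.Tensor):
            features = torch.relu(self.encoder(acts))  # sparse feature activations
            recon = self.decoder(features)             # reconstructed activations
            return recon, features

    def sae_loss(recon, acts, features, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that pushes each
        # activation vector to be explained by only a few features.
        mse = (recon - acts).pow(2).mean()
        return mse + l1_coeff * features.abs().mean()

    # Toy usage: decompose a batch of residual-stream activations.
    acts = torch.randn(8, 512)
    sae = SparseAutoencoder(d_model=512, d_features=4096)
    recon, feats = sae(acts)
    loss = sae_loss(recon, acts, feats)

A transcoder is the same idea applied across a layer: instead of reconstructing its own input, it's trained to map a layer's input to that layer's output through the sparse feature bottleneck.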

I'm also very excited about SAE/transcoder-based approaches! I think the big tradeoff is that our approach (circuit sparsity) aims for a complete understanding at any cost, whereas Anthropic's attribution graph approach is more immediately applicable to frontier models but yields handwavier circuits. It turns out "any cost" really is quite a lot of cost - we think it can be reduced substantially with further research, but for now it means our main results are on very small models, and the path to applying any of this to frontier models carries a lot more research risk. So if accepting a bit of handwaviness lets us do useful things on frontier models immediately, that seems like a worthwhile direction to explore.

See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093