Comment by mrbungie
1 day ago
Was this "paper" eventually peer reviewed?
PS: I know it is interesting and I don't doubt Anthropic, but I find it fascinating that they get such a pass in science.
This is more of an article describing their methodology than a full paper. But yes, there are plenty of peer-reviewed papers on this topic: scaling sparse autoencoders to produce interpretable features for large models.
There have been a ton of peer-reviewed papers on SAEs in the past two years, some of them presented at major conferences; a minimal sketch of the core idea follows the examples below.
For example: "Sparse Autoencoders Find Highly Interpretable Features in Language Models" https://proceedings.iclr.cc/paper_files/paper/2024/file/1fa1...
"Scaling and evaluating sparse autoencoders" https://iclr.cc/virtual/2025/poster/28040
"Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning" https://proceedings.neurips.cc/paper_files/paper/2024/hash/c...
"Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" https://aclanthology.org/2024.blackboxnlp-1.19.pdf
Modern ML is old school mad science.
The lifeblood of the field is proof-of-concept pre-prints built on top of other proof-of-concept pre-prints.
Sounds like you agree this “evidence” lacks any semblance of scientific rigor?
(Not GP) There was a well-recognized reproducibility problem in the ML field before LLM-mania, and that's considering published papers with proper peer review. The current state of affairs is in some ways even less rigorous than that, and some people in the field feel free to overextend their conclusions into other fields like neuroscience.
Frankly, I don't see a reason to give a shit.
We're in the "mad science" regime because, at the current speed of progress, adding rigor would sacrifice velocity. Preprints are the lifeblood of the field because they can go out earlier and start contributing earlier.
Anthropic, much as you hate them, has some of the best mechanistic interpretability researchers and AI wranglers in the entire industry. When they find things, they find things. Your "not scientifically rigorous" is just a flimsy excuse to dismiss findings that make you deeply uncomfortable.