Comment by jychang
1 day ago
It's also literally, factually incorrect. Pretty much the entire field of mechanistic interpretability would point out that models do have an internal representation of what a bug is.
Here's the most approachable paper that shows a real model (Claude 3 Sonnet) clearly having an internal representation of bugs in code: https://transformer-circuits.pub/2024/scaling-monosemanticit...
Read the entire section around this quote:
> Thus, we concluded that 1M/1013764 represents a broad variety of errors in code.
(Also the section after "We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions")
This feature fires on actual bugs; it's not just the model pattern-matching on "what a bug hunter might say next".
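To make that concrete, the evidence in that section is essentially a comparison of the feature's activation on buggy versus corrected code. A rough sketch of that kind of probe (the SAE/model handles here are hypothetical stand-ins, not any public API; only the feature index comes from the write-up):

    import torch

    # 1M/1013764 is the "code error" feature index named in the Anthropic write-up.
    # `sae_encoder` (an nn.Linear taken from a trained SAE) and `collect_activations`
    # (run the model, grab residual-stream activations) are hypothetical stand-ins.
    FEATURE_IDX = 1013764

    def feature_activation(sae_encoder, acts, idx=FEATURE_IDX):
        """Strongest activation of one dictionary feature over a snippet's tokens."""
        feats = torch.relu(sae_encoder(acts))   # (tokens, n_features), sparse
        return feats[:, idx].max().item()

    # acts_buggy = collect_activations(model, "return sum(xs) / len(xs) + 1")  # off-by-one
    # acts_fixed = collect_activations(model, "return sum(xs) / len(xs)")
    # feature_activation(sae_encoder, acts_buggy)  vs  feature_activation(sae_encoder, acts_fixed)

The point being that the feature is read off the model's internal activations, not off surface text about bug hunting.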
Was this "paper" eventually peer reviewed?
PS: I know it is interesting and I don't doubt Anthropic, but I find it fascinating that they get such a pass in science.
This is more of an article describing their methodology than a full paper. But yes, there are plenty of peer-reviewed papers on this topic: scaling sparse autoencoders to produce interpretable features for large models.
There has been a ton of peer-reviewed SAE work in the past two years, much of it presented at major conferences. For example:
For example: "Sparse Autoencoders Find Highly Interpretable Features in Language Models" https://proceedings.iclr.cc/paper_files/paper/2024/file/1fa1...
"Scaling and evaluating sparse autoencoders" https://iclr.cc/virtual/2025/poster/28040
"Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning" https://proceedings.neurips.cc/paper_files/paper/2024/hash/c...
"Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" https://aclanthology.org/2024.blackboxnlp-1.19.pdf
Modern ML is old-school mad science.
The lifeblood of the field is proof-of-concept pre-prints built on top of other proof-of-concept pre-prints.
Sounds like you agree this “evidence” lacks any semblance of scientific rigor?
> This feature fires on actual bugs; it's not just the model pattern-matching on "what a bug hunter might say next".
You don't think a pattern matcher would fire on actual bugs?
Mechanistic interpretability is a joke, supported entirely by non-peer-reviewed papers released as marketing material by AI firms.