It's also literally factually incorrect. Pretty much the entire field of mechanistic interpretability would obviously point out that models have an internal definition of what a bug is.
Here's the most approachable paper that shows a real model (Claude 3 Sonnet) clearly having an internal representation of bugs in code: https://transformer-circuits.pub/2024/scaling-monosemanticit...
Read the entire section around this quote:
> Thus, we concluded that 1M/1013764 represents a broad variety of errors in code.
(Also the section after "We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions")
This feature fires on actual bugs; it's not just the model pattern-matching on "what a bug hunter may say next".
Was this "paper" eventually peer reviewed?
PS: I know it is interesting and I don't doubt Anthropic, but I find it fascinating that they get such a pass in science.
This is more of an article describing their methodology than a full paper. But yes, there are plenty of peer-reviewed papers on this topic: scaling sparse autoencoders to produce interpretable features for large models (a minimal sketch of the idea follows the paper list below).
There's been a ton of peer-reviewed work on SAEs in the past two years, some of it presented at conferences.
For example: "Sparse Autoencoders Find Highly Interpretable Features in Language Models" https://proceedings.iclr.cc/paper_files/paper/2024/file/1fa1...
"Scaling and evaluating sparse autoencoders" https://iclr.cc/virtual/2025/poster/28040
"Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning" https://proceedings.neurips.cc/paper_files/paper/2024/hash/c...
"Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" https://aclanthology.org/2024.blackboxnlp-1.19.pdf
Modern ML is old school mad science.
The lifeblood of the field is proof-of-concept pre-prints built on top of other proof-of-concept pre-prints.
> This feature fires on actual bugs; it's not just the model pattern-matching on "what a bug hunter may say next".
You don't think a pattern matcher would fire on actual bugs?
Mechanistic interpretability is a joke, supported entirely by non-peer-reviewed papers released as marketing material by AI firms.
Some people are still stuck in the “stochastic parrot” phase and see everything regarding LLMs through that lens.
Current LLMs do not think. Just because the repetitive actions a model loops through get described in anthropomorphic terms does not mean it is truly thinking or reasoning.
On the flip side, the idea that this is true has been a very successful indirect marketing campaign.
What does “truly thinking or reasoning” even mean for you?
I don’t think we even have a coherent definition of human intelligence, let alone of non-human ones.