Weight-sparse transformers have interpretable circuits [pdf]

14 days ago (cdn.openai.com)

This ties directly into the superposition hypothesis.

It is believed dense models cram many features into shared weights, making circuits hard to interpret.

Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.
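
A toy way to see it (purely illustrative, not from the paper): pack more features than dimensions into a layer and reading one feature back out picks up interference from the others; give each feature its own neuron and the readout is clean.

    import numpy as np

    rng = np.random.default_rng(0)

    # Superposition: 6 hypothetical "features" squeezed into 3 dimensions as unit directions.
    W_dense = rng.normal(size=(6, 3))
    W_dense /= np.linalg.norm(W_dense, axis=1, keepdims=True)

    # One feature per neuron: 6 features in 6 dimensions, no sharing.
    W_sparse = np.eye(6)

    x = np.zeros(6)
    x[0] = 1.0  # only feature 0 is active

    h_dense = x @ W_dense                 # 3-dim activation
    h_sparse = x @ W_sparse               # 6-dim activation

    readout_dense = W_dense @ h_dense     # estimates of all features, polluted by overlap
    readout_sparse = W_sparse @ h_sparse  # exact recovery

    print("dense readout: ", np.round(readout_dense, 2))   # nonzero "ghost" features
    print("sparse readout:", np.round(readout_sparse, 2))  # clean one-hot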

  • Yes, although the sparsity doesn't need to be inherent to the model - another approach is to try to decode the learned weights using approaches like sparse auto-encoders or transcoders.

    https://transformer-circuits.pub/2025/attribution-graphs/met...
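
    For anyone unfamiliar, the core of an SAE is tiny. A minimal sketch in PyTorch (sizes and the L1 coefficient are made up; real setups add normalized/tied decoders, auxiliary losses, etc.):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class SparseAutoencoder(nn.Module):
          """Minimal SAE: overcomplete dictionary plus L1 sparsity on the codes."""
          def __init__(self, d_model=512, d_hidden=4096):
              super().__init__()
              self.enc = nn.Linear(d_model, d_hidden)
              self.dec = nn.Linear(d_hidden, d_model)

          def forward(self, acts):
              codes = F.relu(self.enc(acts))   # sparse feature activations
              recon = self.dec(codes)          # reconstruction of the original activations
              return recon, codes

      sae = SparseAutoencoder()
      acts = torch.randn(64, 512)              # stand-in for residual-stream activations
      recon, codes = sae(acts)
      loss = F.mse_loss(recon, acts) + 1e-3 * codes.abs().mean()  # reconstruction + sparsity
      loss.backward()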

    • I'm also very excited about SAE/Transcoder based approaches! I think the big tradeoff is that our approach (circuit sparsity) is aiming for a full complete understanding at any cost, whereas Anthropic's Attribution Graph approach is more immediately applicable to frontier models, but gives handwavier circuits. It turns out "any cost" is really quite a lot of cost - we think this cost can be reduced a lot with further research, but it means our main results are on very small models, and the path to applying any of this to frontier models involves a lot more research risk. So if accepting a bit of handwaviness lets us immediately do useful things on frontier models, this seems like a worthwhile direction to explore.

      See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093

I find this fascinating, as it raises the possibility of a single framework that can unify neural and symbolic computation by "defuzzing" activations into what are effectively symbols. Has anyone looked at the possibility of going the other way, by fuzzifying logical computation?

We really need new hardware optimized for sparse compute. Deep Learning models would work way better with much higher dimensional sparse vectors, but current hardware only excels at dense GEMMs and structured sparsity.
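
To make the gap concrete, a rough sketch (sizes and density invented): unstructured sparsity wins big on paper-FLOPs, but the irregular memory access is exactly what current accelerators handle poorly.

    import numpy as np
    from scipy import sparse

    d = 16384                  # hypothetical "much higher dimensional" layer width
    density = 0.01             # 99% of the weights are zero, unstructured

    W = sparse.random(d, d, density=density, format="csr", dtype=np.float32)
    x = np.random.randn(d).astype(np.float32)

    y = W @ x                  # CSR mat-vec: only touches the non-zero weights

    dense_flops = 2 * d * d
    sparse_flops = 2 * W.nnz
    print(f"dense FLOPs:  {dense_flops:.2e}")
    print(f"sparse FLOPs: {sparse_flops:.2e}  (~{dense_flops / sparse_flops:.0f}x fewer)")
    # ~100x fewer FLOPs on paper, but on GPUs the scattered memory accesses
    # eat most of that win, which is the hardware gap in question.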

  • For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.

    • yeah it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most midwestern states.

      Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.

  • Yes! I've been advocating for it inside the industry for a decade, but it is an uphill battle. Researchers can't easily publish that kind of work (even Google researchers) because they don't have hardware that can realistically train decently large models. The hardware companies don't want to take the risk of rethinking the CPU or accelerator architecture for sparse compute because there are no large existing customers.

  • There also needs to be tooling that can author that code!

    I'm starting to dust off some ideas I developed over a decade ago to build such a toolkit. Recently realized “egads, my stuff can express almost every major GPU / CPU optimization that’s relevant for modern deep learning… need to do a new version with an eye towards adoption in that area”. Plus every flavor of sparse.

    Also need to figure out if some of the open-core ideas I have in mind would be attractive to early-stage investors who focus on the so-called deep tech end of the space. Definitely looks like I'll have to do ye olde “ask friends and acquaintances if they can point me to those folks” approach, since cold outreach has historically been full of fail.

  • > Deep Learning models would work way better with much higher dimensional sparse vectors

    Citations?

    • There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by searching for sparse training or lottery ticket hypothesis papers.

      The intuition is that ANNs make better predictions on high-dimensional data, that sparse weights let you train the sparsity pattern along with the weights, that the effective part of dense models is actually sparse (cf. pruning/sparsification research), and that dense models grow too quickly in compute complexity to keep increasing model dimension sizes.
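
      The lottery-ticket-style recipe those papers build on is short enough to sketch (simplified; details like rewinding to early-training rather than initial weights vary by paper):

        import torch
        import torch.nn as nn

        def magnitude_prune(model, keep_frac=0.2):
            """Keep only the largest-magnitude weights (global, unstructured)."""
            weights = torch.cat([p.abs().flatten()
                                 for p in model.parameters() if p.dim() > 1])
            threshold = torch.quantile(weights, 1 - keep_frac)
            masks = {}
            for name, p in model.named_parameters():
                if p.dim() > 1:                       # prune weight matrices, not biases
                    masks[name] = (p.abs() >= threshold).float()
                    p.data *= masks[name]             # zero out the pruned weights
            return masks

        # Schematic lottery-ticket loop: train, prune, rewind survivors, retrain.
        model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
        init_state = {k: v.clone() for k, v in model.state_dict().items()}

        # ... train(model) ...                        # full training pass would go here
        masks = magnitude_prune(model, keep_frac=0.2)
        model.load_state_dict(init_state)             # rewind surviving weights
        for name, p in model.named_parameters():
            if name in masks:
                p.data *= masks[name]                 # re-apply the sparsity mask
        # ... train(model) again, keeping the masks fixed ...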

  • My last dive into matrix computations was years ago, but the need was the same back then. We could sparsify matrices pretty easily, but the infrastructure was lacking. Some things never change.

>"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits.

> Sparse circuits are defined as a set of nodes connected by edges."

...which could also be considered/viewed as Graphs...
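
...and treating them as graphs makes the pruned circuits easy to poke at with standard graph tooling. A toy sketch (node names entirely made up, not from the paper):

    import networkx as nx

    # Hypothetical pruned circuit: nodes are neurons / residual channels,
    # edges are the surviving non-zero weights between them.
    circuit = nx.DiGraph()
    circuit.add_edge("resid_ch_17", "mlp0_neuron_4", weight=0.9)
    circuit.add_edge("mlp0_neuron_4", "resid_ch_203", weight=-1.2)
    circuit.add_edge("resid_ch_203", "logit_quote", weight=2.1)

    print(nx.is_directed_acyclic_graph(circuit))  # True: a feed-forward circuit
    print(list(nx.topological_sort(circuit)))     # evaluation order through the circuit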

(Then from earlier in the paper):

>"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them.

And (jumping around a bit more in the paper):

>"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network."

A very interesting paper -- and a very interesting postulated potential relationship with superposition! (which also could be related to data compression... and if so, in turn, by relationship, potentially entropy as well...)

Anyway, great paper!

I worked on a similar problem about a year ago, on large dense models.

https://www.lesswrong.com/posts/PkeB4TLxgaNnSmddg/scaling-sp...

In both cases, the goal is to actually learn a concrete circuit inside a network that solves specific Python next-token prediction tasks. We each end up with a crisp wiring diagram saying “these are the channels/neurons/heads that implement this particular bit of Python reasoning.”

Both projects cast circuit discovery as a gradient-based selection problem over a fixed base model. We train a mask that picks out a sparse subset of computational nodes as “the circuit,” while the rest are ablated. Their work learns masks over a weight-sparse transformer; ours learns masks over SAE latents and residual channels. But in both cases, the key move is the same: use gradients to optimize which nodes are included, rather than relying purely on heuristic search or attribution patching. Both approaches also use a gradual hardening schedule (continuous masks that are annealed or sharpened over time) so that we can keep gradients useful early on, then spend extra compute to push the mask towards a discrete, minimal circuit that still reproduces the model’s behavior.
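
Concretely, the shared recipe looks roughly like this (a simplified sketch of the general idea, not either project's actual code; the sigmoid-with-temperature parametrization is just one common choice for the hardening schedule):

    import torch
    import torch.nn.functional as F

    n_nodes = 1024                        # candidate nodes: SAE latents, neurons, heads, ...
    logits = torch.zeros(n_nodes, requires_grad=True)    # learnable mask parameters
    opt = torch.optim.Adam([logits], lr=1e-2)

    def run_masked_model(mask):
        """Stand-in: run the frozen base model with node i scaled by mask[i]
        (unselected nodes ablated) and return the task loss on the behavior."""
        fake_acts = torch.randn(n_nodes)
        return F.mse_loss(mask * fake_acts, fake_acts)    # placeholder objective

    lam = 1e-3                                            # sparsity pressure
    for step in range(5000):
        temperature = max(0.05, 1.0 - step / 5000)        # gradual hardening schedule
        mask = torch.sigmoid(logits / temperature)        # soft in [0, 1], sharpens over time
        loss = run_masked_model(mask) + lam * mask.sum()  # reproduce behavior with few nodes
        opt.zero_grad()
        loss.backward()
        opt.step()

    circuit = (torch.sigmoid(logits) > 0.5).nonzero().squeeze(-1)  # final discrete node set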

The similarities extend to how we validate and stress-test the resulting circuits. In both projects, we drill down enough to notice “bugs” or quirks in the learned mechanism and to deliberately break it: by making simple, semantically small edits to the Python source, we can systematically cause the pruned circuit to fail and those failures generalize to the unpruned network. That gives us some confidence that we’re genuinely capturing the specific mechanism the model is using.

Related:

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning – https://arxiv.org/pdf/2505.17117 (LeCun/Jurafsky)

> Large Language Models (LLMs) demonstrate striking linguistic capabilities that suggest semantic understanding (Singh et al., 2024; Li et al., 2024). Yet, a critical question remains unanswered: Do LLMs navigate the compression-meaning trade-off similarly to humans, or do they employ fundamentally different representational strategies? This question matters because true understanding, which goes beyond surface-level mimicry, requires representations that balance statistical efficiency with semantic richness (Tversky, 1977; Rosch, 1973b).

> To address this question, we apply Rate-Distortion Theory (Shannon, 1948) and Information Bottleneck principles (Tishby et al., 2000) to systematically compare LLM and human conceptual structures. We digitize and release seminal cognitive psychology datasets (Rosch, 1973b; 1975; McCloskey & Glucksberg, 1978), which are foundational studies that shaped our understanding of human categorization but were previously unavailable in a machine-readable form. These benchmarks, comprising 1,049 items across 34 categories with both membership and typicality ratings, offer unprecedented empirical grounding for evaluating whether LLMs truly understand concepts as humans do. It also offers much better quality data than the current crowdsourcing paradigm.

From typicality tests in the paper above, we can jump to:

The Guppy Effect as Interference – https://arxiv.org/abs/1208.2362

> One can refer to the situation wherein people estimate the typicality of an exemplar of the concept combination as more extreme than it is for one of the constituent concepts in a conjunctive combination as overextension. One can refer to the situation wherein people estimate the typicality of the exemplar for the concept conjunction as higher than that of both constituent concepts as double overextension. We posit that overextension is not a violation of the classical logic of conjunction, but that it signals the emergence of a whole new concept. The aim of this paper is to model the Guppy Effect as an interference effect using a mathematical representation in a complex Hilbert space and the formalism of quantum theory to represent states and calculate probabilities. This builds on previous work that shows that Bell Inequalities are violated by concepts [7, 8] and in particular by concept combinations that exhibit the Guppy Effect [1, 2, 3, 9, 10], and add to the investigation of other approaches using interference effects in cognition [11, 12, 13].

And from quantum interference, we can jump to:

Quantum-like contextuality in large language models – https://royalsocietypublishing.org/doi/epdf/10.1098/rspa.202...

> This paper provides the first large-scale experimental evidence for contextuality in the large language model BERT. We constructed a linguistic schema modelled over a contextual quantum scenario, instantiated it in the Simple English Wikipedia, and extracted probability distributions for the instances. This led to the discovery of sheaf-contextual and CbD contextual instances. We prove that these contextual instances arise from semantically similar words by deriving an equation that relates degrees of contextuality to the Euclidean distance of BERT’s embedding vectors.

How can large language models become more human – https://discovery.ucl.ac.uk/id/eprint/10196296/1/2024.cmcl-1...

> Psycholinguistic experiments reveal that efficiency of human language use is founded on predictions at both syntactic and lexical levels. Previous models of human prediction exploiting LLMs have used an information theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path, and outperformed surprisal.

  • Most LM work implicitly uses surprisal = -log p(w | prefix) as the processing cost. But psycholinguistics keeps finding cases (garden-path sentences, etc.) where human difficulty is less about the next word being unlikely and more about how much of the current parse / interpretation has to be torn down and rebuilt. That’s essentially what Wang et al. formalize with their Incompatibility Fraction: they combine an LLM’s lexical predictions with a dependency parser, build a sheaf-style structure over prefixes, and measure how inconsistent the local parse distributions are with any single global structure. That incompatibility correlates with human reading times and distinguishes easy vs hard garden paths better than surprisal alone.

    If you take that seriously, you end up with a different "surprise" objective: not just "this token was unlikely", but "this token forced a big update of my latent structure". In information-theoretic terms, the distortion term in a Rate–Distortion / Information Bottleneck objective stops being pure log-loss and starts to look like a backtracking cost on your semantic/structural state.
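
    In pseudo-loss form (purely schematic; the real Incompatibility Fraction is computed from sheaf consistency over dependency-parse distributions, which I'm waving away into a placeholder here, and the parse labels below are made up):

      import torch
      import torch.nn.functional as F

      def structural_incompatibility(parse_before, parse_after):
          """Placeholder for a backtracking cost: how much of the current
          parse/interpretation gets torn down when the token arrives."""
          changed = sum(a != b for a, b in zip(parse_before, parse_after))
          return changed / len(parse_before)

      def processing_cost(logits, token_id, parse_before, parse_after, lam=1.0):
          surprisal = -F.log_softmax(logits, dim=-1)[token_id]   # classic -log p(w | prefix)
          backtrack = structural_incompatibility(parse_before, parse_after)
          return surprisal + lam * backtrack                      # "surprise + repair" cost

      # Garden-path flavour: the token itself isn't that unlikely, but it forces
      # most of the previous parse to be rebuilt, so lam * backtrack dominates.
      logits = torch.randn(50_000)
      cost = processing_cost(logits, token_id=123,
                             parse_before=["nsubj", "root", "dobj"],
                             parse_after=["nsubj", "csubj", "root"],
                             lam=2.0)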

    Now look at Shani et al.’s From Tokens to Thoughts paper: they compare LLM embeddings to classic human typicality/membership data (Rosch, Hampton, etc.) using RDT/IB, and show that LLMs sit in a regime of aggressive compression: broad categories line up with humans, but fine-grained typicality and "weird" members get squashed. Humans, by contrast, keep higher-entropy, messier categories – they "waste bits" to preserve contextual nuance and prototype structure.

    Quantum cognition folks like Aerts have been arguing for years that this messiness is not a bug: phenomena like the Guppy effect (where "guppy" is a so-so Pet and a so-so Fish but a very typical Pet-Fish) are better modelled as interference in a Hilbert space, i.e. as emergent concepts rather than classical intersections. Lo et al. then show that large LMs (BERT) already exhibit quantum-like contextuality in their probability distributions: thousands of sheaf-contextual and tens of millions of CbD-contextual instances, with the degree of contextuality tightly related to embedding distances between competing words.

    Put those together and you get an interesting picture:

    Current LMs do live in a contextual / interference-ish regime at the probabilistic level, but their embedding spaces are still optimized for pointwise predictive compression, not for minimizing re-interpretation cost over time.

    If you instead trained them under a "surprise = prediction error + structural backtracking cost" objective (something like log-loss + sheaf incompatibility over parses/meanings), the optimal representations wouldn’t be maximally compressed clusters. They’d be the ones that make structural updates cheap: more typed, factorized, role-sensitive latent spaces where meaning is explicitly organized for recomposition rather than for squeezing out every last bit of predictive efficiency.

    That’s exactly the intuition behind DisCoCat / categorical compositional distributional semantics: you force grammar and semantics to share a compact closed category, treat sentence meaning as a tensor contraction over typed word vectors, and design the embedding spaces so that composition is a simple linear map. You’re trading off fine-grained, context-specific "this token in this situation" information for a geometry that makes it cheap to build and rebuild structured meanings.
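
    The "composition as tensor contraction" part is easy to make concrete (dimensions and vectors invented; this is just the shape of the DisCoCat recipe, not a trained model):

      import numpy as np

      d_n, d_s = 64, 32                  # noun-space and sentence-space dimensions (made up)

      dogs = np.random.randn(d_n)        # nouns are vectors in the noun space
      cats = np.random.randn(d_n)
      chase = np.random.randn(d_n, d_s, d_n)   # transitive verb: subject x sentence x object tensor

      # "dogs chase cats": contract the verb tensor with its subject and object.
      sentence = np.einsum("i,isj,j->s", dogs, chase, cats)      # a vector in sentence space

      # Swapping the arguments ("cats chase dogs") yields a different sentence vector,
      # even though the bag of words is identical.
      reversed_sentence = np.einsum("i,isj,j->s", cats, chase, dogs)
      print(sentence.shape, np.allclose(sentence, reversed_sentence))   # (32,) False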

    Wang et al.’s Incompatibility Fraction is basically a first step toward such an objective, Shani et al. quantify how far LMs are from the "human" point on the compression–meaning trade-off, Aerts/Lo show that both humans and LMs already live in a quantum/contextual regime, and DisCoCat gives a concrete target for what "structured, recomposable embeddings" could look like. If we ever switch from optimizing pure cross-entropy to "how painful is it to revise my world-model when this token arrives?", I’d expect the learned representations to move away from super-compact clusters and towards something much closer to those typed, compositional spaces.