Comment by Antibabelic

3 days ago

Without a direct comparison to human internals (grounded in neurobiology, rather than intuition), it's hard to say how deep these similarities actually run, or whether they're simply a product of the transparency illusion (as Sydney Lamb defines it).

However, if you can point us to some specific reading on mechanistic interpretability that you think is relevant here, I would definitely appreciate it.

That's what I'm saying: there is no "direct comparison grounded in neurobiology" for most things, and for many things there simply can't be one, for the same reason you can't compare gears and springs to silicon circuits 1:1: the low-level components diverge too much.

Despite all that, the calculator and the arithmometer do the same things. If you can't go up an abstraction level and look past low-level implementation details, you'll remain blind to that fact forever.

Which papers to read depends on what you're interested in. There's a lot of research, ranging from weird LLM capabilities to the exact operation of reverse-engineered circuits.

  • There is no level of abstraction to go up sans context. Again, let me repeat myself as well: the calculator and the arithmometer do the same things -- from the point of view of the clerk who needs to add and subtract quickly. Otherwise they are simply two completely different objects, and we will have a hard time making correct inferences about how one works based only on how we know the other works, or, say, how calculating machines in general work.

    What I'm interested in is evidence supporting the claim that "The more you try to look into the LLM internals, the more similarities you find". Some pointers to specific books and papers would be very helpful.

    • > Otherwise they are simply two completely different objects.

      That's where you're wrong. Both objects reflect the same mathematical operations in their structure.

      Even if those were inscrutable alien artifacts to you, even if you knew nothing about who constructed them, how, or why: if you studied them, you would be able to see the similarities laid bare.

      Their inputs align, their outputs align. And if you dug deep enough? You would find components in them that correspond to the same mathematical operations -- even if the two are nothing alike in how exactly they implement them.

      LLMs and human brains are "inscrutable alien artifacts" to us. Both are created by inhuman optimization pressures. You need to study both to find out how they function. It's obvious, though, that their inputs align and their outputs align. And the more you dig into the internals? The more similarities you find.

      I recommend taking a look at Anthropic's papers on SAEs (sparse autoencoders), a method that essentially takes the population coding hypothesis and runs with it. It attempts to crack the neural code the LLM uses internally and pry interpretable features out of it. There are no "grandmother neurons" there, so you need elaborate methods to examine what kinds of representations an LLM can learn to recognize and use in its functioning.
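The core idea is small enough to sketch. Below is a toy sparse autoencoder, not Anthropic's actual setup: all the dimensions and weights here are made up, and a real SAE is trained on activations pulled from an actual model rather than random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # width of a (hypothetical) LLM activation vector
d_features = 64   # overcomplete feature dictionary: more features than dims

# Randomly initialized here; a real SAE trains these weights to minimize
# reconstruction error plus an L1 sparsity penalty on the feature codes.
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))

def encode(x):
    # ReLU leaves only a sparse subset of features active per input,
    # which is what makes individual features candidates for interpretation.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    return f @ W_dec

x = rng.normal(size=d_model)   # stand-in for one activation from an LLM
f = encode(x)                  # sparse feature activations
x_hat = decode(f)              # reconstruction of the original activation

# The objective a real SAE minimizes during training:
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

The sparsity penalty is what pushes each learned feature toward representing one recognizable concept instead of a superposition of many.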

      Anthropic's work is notable because they have managed not only to extract features that map to some amazingly high-level concepts, but also to demonstrate causality: interfering with the neuron populations mapped out by the SAE changes the LLM's behavior in predictable ways.
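The causal part can be sketched in the same toy setting. Because the decoder is linear, clamping one feature shifts the reconstructed activation along that feature's decoder direction by a predictable amount; in the real experiments the shifted activation is injected back into the model mid-forward-pass. Again, everything below is a made-up illustration, not the actual intervention code.

```python
import numpy as np

rng = np.random.default_rng(1)

d_model, d_features = 8, 32
W_dec = rng.normal(scale=0.5, size=(d_features, d_model))

def decode(f):
    return f @ W_dec

f = np.maximum(0.0, rng.normal(size=d_features))  # a sparse-ish feature code
baseline = decode(f)

# "Steering": clamp one feature to a high value and re-decode.
k, strength = 7, 10.0
f_steered = f.copy()
f_steered[k] = strength
steered = decode(f_steered)

# The shift is exactly (strength - f[k]) * W_dec[k]: a predictable,
# causal change in the downstream activation.
shift = steered - baseline
```

It's this predictability of the shift that lets you argue the feature causes the behavior, rather than merely correlating with it.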
