← Back to context

Comment by roenxi

2 days ago

> Anthropic found that it Claude will pretend that it used the "standard" way to do addition- add the digits, carry the 1, etc- but the pattern of activations showed it using a completely different algorithm.

That doesn't mean much; humans sometimes do the same thing. I recall a fun story about a mathematician with synesthesia multiplying numbers by mixing the colours together. With a bit of training such a person could also pretend to be executing a normal algorithm for the purposes of passing tests.

Even then the human doesn't know how they execute the algorithm, or mix the colours together - our conscious self-reflective mind has limits as to how far into our neural network weights it can reach. Can get further with lots of meditation, but it is still definitionally limited (in information theory terms).