Comment by zahlman

2 months ago

> The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.

You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?

> For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.

The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
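
Concretely (the token split below is hypothetical; a real BPE vocabulary would carve the string differently, but the arithmetic is the same):

```python
# Hypothetical per-token 't' counts; an exact answer requires
# summing exact per-token counts -- coincidence doesn't scale.
token_t_counts = {"correct": 1, " horse": 0, " battery": 2, " staple": 1}

total = sum(token_t_counts.values())
print(total)  # 4 -- the actual number of t's in the passphrase
```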

> You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?

Nothing of the sort. They're _capable_ of doing so. For something as simple as addition, you can even hand-craft weights that solve it exactly.
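
As a minimal sketch of that claim (the degenerate case, where the two addends arrive as raw numeric inputs rather than digit tokens): a single linear unit with weights hand-set to [1, 1] and zero bias computes a + b exactly, no training involved.

```python
import numpy as np

# Hand-crafted, not trained: weights [1, 1] and bias 0 make this
# one-unit linear layer compute a + b exactly.
W = np.array([[1.0], [1.0]])
b = np.array([0.0])

def add(pair):
    # pair has shape (batch, 2): the columns are the two addends
    return pair @ W + b

print(add(np.array([[3.0, 4.0]])))  # [[7.]]
```

Digit-tokenized inputs need a more elaborate construction, but the point stands: weights can encode addition exactly, not just approximate it statistically.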

> The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.

Yes? The architecture is capable both of mapping tokens to character counts and of addition, with a fraction of current models' parameter counts. It's not all that hard.
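
Putting the two pieces together, a toy sketch (the vocabulary, token ids, and token split are all hypothetical here): an embedding table stores each token's precomputed 't' count, and a sum over positions does the rest.

```python
import numpy as np

# Toy sketch under assumed values: the vocabulary, ids, and token
# split are hypothetical. One embedding column holds each token's
# precomputed count of the letter 't'; summing over positions gives
# the exact answer.
vocab = {"correct": 0, " horse": 1, " battery": 2, " staple": 3}

E = np.zeros((len(vocab), 1))   # vocab_size x 1 "count" embedding
E[vocab["correct"]]  = 1.0
E[vocab[" horse"]]   = 0.0
E[vocab[" battery"]] = 2.0
E[vocab[" staple"]]  = 1.0

token_ids = [0, 1, 2, 3]        # "correct horse battery staple"
per_token = E[token_ids]        # lookup: [[1.], [0.], [2.], [1.]]
print(per_token.sum())          # 4.0 -- exact by construction
```

The lookup costs one extra embedding column per letter you care about, on the order of vocab_size parameters, which is tiny next to current model sizes; the summation is the easy part.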