
Comment by kingstnap

2 months ago

If you think about the architecture, how is a decoder-only transformer supposed to count? It is not magic. The weights must implement some algorithm.

Take a task where a long paragraph contains the word "blueberry" multiple times, and at the end, a question asks how many times "blueberry" appears. If you tried to solve this in one shot by attending to every "blueberry," the attention head would return a softmax-weighted average of the value vectors for the matching keys, and an average is essentially invariant to how many items it averages over. That makes it useless for counting.
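To make the averaging problem concrete, here is a toy NumPy sketch (not any model's actual weights, and the score values are made up for illustration): a query attends hard to every "blueberry" key, and the read-out comes back nearly identical whether there are 2 matches or 5.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(scores, values):
    # Standard attention read-out: softmax-weighted average of value vectors.
    return softmax(scores) @ values

d = 4
v_blue = np.ones(d)       # value vector contributed by each "blueberry" token
v_other = np.zeros(d)     # value vector for every other token

outs = {}
for n_blue in (2, 5):
    # Query matches "blueberry" keys strongly (+10) and everything else weakly (-10).
    scores = np.array([10.0] * n_blue + [-10.0] * 20)
    values = np.vstack([v_blue] * n_blue + [v_other] * 20)
    outs[n_blue] = attend(scores, values)

# The attended output is (almost exactly) the same for 2 and 5 occurrences:
# the softmax average washes out the count.
print(outs[2], outs[5])
```

The output vector is ≈ [1, 1, 1, 1] in both cases, so a single attention read-out cannot distinguish the two counts.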

To count, attention (the QKV mechanism, the only source of horizontal information flow across positions) would need to accumulate a value across tokens. But since the question is only appended at the end, the model would have to decide in advance to accumulate "blueberry" counts and store them in the KV cache. That would require layer-wise accumulation, likely via some form of tree reduction.
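As a rough sketch of what layer-wise accumulation would cost (this is an abstract model of the idea, not a claim about how any trained transformer actually does it): if each layer can merge partial counts pairwise, a tree reduction over n positions needs about log2(n) layers.

```python
import math

def layers_needed(n_tokens):
    # A pairwise tree reduction over n positions takes ceil(log2(n)) merge rounds,
    # i.e. roughly that many layers if one round happens per layer.
    return math.ceil(math.log2(n_tokens))

def tree_count(bits):
    # Toy tree reduction: per-token "is this blueberry?" indicator bits,
    # summed pairwise, one round per "layer".
    vals = list(bits)
    rounds = 0
    while len(vals) > 1:
        vals = [vals[i] + (vals[i + 1] if i + 1 < len(vals) else 0)
                for i in range(0, len(vals), 2)]
        rounds += 1
    return vals[0], rounds

count, rounds = tree_count([1, 0, 1, 1, 0, 0, 1, 0])  # 4 matches among 8 tokens
print(count, rounds)  # 4 matches, counted in 3 rounds
```

So even in the best case, the model would need to dedicate several layers' worth of circuitry to maintain one running count, before the question asking for it has even appeared.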

Even then, why would the model maintain this running count for every possible question it might be asked? The potential number of such questions is effectively limitless.