Comment by yorwba

1 day ago

This architecture does not allow later layers to directly query KV data from earlier layers. Each iteration of the loop uses the same layer parameters, so the KV data in later layers may well end up being the same, but only if the model stops changing it in response to other tokens in the context. Which is also something a traditional multi-layer transformer could do. (But might not end up doing due to lack of corresponding inductive bias.)

None of this helps with the strawberry problem, where the very first layer already gets a tokenized representation, so there is no layer that "actually perceives those Rs."

Is it fair to say that the “Rs in strawberry problem” will not be “cleanly” solved unless we advance beyond tokenization?

  • I don't see why that follows.

    The “Rs in strawberry problem” is literally "count the token R" in the word "strawberry".

    One could argue that the learnt tokenization, under which the word is split into 3 tokens (see https://platform.openai.com/tokenizer), is problematic, but one solution to that is to double down on it and learn tokenization as part of the end-to-end training instead of separately.

    If you mean that the current idea of the tokenization model being entirely fixed is the problem, then I agree.

    (I'm not entirely sure how multi-modal models function in this regard; they must have some notion of the bytestream, but I'm not familiar enough with them to comment intelligently.)

  • I think tokenization is probably not going anywhere, but higher layers need the ability to inspect 'raw' data on demand. You don't spell out most words as you read them, but you can bring the focus of your entire mind to the spelling of the word strawberry if you so choose. Models need that ability as well.

    • Couldn’t this be solved by replacing the tokenized input with a model that outputs the tokenization and then training the entire thing as one larger model? The goal would be to make tokenization a function of the model input.
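The tokenization mismatch discussed above can be sketched in a few lines. This is a toy illustration, not real model code: the 3-token split of "strawberry" follows what the OpenAI tokenizer page reports, and the token IDs are made up for the example.

```python
# Toy illustration: a model sees token IDs, not characters.
# Assumed split of "strawberry" (per the OpenAI tokenizer page):
tokens = ["str", "aw", "berry"]
vocab = {"str": 496, "aw": 672, "berry": 15717}  # hypothetical IDs

# What the model actually receives as input:
ids = [vocab[t] for t in tokens]
print(ids)  # nothing in these integers exposes the three 'r's

# Counting the Rs requires character-level access the model lacks:
r_count = sum(t.count("r") for t in tokens)
assert r_count == "strawberry".count("r") == 3
```

The point of the sketch: once the text is mapped to opaque IDs, the letter-level structure is only recoverable if the model has memorized each token's spelling, which is exactly the "inspect raw data on demand" ability argued for above.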