Comment by cainxinth
1 day ago
Is it fair to say that the “Rs in strawberry problem” will not be “cleanly” solved unless we advance beyond tokenization?
> Is it fair to say that the “Rs in strawberry problem” will not be “cleanly” solved unless we advance beyond tokenization?
I don't see why that follows.
The “Rs in strawberry problem” is literally "count the occurrences of R" in the word "strawberry".
One could argue that the learned tokenization, under which the word is split into 3 tokens (see https://platform.openai.com/tokenizer), is the problem, but one solution is to double down on it and learn the tokenization as part of the end-to-end training instead of separately.
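For concreteness, here's a minimal sketch using OpenAI's tiktoken library showing the fragmentation (the exact split depends on the encoding; "cl100k_base" is just one choice):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    pieces = [enc.decode_single_token_bytes(t).decode() for t in tokens]
    print(tokens)   # the token ids the model actually sees
    print(pieces)   # the subword pieces those ids stand for

    # Counting over the raw string is trivial; the model just never sees it:
    print("strawberry".count("r"))  # 3

The model operates on the ids, never on the letters, which is why the count is hard for it.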
If you mean that the current approach of keeping the tokenization model entirely fixed is the problem, then I agree.
(I'm not entirely sure how multi-modal models function in this regard - they must have some notion of the bytestream, but I'm not familiar enough with that to comment intelligently.)
I think tokenization is probably not going anywhere, but higher layers need the ability to inspect 'raw' data on demand. You don't spell out most words as you read them, but you can bring the focus of your entire mind to the spelling of the word strawberry if you so choose. Models need that ability as well.
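Something like a "spell out on demand" primitive would do it. Sketched here as a hypothetical tool the model could invoke (spell_out is an illustrative name, not an existing API), it just decodes a token span back into its character stream:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def spell_out(token_ids):
        """Hypothetical tool: expand token ids back into raw characters."""
        return list(enc.decode(token_ids))

    print(spell_out(enc.encode("strawberry")))
    # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']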
Couldn’t this be solved by replacing the tokenized input with a model that outputs the tokenization and then training the entire thing as one larger model? The goal would be to make tokenization a function of the model input.
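A minimal PyTorch sketch of that idea, in the spirit of byte-level approaches like CANINE or Charformer (module names and sizes here are illustrative, not any particular paper's architecture): a small module embeds raw bytes and learns to compress them into a shorter sequence of "tokens", and the whole thing trains end to end with the transformer above it.

    import torch
    import torch.nn as nn

    class LearnedTokenizer(nn.Module):
        def __init__(self, d_model=256, downsample=4):
            super().__init__()
            self.byte_embed = nn.Embedding(256, d_model)  # one vector per byte value
            # Strided convolution compresses the byte sequence ~4x, playing
            # the role of tokenization with learned, differentiable mixing.
            self.pool = nn.Conv1d(d_model, d_model,
                                  kernel_size=downsample, stride=downsample)

        def forward(self, byte_ids):              # (batch, seq_len) byte values
            x = self.byte_embed(byte_ids)         # (batch, seq_len, d_model)
            x = self.pool(x.transpose(1, 2))      # (batch, d_model, seq_len/4)
            return x.transpose(1, 2)              # "tokens" for the transformer

    tok = LearnedTokenizer()
    raw = torch.tensor([list(b"strawberry is red, very red, ok ")])  # raw bytes in
    print(tok(raw).shape)  # torch.Size([1, 8, 256])

The key property is that gradients flow through the tokenizer, so token boundaries adapt to the task instead of being fixed up front.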