Comment by layer8

1 year ago

> I’m not sure that training data about that would be required. Shouldn’t the model be able to recognize that `["re", "cogn", "ize"]` represents the same text as `recognize`, assuming those are tokens in the model?

> More generally, would you say that LLMs are generally unable to reason about sequences of items (not necessarily tokens) and compare them against some definition of “valid” sequences that would arise from the training corpus?

No. Inside the model, tokens are arbitrary numbers: a token’s ID carries no information about the characters it stands for. But if you consider a sentence to be a sequence of words, you can say that LLMs are quite competent at reasoning about those sequences.
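
A quick way to see this is to look at raw token IDs. Here’s a minimal sketch using OpenAI’s tiktoken library (the choice of the `cl100k_base` encoding is an assumption; other models use other vocabularies):

```python
# Minimal sketch: token IDs are opaque integers with no inherent
# relationship to the characters they represent.
# Assumes the cl100k_base encoding; other models use other vocabularies.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("recognize")
print(ids)  # a short list of plain integers

# Decoding each ID separately shows the text fragment it stands for;
# the model itself never sees these strings, only the integers.
print([enc.decode([i]) for i in ids])
```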

  • ChatGPT is able to spell the word "recognize" when asked.

    So it is able to take a sequence of tokens `["recogn", "ize"]` and transform it into a sequence of tokens `[" R", " E", " C", " O", " G", " N", " I", " Z", " E"]` (see the sketch after this list).
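
For illustration, here is roughly what that transformation amounts to at the token level. This is only a sketch of the two token sequences involved (again assuming tiktoken and `cl100k_base`; the exact splits depend on the tokenizer, and the model learns the mapping between them from data):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding

# The word arrives as a short sequence of opaque IDs...
word_ids = enc.encode("recognize")
word = enc.decode(word_ids)                  # back to the string "recognize"

# ...and spelling it out yields a longer sequence, roughly one
# token per letter (each " X" is typically its own token here).
spelled = " " + " ".join(word.upper())       # " R E C O G N I Z E"
letter_ids = enc.encode(spelled)
print([enc.decode([i]) for i in letter_ids])
```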