Comment by jandrese
2 months ago
Yes, but are you going to special-case all of these pain points? The whole point of these LLMs is that they learn from training data, not from people coding logic directly. If you do this, people will come up with a dozen new ways in which the models fail, and they are really not hard to find; basically, anything novel is at risk of complete failure. The interesting bit is that LLMs tend to work best at "medium difficulty" problems: homework questions, implementing documented APIs, things like that. Asking them to do anything completely novel tends to fail, as does asking them to do something so trivial that normal humans wouldn't even bother writing it down.
It makes sense when users ask for information that isn't available in the tokenized values, though. In the abstract, a tool that changes the tokenization of certain context contents when a prompt references those contents is probably necessary to solve this issue (if you consider it worth solving).
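As a rough sketch of what such a preprocessor could look like (the trigger heuristic, the quoted-word pattern, and the helper name expand_for_character_queries are all made up for illustration, not any real tool):

    import re

    # Hypothetical trigger: the prompt seems to ask about the literal
    # characters of something (spelling, letter counts, etc.).
    CHARACTER_TRIGGERS = re.compile(
        r"\b(spell|letters?|characters?|how many [a-z]'?s)\b", re.IGNORECASE
    )

    def expand_for_character_queries(prompt: str) -> str:
        """If the prompt asks a character-level question, spell out any
        quoted word so that information survives tokenization."""
        if not CHARACTER_TRIGGERS.search(prompt):
            return prompt  # nothing character-level being asked; leave it alone

        def spell_out(match: re.Match) -> str:
            word = match.group(1)
            # '"strawberry"' -> '"strawberry" (s-t-r-a-w-b-e-r-r-y)'
            return f'"{word}" ({"-".join(word)})'

        # Rewrite each double-quoted word so the model also sees its letters.
        return re.sub(r'"([A-Za-z]+)"', spell_out, prompt)

    print(expand_for_character_queries('How many r\'s are in "strawberry"?'))
    # -> How many r's are in "strawberry" (s-t-r-a-w-b-e-r-r-y)?

The point is just that the individual letters become their own tokens, instead of the model having to infer them from a couple of opaque token IDs.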
It's a fool's errand. The kinds of problems you end up coding for are the ones that are blatantly obvious and ultimately useless except as a gotcha for the AI engines. All you're doing is papering over the deficiency of the model without actually solving a problem.
This is less a deficiency of the model and more a deficiency of the encoder, IMO. You can consider the encoder part of the model, but I think the semantics of our conversation require differentiating between the two.
Tokenization is an inherent weakness of current LLM design, so it makes sense to compensate for it. Hopefully some day tokenization will no longer be necessary.
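For anyone who hasn't poked at this directly, a quick look at what the model actually receives makes the weakness concrete (a minimal sketch assuming the tiktoken package is installed; cl100k_base is just one example encoding):

    import tiktoken  # assumes the tiktoken package is installed

    # The model never sees the letters of "strawberry"; it sees a few opaque
    # token IDs covering multi-character chunks, which is why character-level
    # questions tend to trip it up.
    enc = tiktoken.get_encoding("cl100k_base")  # one example encoding
    tokens = enc.encode("strawberry")
    print(tokens)  # a short list of integer IDs, not ten letters
    print([enc.decode_single_token_bytes(t) for t in tokens])  # multi-letter chunks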