Comment by shawnz

2 days ago

I can't imagine they consider it a bug, it is a common and beneficial property of essentially every LLM today. You want to be able to represent common words with single tokens for efficiency, but at the same time you still need to be able to represent prefixes of those words in the cases where they occur separately

I find this surprising, but I suppose it must be more efficient overall.

Presumably parsing text into tokens is done in some deterministic way. If it is done by greedily taking the longest-matching prefix that is a token, then when generating text it should be possible to "enrich" tokens that are prefixes of other tokens with additional constraints to force a unique parse: E.g., if "e" is a token but "en" is too, then after generating "e" you must never generate a token that begins with "n". A text generated this way can be deterministically parsed by the greedy parser.

Alternatively, it would suffice to restrict to a subset of tokens that are a prefix code. This would be simpler, but with lower coding efficiency.

  • Regarding the first part: that's an interesting idea, although I worry it would bias the outputs in an unrealistic way. Then again, maybe it would only impact scenarios that would have otherwise been unparsable anyway?

    Regarding the second part: you'd effectively just be limiting yourself to single character tokens in that case which would drastically impact the LLM's output quality

    • The first approach would only affect outputs that would have been otherwise unparseable.

      The second approach works with any subset of tokens that form a prefix code -- you effectively set the probability of all tokens outside this subset to zero (and rescale the remaining probabilities if necessary). In practice you would want to choose a large subset, which means you almost certainly want to avoid choosing any single-character tokens, since they can't coexist with tokens beginning with that character. (Choosing a largest-possible such subset sounds like an interesting subproblem to me.)

      3 replies →