Comment by jkhdigital
6 years ago
For anyone who doesn't get why this would happen: GPT-2 basically outputs a probability distribution for its guess of the next word, and then the encoder uses these distributions to perform arithmetic coding adaptively. If the next word in the source text is not actually present anywhere in the output distribution, the encoder cannot encode it.
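To make that concrete, here is a minimal sketch of how an arithmetic coder can ride on a model's next-token distribution. `toy_model` is a made-up stand-in for a GPT-2 forward pass, and the interval arithmetic is simplified for illustration; the actual encoder's interfaces will differ.

```python
def toy_model(context):
    """Return a next-token distribution given the tokens seen so far.
    (Hypothetical stand-in for GPT-2's softmax output.)"""
    if context and context[-1] == "the":
        return {"cat": 0.5, "dog": 0.4, "the": 0.1}
    return {"the": 0.6, "cat": 0.25, "dog": 0.15}

def encode(tokens):
    """Narrow the interval [low, high) one token at a time."""
    low, high = 0.0, 1.0
    context = []
    for tok in tokens:
        dist = toy_model(context)
        if tok not in dist or dist[tok] == 0.0:
            # This is the failure mode described above: a token the model
            # assigns no probability cannot be represented in the interval.
            raise ValueError(f"token {tok!r} not in model distribution")
        span = high - low
        cum = 0.0
        for t, p in dist.items():
            if t == tok:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
        context.append(tok)
    # Any number inside the final interval identifies the message.
    return (low + high) / 2

print(encode(["the", "cat"]))        # works: both tokens have probability mass
# encode(["the", "zyzzyva"])         # raises ValueError: unseen token
```

The decoder would rerun the same model and the same interval arithmetic, so both sides need to enumerate the distribution in the same deterministic order.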
I may be wrong, but I thought GPT-2 could also output partial words/syllables (for unknown words), or individual letters if they don't make a syllable.
A simple way to achieve that is to have an encoding dictionary of words, then append syllable-like fragments such as "sh", and finally the individual letters "a", "b", "c", and so on. When tokenizing, prefer a whole word; if you can't find one, split into syllables, and failing that, into individual letters (see the sketch below). That has the benefit that any ASCII string can go through the system.
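A rough sketch of that fallback scheme: a vocabulary of whole words plus shorter pieces and every single printable character, matched greedily longest-first so any ASCII string can be broken down. The vocabulary here is made up for illustration; GPT-2 itself uses a learned BPE vocabulary.

```python
# Words, a few subword fragments, and all printable ASCII characters.
VOCAB = {"hello", "world", "sh", "ing", "th"} | {chr(c) for c in range(32, 127)}

def greedy_tokenize(text, vocab=VOCAB, max_len=16):
    """Repeatedly take the longest vocabulary entry that prefixes the remaining text."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Every printable ASCII character is in the vocab, so this only
            # happens for characters outside that range.
            raise ValueError(f"cannot tokenize {text[i]!r}")
    return tokens

print(greedy_tokenize("hello shqx"))  # ['hello', ' ', 'sh', 'q', 'x']
```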
Yes, this is why I said "basically". The fact that GPT-2 tokens are not necessarily prefix-free can be a problem for arithmetic coding, but I've found that "greedy" parsing almost never fails in practice.
So yes, there are ways to work around this, but a word missing from the output distribution still seems like the simplest explanation for why unusual words break the encoder.