Comment by jkhdigital

6 years ago

Yes, this is why I said "basically". Because the GPT-2 token vocabulary is not prefix-free, the same string can be segmented into tokens in more than one way, which can be a problem for arithmetic coding: the encoder and decoder must agree on a single token sequence. That said, I've found that "greedy" (longest-match) parsing almost never fails in practice.

So yes, there are ways to work around this, but the non-prefix-free vocabulary seems like the simplest explanation for why unusual words break the encoder.
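
For concreteness, here's a minimal sketch of the failure mode, using a hypothetical toy vocabulary rather than the actual GPT-2 BPE merges: greedy longest-match parsing over a non-prefix-free token set can pick a segmentation that differs from the canonical one, so the decoder would step the arithmetic coder through a different token sequence than the encoder did.

```python
def greedy_parse(text, vocab):
    """Repeatedly take the longest token in `vocab` that is a prefix of `text`."""
    tokens = []
    while text:
        match = next(
            (text[:k] for k in range(len(text), 0, -1) if text[:k] in vocab),
            None,
        )
        if match is None:
            return None  # no token covers the next character at all
        tokens.append(match)
        text = text[len(match):]
    return tokens

# Hypothetical toy vocabulary; the real GPT-2 vocab has ~50k byte-level tokens.
vocab = {"a", "b", "c", "ab", "bc"}

# Greedy grabs "ab" first, forcing the parse ["ab", "c"], even if the
# model's canonical segmentation of "abc" were ["a", "bc"]. In a
# prefix-free vocabulary this ambiguity could not arise, since at most
# one token would match at each position.
print(greedy_parse("abc", vocab))  # ['ab', 'c']
```

On common text the greedy parse and the canonical BPE segmentation almost always coincide, which matches my experience; it's rare words, where the merge order produces a non-obvious split, that expose the mismatch.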