
Comment by minimaxir

6 years ago

A neat trick I found while working with GPT-2 is that byte-pair encoding is, in itself, a compression method. With Hugging Face Transformers, encoding/decoding this way is very fast.

I've implemented this approach in my aitextgen package (https://github.com/minimaxir/aitextgen/blob/master/aitextgen...) to encode massive input datasets as a uint16 NumPy array; when gzipped on disk, it's about 1/10th the size of the original dataset.
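
A minimal sketch of that pipeline (not the aitextgen code itself, and `input.txt` is just a hypothetical filename), assuming the stock GPT-2 tokenizer whose vocab of 50,257 IDs fits in a uint16:

```python
import gzip
import numpy as np
from transformers import GPT2TokenizerFast

# Load the GPT-2 byte-pair-encoding tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Encode the text to BPE token IDs and store them as a compact uint16 array.
ids = np.array(tokenizer.encode(text), dtype=np.uint16)

# gzip the raw token bytes on disk; BPE plus gzip is what gives the ~10x savings.
with gzip.open("input.tokens.gz", "wb") as f:
    f.write(ids.tobytes())

# To restore the original text, reverse the steps.
with gzip.open("input.tokens.gz", "rb") as f:
    restored_ids = np.frombuffer(f.read(), dtype=np.uint16)
restored_text = tokenizer.decode(restored_ids.tolist())
```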

However, the technique in this submission gets compression to about 1/10 without the gzipping. Hmm.

This is really just a way to show how good GPT-2 is at predicting text. If you know anything about information theory, you'll know that the entropy of the information source places a hard limit on how much it can be compressed. If GPT-2 is really good at predicting English text, then its cross-entropy on English text should be very close to the true entropy of English. Thus, using GPT-2's predictions to drive an adaptive source encoder will achieve compression ratios that approach the information content (entropy) of English text.
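
For intuition, here is a rough sketch (my own illustration, not the submission's code) of that bound: the model's average cross-entropy per token is exactly the code length an ideal arithmetic coder driven by GPT-2's next-token probabilities would spend, so you can estimate the achievable compressed size without building the coder at all. The sample sentence is arbitrary.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

text = "Compression is prediction: the better the model, the fewer bits per token."

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode(text, return_tensors="pt")

with torch.no_grad():
    # loss is the average cross-entropy (in nats) of predicting each token from
    # the ones before it -- the per-token code length of an ideal entropy coder
    # that uses GPT-2's predicted distribution.
    loss = model(ids, labels=ids).loss.item()

n_tokens = ids.shape[1]
bits_per_token = loss / math.log(2)
total_bits = bits_per_token * (n_tokens - 1)  # the first token has no prediction
raw_bits = len(text.encode("utf-8")) * 8

print(f"{bits_per_token:.2f} bits/token; "
      f"estimated {total_bits / raw_bits:.0%} of the raw UTF-8 size")
```

The better the model predicts the text, the lower that cross-entropy, and the closer the ratio gets to the true entropy of the source.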