Comment by abecedarius

2 years ago

One thing many people are missing: the simple gzip-for-text-classification hack is not itself the contribution of this paper. (They cite the standard intro AI textbook for that hack.) The contribution is combining the gzip-derived distances with k-nearest-neighbors classification.

In section 6.2 they compare gzip-distance+kNN against gzip-distance on its own across four problems: the kNN variant was better on two and worse on the other two.
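For concreteness, here's a minimal sketch of that combination, assuming the usual normalized compression distance (NCD) formula with texts joined by a space, as in common implementations; the names ncd and knn_classify and the tiny training set are just illustrative:

    import gzip
    from collections import Counter

    def clen(s: str) -> int:
        """Length of s after gzip compression, in bytes."""
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        """Normalized compression distance between two texts."""
        ca, cb = clen(a), clen(b)
        cab = clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def knn_classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
        """Label query by majority vote among the k training texts nearest in NCD."""
        neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    train = [("spam spam spam buy now", "spam"),
             ("meeting at noon tomorrow", "ham")]
    # Shared vocabulary compresses well together, so the query should
    # land nearest the first example.
    print(knn_classify("buy spam now", train, k=1))

With k=1 this reduces to plain nearest-neighbor by gzip distance, which is one reading of "gzip-distance on its own" in that comparison.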

Another bit of background that I guess is worth stating: language models are pretrained with a compression objective. That is, the pretraining loss is the cross entropy of the input text, which amounts to "minimize the compressed length of this input if it were fed to this LM driving an arithmetic coder".
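A toy sketch of that identity, with made-up per-token probabilities standing in for an LM's predictions: the sum of -log2 p over tokens is both the cross-entropy loss measured in bits and the output length of an ideal arithmetic coder driven by the same model, up to a couple of bits of coder overhead:

    import math

    # Hypothetical probabilities the model assigns to each actual next token.
    token_probs = [0.25, 0.5, 0.125, 0.8]

    # Cross entropy of the input under the model, in bits.
    cross_entropy_bits = -sum(math.log2(p) for p in token_probs)

    # An ideal arithmetic coder spends ~ -log2 p(token) bits per token,
    # so the total compressed length is the same sum.
    compressed_length_bits = sum(-math.log2(p) for p in token_probs)

    print(cross_entropy_bits, compressed_length_bits)  # equal by construction

So a lower pretraining loss literally means the model compresses its training text better.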