
Comment by eyegor

2 years ago

I'm not familiar with most of these models in detail, but training time is generally less interesting to me than inference time. I don't care if it takes a month to train on $10k of GPU rentals if it can be deployed and run on a Raspberry Pi. I should definitely look into fastText, though.

As described in the paper, the gzip classifier doesn't appear to train at all; inference involves reading back through the entire training set.

One could surely speed this up by preprocessing the training set and snapshotting the resulting gzip state, but that wouldn't change the asymptotic complexity. In effect, the number of parameters equals the size of the entire training set. (Of course, lots of fancy models scale roughly like this too, so this isn't necessarily a loss.)
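
Concretely, my reading of the paper is that the whole "classifier" is a kNN over normalized compression distance (NCD). A minimal sketch (the function names and k are mine, not the paper's):

    import gzip
    from collections import Counter

    def clen(s: str) -> int:
        """Compressed length of a string under gzip."""
        return len(gzip.compress(s.encode()))

    def classify(query: str, train: list, k: int = 3) -> str:
        """kNN over normalized compression distance.
        "Training" is just keeping the raw (text, label) pairs;
        every query re-reads the whole training set."""
        cq = clen(query)
        dists = []
        for text, label in train:
            ct = clen(text)
            cqt = clen(query + " " + text)  # compress the concatenation
            ncd = (cqt - min(cq, ct)) / max(cq, ct)
            dists.append((ncd, label))
        dists.sort(key=lambda d: d[0])
        return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

Each query gets compressed against all N training examples, so inference is O(N) compressions no matter how much intermediate gzip state you cache.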

  • The gzip approach is much slower at inference time because, for every candidate, you need to compress the concatenation of the two strings (query + target). Intuitively, that should cost far more than a dot product of two embedding vectors.

    • The latter depends very strongly on how much computation is needed to compute those embedding vectors.

      If you run a GPT-3.5-sized model to compute that embedding (which would be a bit absurd, but if you really want GPT-3.5-quality classification, you may well be doing something like this), you're pushing the query through tens of billions of parameters and doing a correspondingly large number of FLOPs, which could be just as expensive as running gzip over your whole (small, private) training set. See the sketch below.
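
      To make that trade-off concrete, here's a sketch of the embedding route. The encoder below is a toy hashed bag-of-words stand-in; in the scenario above it would be a forward pass through a large model, which is where nearly all the per-query cost lives:

          import numpy as np
          from collections import Counter

          def embed(text, dim=64):
              # Toy stand-in for the expensive part: in practice this would
              # be a forward pass through a large (even GPT-3.5-scale) model.
              v = np.zeros(dim)
              for tok in text.lower().split():
                  v[hash(tok) % dim] += 1.0
              n = np.linalg.norm(v)
              return v / n if n else v

          def build_index(train):
              # One-time cost, paid per training set, not per query.
              vecs = np.stack([embed(text) for text, _ in train])
              return vecs, [label for _, label in train]

          def classify(query, vecs, labels, k=3):
              q = embed(query)       # one encoder call per query: the dominant cost
              sims = vecs @ q        # N dot products: nearly free by comparison
              top = np.argsort(-sims)[:k]
              return Counter(labels[i] for i in top).most_common(1)[0][0]

      With a heavyweight encoder, the per-query cost is the forward pass, not the N dot products; with gzip, the per-query cost is N compressions. Which one wins depends on N and on the encoder's size.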
