Comment by HarHarVeryFunny

2 days ago

> So it needs to know facts, albeit the currently accepted ones. Knowing the facts is a good way to compression data.

It's not a compression engine - it's just a statistical predictor.

Would it do better if it was incentivized to compress (i.e training loss rewarded compression as well as penalizing next-word errors)? I doubt it would make a lot of difference - presumably it'd end up throwing away the less frequently occurring "outlier" data in favor of keeping what was more common, but that would result in it throwing away the rare expert opinion in favor of retaining the incorrect vox pop.

Both compression engines and llm work by assigning scores to the next token. If you can guess the probability distribution of the next token you have a near perfect text compressor, and a near perfect llm. Yeah in the real world they have different trade-offs.

Here's a paper by deep mind. https://arxiv.org/pd7f/2309.10668 - titled LANGUAGE MODELING IS COMPRESSION

  • An LLM is a transformer of a specific size (number of layers, context width, etc), and ultimately number of parameters. A trillion parameter LLM is going to use all trillion parameters regardless of whether you train it on 100 samples or billions of them.

    Neural nets, including transformers, learn by gradient descent, according to the error feedback (loss function) they are given. There is no magic happening. The only thing the neural net is optimizing for is minimizing errors on the loss function you give it. If the loss function is next-token error (as it is), then that is ALL it is optimizing for - you can philosophize about what they are doing under the hood, and write papers about that ("we advocate for viewing the prediction problem through the lens of compression"), but at the end of the day it is only pursuant to minimizing the loss. If you want to encourage compression, then you would need to give an incentive for that (change the loss function).