Comment by willvarfar
2 years ago
For anyone interested in the equivalence between AI and compression, take a look at the Hutter Prize :) http://prize.hutter1.net/
Also worth a look is the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html - currently the world's best compressor is a neural network made by ... the renowned Fabrice Bellard, creator of ffmpeg and QEMU!
And I really dig these pages' refreshingly appropriate text-only style!
Especially compression algorithms that use arithmetic coding, with interval widths adjusted according to a prediction of what is likely to come next, are very similar. They adapt the arithmetic coding (https://en.wikipedia.org/wiki/Arithmetic_coding) to the context of the byte/bit being predicted, so the more accurate the predicted continuation, the more efficient the encoding. The task is very similar to that of transformers like GPT. A perfect prediction has almost no additional storage cost, because the arithmetic interval barely shrinks and thus almost no bits get emitted - but you do have to count the size of the decompressor to get a fair benchmark.
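To make that concrete, here's a minimal sketch (a toy order-N context model of my own, not any real compressor's algorithm): an adaptive model predicts P(next symbol), and the price arithmetic coding pays for a symbol is about -log2(P), so good predictions are nearly free.

    import math
    from collections import defaultdict

    def ideal_code_length(text, order=2):
        # Bits an arithmetic coder would need, driven by a simple adaptive
        # order-N context model with Laplace-smoothed counts.
        counts = defaultdict(lambda: defaultdict(int))
        alphabet = set(text)
        total_bits = 0.0
        for i, ch in enumerate(text):
            ctx = text[max(0, i - order):i]         # the last `order` characters
            seen = counts[ctx]
            denom = sum(seen.values()) + len(alphabet)
            p = (seen[ch] + 1) / denom              # model's prediction for `ch`
            total_bits += -math.log2(p)             # cost of encoding `ch`
            seen[ch] += 1                           # update the model afterwards
        return total_bits

    print(ideal_code_length("ab" * 10))               # very predictable -> few bits
    print(ideal_code_length("q8#zL1x!Tm2&Yd0@pK7$"))  # no pattern -> many bits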
How is it similar?
There’s been a lot of study going the other direction - using neural networks to aid the entropy prediction in classical compression algorithms - but I’m not seeing the conceptual link between how transformer/attention models work internally and how gzip works internally beyond “similar words are easy to compress”
I’m not seeing it because GPT representations are fixed-size vectors, not variable-size ones
An LLM or, well, any statistical model, is about prediction. As in, given some preceding input, what comes next?
One way to measure the accuracy of the model, i.e. its "intelligence", is to use the predictions to turn the input into just the differences from the prediction; if the model is good at predicting, there will be fewer differences and the input will compress better.
So seeing how well your model can compress some really big chunk of text is a very good objective measure of its strength, and a way to compare it against the strength of others.
So a competition is born! :)
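Roughly like this - assuming a hypothetical model object with a predict(context) method (not any real library's API), the model's log-loss in bits is, to within a couple of bits, the size an arithmetic coder driven by its predictions would output:

    import math

    def compression_score_bits(model, text):
        bits = 0.0
        for i, ch in enumerate(text):
            probs = model.predict(text[:i])   # P(next char | everything so far)
            bits += -math.log2(probs[ch])     # a perfect prediction costs ~0 bits
        return bits                           # lower = stronger model

    class UniformModel:
        # toy baseline that predicts every byte with equal probability
        def predict(self, context):
            return {chr(c): 1 / 256 for c in range(256)}

    print(compression_score_bits(UniformModel(), "hello"))   # 8 bits per character

And for a fair comparison, as the Hutter Prize does, you add the size of the decompressor/model itself to that number.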
gzip uses LZ and Huffman coding and not arithmetic coding with a predictor, so yes, these are not similar.
When you dive deep into the math, many things are fundamentally the same. Super-resolution is just glorified deconvolution. A single-layer perceptron is a linear-kernel SVM is a logistic regression. The FFT is just factorisation.
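To spell out one of those equivalences (a toy illustration of my own, not anything from this thread): a single-layer perceptron with a sigmoid output computes sigmoid(w·x + b), which is literally the logistic regression model; train it with cross-entropy loss and the two coincide.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def single_layer_perceptron(x, w, b):
        # one neuron: weighted sum of the inputs, then a sigmoid activation
        return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

    def logistic_regression(x, w, b):
        # textbook logistic regression: P(y = 1 | x) = sigmoid(w.x + b)
        return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

    x, w, b = [1.0, 2.0], [0.5, -0.25], 0.1
    assert single_layer_perceptron(x, w, b) == logistic_regression(x, w, b)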
What does it take to learn math at such a deep level?
Even just undergraduate linear algebra & calc + real analysis & prob/stats... With good teachers to draw the connections
Stuff like information theory is amazing because of this... but easy to miss if you're just working through a drier text
The crazier version for me: we have a physics professor on our team whose grounding intuitions use ideas like black holes as mental models... which doesn't work as well for the rest of the team
(And a lot of modern ML/AI feels very engineered and interchangeable after that, like 'metric function of the month')
Either a year in college, or your entire lifetime. Not that it really matters since mathematically, they're both fundamentally the same, they're both just numbers.
You don’t really need to go that deep, just broad
You can get a lot just by reading Wikipedia and following links in each article
It’s hard to grasp the formulas and proofs sometimes, but if you only care about understanding the concepts, there are a lot of dots to connect
decades of trying, in my case. and I still only understand a little. I focused a lot on linear algebra, graphs, and probability - most of abstract math is completely outside my understanding.
other folks pick it up quickly in college.
Numerical methods are Ax = b
Something important to note is that the authors use normalized compression distance (NCD). NCD is a computable approximation of the normalized information distance, which is defined in terms of Kolmogorov complexity.
This is a pretty old idea, see [1,2]. Old but still very useful, like perceptrons.
[1] Li and Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications.
[2] Clustering by compression. https://arxiv.org/pdf/cs/0312044
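For the curious, a minimal sketch of NCD with gzip standing in as the computable substitute for Kolmogorov complexity (the strings are made-up examples, and the paper's exact setup may differ):

    import gzip

    def C(data: bytes) -> int:
        # compressed length as a stand-in for Kolmogorov complexity K(data)
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    s1 = b"the quick brown fox jumps over the lazy dog"
    s2 = b"the quick brown fox leaps over the lazy cat"
    s3 = b"colourless green ideas sleep furiously tonight"
    # the similar pair should usually score lower, though gzip's header
    # overhead makes the values noisy for strings this short
    print(ncd(s1, s2), ncd(s1, s3))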
Fabrice Bellard is such a living legend. Add to that list QuickJS, jslinux, tcc and TinyGL.
yes, absolutely!
This kind of "compression" is essentially "understanding through theories", where the theories are like those from physics:
a lot like stories that explain many things with just the same few "characters". In this context a "character" is more like a concept - an atom, for example, would fit the bill.