Comment by MR4D
2 months ago
I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.
The question I hear you raising seems to be along the lines of: can we use a new compression method to get better resolution (fidelity to the original) at a much smaller size?
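For concreteness, here's roughly what "distillation" looks like in loss terms: a minimal sketch assuming PyTorch, not anything specific to the models being discussed. The student is trained to match the teacher's softened output distribution, and whatever the teacher knows that never shows up in those targets is simply gone, much like detail discarded by a JPEG encoder.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then push the student's
        # distribution toward the teacher's. The T^2 factor keeps gradient
        # magnitudes comparable across temperatures (standard Hinton-style loss).
        t = temperature
        soft_targets = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)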
> in that a distilled model of an LLM is like a JPEG of a photo
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.
Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law, the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.
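For what it's worth, the "smaller set of first principles" point can be made concrete. Sketching the standard textbook derivation (nothing specific to this thread): a single variational principle reproduces Newton's second law.

    % Euler-Lagrange equation from the stationary-action principle:
    \frac{d}{dt}\frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q} = 0
    % With L = T - V = \tfrac{1}{2} m \dot{q}^2 - V(q), this gives
    m\ddot{q} = -\frac{\partial V}{\partial q} = F
    % i.e. Newton's second law drops out of one variational principle.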
(hence https://news.ycombinator.com/item?id=34724477 )
Well, a JPEG can be thought of as a compression of the natural scene the photograph was taken of.
And that lets us answer the question of why quantization works, by analogy with a lossy format: quantization just trades accuracy for space but still gives us good-enough output, just like a lossy JPEG.
To reiterate: we can lose a lot of data (have incomplete data) and still have a perfectly viewable JPEG (or a perfectly listenable MP3; same idea).
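To make the quantization point concrete, here's a rough numpy sketch (illustrative only: per-tensor int8 quantization, not how any particular inference stack does it). Round float32 weights to 8-bit integers plus a scale, reconstruct, and you get back something close but not identical; the rounding error is the compression artifact.

    import numpy as np

    def quantize_int8(w):
        # One scale per tensor: the simplest possible scheme.
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Lossy: the rounding error never comes back.
        return q.astype(np.float32) * scale

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize(q, s)
    print("max abs error:", np.abs(w - w_hat).max())  # small, but not zero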
This brings up an interesting thought too. A photo is just a lossy representation of the real world.
So it's lossy all the way down with LLMs, too.
Reality > Data created by a human > LLM > Distilled LLM
What you say makes sense, but is there the possibility that, because it's compressed, it can generalize better? In the spirit of the bias/variance tradeoff.
Yeah, but it does seem that they're getting high percentages for the distilled models' accuracy against the larger model. If the smaller model is 90% as accurate as the larger one but uses far less than 90% of the parameters, then surely that counts as a win.