Comment by MR4D
2 months ago
I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.
The question I hear you raising seems to be along the lines of: can we use a new compression method to get better resolution (fidelity to the original) at a much smaller size?
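For concreteness, here's roughly what "distillation" looks like in loss terms: a minimal sketch assuming PyTorch, not anything specific to the models being discussed. The student is trained to match the teacher's softened output distribution, and whatever the teacher knows that never shows up in those targets is simply gone, much like detail discarded by a JPEG encoder.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions with a temperature, then push the student's
        # distribution toward the teacher's. The T^2 factor keeps gradient
        # magnitudes comparable across temperatures (standard Hinton-style loss).
        t = temperature
        soft_targets = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)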
> in that a distilled model of an LLM is like a JPEG of a photo
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.
Finding minimum complexity explanations isn't what finding natural laws is about, I'd say. It's considered good practice (Occam's razor), but it's often not really clear what the minimal model is, especially when a theory is relatively new. That doesn't prevent it from being a natural law, the key criterion is predictability of natural phenomena, imho. To give an example, one could argue that Lagrangian mechanics requires a smaller set of first principles than Newtonian, but Newton's laws are still very much considered natural laws.
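For what it's worth, the "smaller set of first principles" point can be made concrete. Sketching the standard textbook derivation (nothing specific to this thread): a single variational principle reproduces Newton's second law.

    % Euler-Lagrange equation from the stationary-action principle:
    \frac{d}{dt}\frac{\partial L}{\partial \dot{q}} - \frac{\partial L}{\partial q} = 0
    % With L = T - V = \tfrac{1}{2} m \dot{q}^2 - V(q), this gives
    m\ddot{q} = -\frac{\partial V}{\partial q} = F
    % i.e. Newton's second law drops out of one variational principle.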
(hence https://news.ycombinator.com/item?id=34724477 )
Well, a JPEG can be thought of as a compression of the natural scene the photograph was taken of.
And that lets us answer the question of why quantization works, by analogy with a lossy format: quantization just trades accuracy for space but still gives us good-enough output, just like a lossy JPEG.
To reiterate: we can lose a lot of data (have incomplete data) and still have a perfectly viewable JPEG (or a perfectly listenable MP3; same idea).
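To make the quantization point concrete, here's a rough numpy sketch (illustrative only: per-tensor int8 quantization, not how any particular inference stack does it). Round float32 weights to 8-bit integers plus a scale, reconstruct, and you get back something close but not identical; the rounding error is the compression artifact.

    import numpy as np

    def quantize_int8(w):
        # One scale per tensor: the simplest possible scheme.
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Lossy: the rounding error never comes back.
        return q.astype(np.float32) * scale

    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize(q, s)
    print("max abs error:", np.abs(w - w_hat).max())  # small, but not zero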
This brings up an interesting thought too. A photo is just a lossy representation of the real world.
So it's lossy all the way down with LLMs, too.
Reality > Data created by a human > LLM > Distilled LLM
What you say makes sense, but is there the possibility that, because it's compressed, it can generalize better? In the spirit of the bias/variance tradeoff.
Yeah, but it does seem that they're getting high percentages for the distilled models' accuracy against the larger model. If the smaller model is 90% as accurate as the larger one but uses far less than 90% of the parameters, then surely that counts as a win.