Comment by xg15
2 months ago
I think the fact alone that distillation and quantization can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.
If we had that understanding, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. In that case it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.
Yet this is what happens - the distilled or quantized models often come very close to the original model.
So I think there is still plenty of low-hanging fruit to pick.
We have a partial understanding of why distillation works: it is explained by the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network for some subnetworks to end up in "winning" states. Then you can distill those winning subnetworks into a smaller network.
Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
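Roughly, the magnitude-pruning step behind that idea looks something like the sketch below (the layer size and keep ratio are arbitrary choices of mine, and the full lottery-ticket procedure would also rewind the surviving weights to their initial values and retrain the sparse subnetwork):

    import torch
    import torch.nn as nn

    # Stand-in for one trained layer of the big, randomly initialised network.
    big = nn.Linear(1024, 1024)
    keep_ratio = 0.2  # keep the top 20% of weights by magnitude (illustrative)

    with torch.no_grad():
        w = big.weight
        # Threshold below which weights get pruned away.
        k = int((1 - keep_ratio) * w.numel())
        threshold = w.abs().flatten().kthvalue(k).values
        mask = (w.abs() > threshold).float()  # 1 = part of the "winning ticket"
        w.mul_(mask)                          # everything else is zeroed out

    print(f"kept {int(mask.sum())} of {mask.numel()} weights")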
So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" into a larger 16B-parameter model after distillation from a superior model, or is it forever stuck at 8B parameters that can only be fine-tuned?
So more 'mature' models might arise in the near future with fewer params and better benchmarks?
That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.
"Better", but not better than the model they were distilled from, at least that's how I understand it.
They might also be more biased and less able to adapt to new technology. Interesting times.
I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.
The question I hear you raising seems to be along the lines of: can we use a new compression method to get better resolution (fidelity to the original) at a much smaller size?
> in that a distilled model of an LLM is like a JPEG of a photo
That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.
And what is compression but finding the minimum amount of information required to reproduce a phenomenon? I.e. discovering natural laws.
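You can see that framing in miniature with an off-the-shelf compressor (zlib here is just a stand-in; an LLM's training objective is a much fancier version of "exploit the regularities"):

    import os
    import zlib

    structured = b"the cat sat on the mat. " * 100  # highly predictable text
    random_ish = os.urandom(len(structured))        # no structure to exploit

    print(len(structured), len(zlib.compress(structured)))  # shrinks a lot
    print(len(random_ish), len(zlib.compress(random_ish)))  # barely shrinks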
hence https://news.ycombinator.com/item?id=34724477
Well, a JPEG can be thought of as a compression of the part of the natural world whose photograph was taken.
This brings up an interesting thought too. A photo is just a lossy representation of the real world.
So it's lossy all the way down with LLMs, too.
Reality > Data created by a human > LLM > Distilled LLM
What you say makes sense, but is there a possibility that, because it's compressed, it can generalize better? In the spirit of the bias/variance tradeoff.
Yeah, but it does seem that they're getting high percentages for the distilled models' accuracy against the larger model. If the smaller model is 90% as accurate as the larger one but uses much less than 90% of the parameters, then surely that counts as a win.
Nope, it's quite obvious why distillation works. If you just predict the next token, then the only information you can use to compute the loss is THE one expected token. Whereas if you distill, you can also use the (typically just a few) logits from the teacher.
"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.
Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them as well. That way, the model learns faster, because it gets more information in each update.
(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
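A minimal sketch of that loss, assuming PyTorch and a single next-token position (the temperature and mixing weight are illustrative, and real setups often keep only the teacher's top-k logits rather than the full distribution):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, target_ids,
                          temperature=2.0, alpha=0.5):
        # Soft targets: the teacher's whole distribution over the vocabulary,
        # so "other plausible names" also carry some weight.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd = F.kl_div(log_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2

        # Hard target: the single "correct" token from the dataset.
        ce = F.cross_entropy(student_logits, target_ids)

        # Blend the two signals (alpha is an illustrative mixing weight).
        return alpha * kd + (1 - alpha) * ce

    # student_logits, teacher_logits: (batch, vocab); target_ids: (batch,)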
For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
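As a toy illustration of what that means (per-tensor int8 with a single scale factor; real schemes use per-channel or per-group scales, but the idea is the same):

    import numpy as np

    def quantize_int8(w):
        scale = np.abs(w).max() / 127.0                  # one scale per tensor
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale              # approximate originals

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.abs(w - dequantize(q, s)).max())            # small rounding error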
> still have no real comprehensive understanding of how the models work.
We do understand how they work; we just have not optimised their usage.
For example, take someone who has a good general understanding of how an ICE or EV car works: even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.
But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.
Your example is somewhat inadequate. We _fundamentally_ don't understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing "Hmm" to "Wait" and seeing what happens.
Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.
Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.
> _fundamentally_ don’t understand how deep learning systems works.
It's like saying we don't understand how quantum chromodynamics works. Very few people do, and it's the kind of knowledge that isn't easily distilled for the masses in a digestible, popsci way.
Look into how older CNNs work -- we have very good visual/accessible/popsci materials on how they work.
I'm sure we'll have that for LLMs, but it's not worth it for the people who could produce that kind of material to do so now, when the field is moving so rapidly; their time is much better spent improving the LLMs.
The kind of progress being made leads me to believe there absolutely ARE people who know how the LLMs work, and that they're not just a bunch of monkeys randomly throwing things at GPUs and seeing what sticks.
Isn't that just scale? Even small LLMs have more parts than any car.
LLMs are more analogous to economics, psychology, politics -- it is possible there's a core science with explicability, but the systems are so complex that even defining the question is hard.
We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.
It feels less like a word prediction algorithm and more like a world model compression algorithm. Maybe we tried to create one and accidentally created the other?
Eh, I feel like that's mostly just down to this: yes, transformers are a "next token predictor", but during instruct fine-tuning, the attention-related wagon slapped on the back is partially hijacked as a bridge from input tokens to sequences of connections in the weights.
For example, if I ask "If I have two foxes and I take away one, how many foxes do I have?", I reckon attention has been hijacked to essentially highlight the "if I have x and take away y then z" portion of the query and connect it to a learned sequence from readily available training data (apparently the whole damn Internet), where there are plenty of examples of that math-question trope, just with some object type other than foxes.
I think we could probably prove it by tracing the hyperdimensional space the model exists in and asking it variants of the same question, looking for hotspots in that space that would indicate it's using those same sequences (with attention branching off to ensure it replies with the correct object type that was referenced).
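Something like the sketch below would be a crude first stab at that (the model choice and the mean-pooled cosine-similarity probe are just illustrative assumptions; real interpretability work traces individual attention heads and circuits):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    prompts = [
        "If I have two foxes and I take away one, how many foxes do I have?",
        "If I have two apples and I take away one, how many apples do I have?",
    ]

    states = []
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
            # Mean-pool the last layer into one vector per prompt.
            states.append(out.hidden_states[-1].mean(dim=1).squeeze(0))

    # High similarity would hint the same learned "take away" machinery is
    # being reused regardless of the object type in the question.
    print(torch.cosine_similarity(states[0], states[1], dim=0).item())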
The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.