Comment by ZeljkoS
3 months ago
We have a partial understanding of why distillation works: it is explained by the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I understand correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network so that some subnetworks end up in "winning" states. Then you can distill those winning subsystems into a smaller network.
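To make that concrete, here is a minimal sketch of one prune-and-rewind round from the lottery-ticket paper linked above, assuming PyTorch. The names find_winning_ticket, train_fn and prune_frac are illustrative, not from any real API:

    import copy
    import torch

    def find_winning_ticket(model, train_fn, prune_frac=0.8):
        # One lottery-ticket round: train the dense network, mask out the
        # smallest weights, then rewind the survivors to their original
        # random initialization.
        init_state = copy.deepcopy(model.state_dict())  # remember the random init
        train_fn(model)                                 # caller-supplied training loop

        # Global magnitude pruning: drop the prune_frac smallest weights overall
        # (all parameters treated alike, for simplicity).
        all_weights = torch.cat([p.abs().flatten() for p in model.parameters()])
        threshold = torch.quantile(all_weights, prune_frac)
        masks = {name: (p.abs() > threshold).float()
                 for name, p in model.named_parameters()}

        # Rewind: surviving weights go back to their initial values,
        # pruned weights are zeroed out.
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.copy_(init_state[name] * masks[name])
        return model, masks

The point being: the "winning ticket" is defined relative to that particular random initialization, which is why you can't just start small.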
Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
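The "distill" step itself is usually the classic Hinton-style distillation objective: train the small student to match the big teacher's softened output distribution. A minimal sketch, again assuming PyTorch; distillation_loss, T and alpha are illustrative names:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Blend hard-label cross-entropy with a KL term that pushes the
        # student's softened distribution toward the teacher's.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale so gradient magnitudes stay comparable across T
        hard_loss = F.cross_entropy(student_logits, labels)
        return alpha * soft_loss + (1 - alpha) * hard_loss

The temperature T softens both distributions, so the student learns from the teacher's full output distribution rather than only its top prediction.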
So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" into a larger 16B-parameter model after distillation from a superior model, or is it forever stuck at 8B parameters that can only be fine-tuned?
So more 'mature' models might arise in the near future with fewer parameters and better benchmark results?
That's been happening consistently for over a year now. Small models today are better than big models from a year or two ago.
"Better", but not better than the model they were distilled from, at least that's how I understand it.
I think this is how the "child brain" works too: the better the parents and the environment are, the better the child develops :)
They might also be more biased and less able to adapt to new technology. Interesting times.