Comment by janalsncm

6 months ago

The bitter-er lesson is that distillation from bigger models works pretty damn well. It’s great news for the GPU-poor, not so great for the people training the models we distill from.

Distillation is great for researchers and hobbyists.

But nearly all frontier models have anti-distillation ToS, so distillation is out of the question for Western commercial companies like Apple.

  • Even if Apple has to train an LLM from scratch, they can then distill it themselves and deploy the smaller model on edge devices. From that point on, inference is effectively free to them.
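
For anyone unfamiliar with the mechanics: the core of distillation is just training the student to match the teacher's temperature-softened output distribution. A minimal sketch of the standard loss (the Hinton-style KL objective; function names here are illustrative, not from any particular library):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution,
    # exposing the teacher's relative confidence in near-miss classes.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across T.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# A student that exactly matches the teacher incurs zero loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```

In practice this is averaged over a batch and usually mixed with the ordinary cross-entropy on hard labels, but the economics above follow from this simple objective: the expensive part is the teacher's forward passes, not the student's training.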