Comment by kamranjon

6 days ago

This is a pretty awful take. Everyone understands they are modifying the weights - that is the point. It’s not like these models were released with all of the weights perfectly accounted for and changing them in any way ruins them. The awesome thing about fine-tuning is that the weights are malleable and you have a great base to start from.

Also, the basic premise that knowledge injection is a bad use case seems flawed? There are countless open models released by Google that completely fly in the face of this. MedGemma is just Gemma 3 4b fine-tuned on a ton of medical datasets, and it’s measurably better than stock Gemma within the medical domain. Maybe it lost some ability to answer trivia about Minecraft in the process, but isn’t that kinda implied by “fine-tuning” something? You’re making it purpose-built for a specific domain.

MedGemma gets its domain expertise from pre-training on medical datasets, not fine-tuning. It’s pretty uncharitable to call the post an awful take if you’re going to get that wrong.

  • You can call it pre-training, but it’s based on Gemma 3 4b, which was already pre-trained on a general corpus. It’s the same process, so you’re just splitting hairs. That is kind of my point: fine-tuning is just more training (see the sketch below). If you’re going to say that fine-tuning is useless, you’re basically saying that all instruct-tuned models are useless as well, because they are all just pre-trained models that have been subsequently trained (fine-tuned) on instruction datasets.
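
To make the “same process” point concrete, here’s a minimal sketch of continued training with Hugging Face transformers. The checkpoint name and the data file are placeholder assumptions for illustration, not MedGemma’s actual recipe; the point is that the objective is the same next-token prediction used in pre-training, only the data changes:

```python
# Sketch: "fine-tuning" is just more gradient steps on new data.
# Checkpoint name and medical_texts.jsonl are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-pt")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-pt")

dataset = load_dataset("json", data_files="medical_texts.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-medical", per_device_train_batch_size=1),
    train_dataset=dataset,
    # mlm=False gives the plain next-token-prediction loss used in pre-training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```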

> It’s not like these models were released with all of the weights perfectly accounted for and changing them in any way ruins them.

So more imperfect is better?

Of course the model’s parameters leave room for improvement somewhere in a vector space of many billions of elements. But what circuitous path is that, which the original training didn’t already find?

By definition you can’t find it if you don’t include all the original data alongside the tuning data. You have radically changed the optimization surface, with no contribution from the previous data at all.
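
For what it’s worth, the standard mitigation is exactly what you describe: replay a slice of the original distribution alongside the tuning data so the optimization surface stays partly anchored. A rough sketch with the datasets library, where both file names are hypothetical (and, as you say, the true pre-training corpus usually isn’t available, so a public general corpus has to stand in as a proxy):

```python
# Sketch of "replay": mix a proxy for the original training distribution back
# into the tuning set so the loss surface isn't shaped by the new data alone.
# Both file names are hypothetical placeholders.
from datasets import interleave_datasets, load_dataset

domain = load_dataset("json", data_files="medical_tuning.jsonl", split="train")
general = load_dataset("json", data_files="general_proxy.jsonl", split="train")

# On average, 9 domain examples for every 1 replayed general example.
mixed = interleave_datasets([domain, general], probabilities=[0.9, 0.1], seed=0)
```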

The one use case that makes sense is sacrificing functionality to get better at a narrow problem.

You are correct about that.

A man who burns his own house down may understand what he is doing and do it intentionally - but without any further information he still appears to be wasting his time and doing something stupid. There isn’t any contradiction between something being a waste of time and people doing it on purpose - indeed, the point of the article is to get some people to change what they are purposefully doing.

He’s proposing alternatives he thinks are superior. He might well be right, too. I don’t have a horse in the race, but LoRA seems like a more satisfying approach to getting a result than retraining the model, and giving LLMs tools seems to be proving more effective too.

  • It’s possible I misinterpreted the gist of the article a bit - in my mind nobody is doing fine-tuning these days without using techniques like LoRA or DoRA. But they are using those techniques because they are computationally efficient and convenient, not because they perform significantly better than full fine-tuning.
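
For anyone unfamiliar, the appeal is that LoRA freezes the base weights entirely and only trains small low-rank adapter matrices on top, so a bad run can’t wreck the base model and the adapter can later be merged or discarded. A minimal sketch with the peft library, where the checkpoint name and target modules are assumptions for illustration:

```python
# Sketch: LoRA via the peft library. The base weights stay frozen; only small
# low-rank adapters on the attention projections receive gradient updates.
# The checkpoint name and target_modules are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-pt")

config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # where the adapters attach
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically a fraction of a percent of the base
```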