
Comment by kouteiheika

6 days ago

> You wrote exactly so I'm going to say "no". [...] If you think charitably, the author is right.

No, the author is objectively wrong. Let me quote the article and clarify myself:

> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.

This is just incorrect. LoRA is exactly like normal fine-tuning in this particular context. The author's argument is that you should use LoRA because it doesn't do any "destructive overwriting", but in that respect it's no different from normal fine-tuning.
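To make that concrete, here's a minimal sketch (plain PyTorch, made-up shapes, not any library's API) of why a LoRA update is mathematically just a low-rank additive change to the same weight matrices full fine-tuning modifies. Once you merge it, the base weights are overwritten all the same:

```python
import torch

# Illustrative shapes only; "r" is the LoRA rank, "alpha" the usual scaling factor.
d_out, d_in, r, alpha = 512, 512, 8, 16

W = torch.randn(d_out, d_in)      # pretrained weight (frozen while training LoRA)
A = torch.randn(r, d_in) * 0.01   # trainable low-rank factor
B = torch.randn(d_out, r) * 0.01  # trainable low-rank factor
# (In real LoRA setups B starts at zero; small random values here just
#  to make the check below non-trivial.)

delta_W = (alpha / r) * (B @ A)   # the effective weight change LoRA learns
W_merged = W + delta_W            # merging overwrites the base weights

x = torch.randn(d_in)
adapter_path = W @ x + (alpha / r) * (B @ (A @ x))
# The side-adapter computation and the merged weights give the same output:
assert torch.allclose(adapter_path, W_merged @ x, atol=1e-4)
```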

In fact, there's evidence that LoRA can actually make the problem worse[1]:

> we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions [...] LoRA fine-tuned models with intruder dimensions are inferior to fully fine-tuned models outside the adaptation task’s distribution, despite matching accuracy in distribution.

[1] -- https://arxiv.org/pdf/2410.21228
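If you want a concrete handle on what "intruder dimensions" means, here's a rough sketch of the kind of check the paper describes (my paraphrase, not the authors' code; the top-k count and similarity threshold are placeholders):

```python
import torch

def intruder_dimensions(W_base, W_tuned, k=10, sim_threshold=0.6):
    """Flag top singular vectors of the tuned weight matrix that don't
    resemble any singular vector of the base matrix -- roughly the paper's
    'intruder dimensions'. k and sim_threshold are my guesses, not the
    paper's exact settings."""
    U_base, _, _ = torch.linalg.svd(W_base, full_matrices=False)
    U_tuned, _, _ = torch.linalg.svd(W_tuned, full_matrices=False)
    intruders = []
    for i in range(min(k, U_tuned.shape[1])):
        sims = (U_base.T @ U_tuned[:, i]).abs()  # |cosine| to every base singular vector
        if sims.max().item() < sim_threshold:
            intruders.append(i)
    return intruders
```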

To be fair, "if you don't know what you're doing, prefer LoRA over normal fine-tuning" is, in general, good advice in my opinion. But that's not what the article is saying.

> But on what basis do you say that "most people do"?

On the basis of seeing what the common practice is, at least in the open (in the local LLM community and in the research space).

> I would just say this: there are many good reasons to not merge

I never said that there aren't good reasons to not merge.

> It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.

No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.

If we zoom out a bit to one point he's trying to make there: while LoRA is fine-tuning, I think it's fair to call it a more modular approach than base SFT.

That said, I find the article as a whole off-putting. It doesn't strengthen one's claims to call things stupid or a total waste of time. It deals in absolutes, and rants in a way that misleads and forgoes nuance.

> No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.

I get that. So what can we do?

One option: when criticizing, write as clearly as possible. Err on the side of overexplaining. From my point of view, it took a back-and-forth for your criticism to become clear.

I'll give an example of where more charity and synthesis would be welcome:

>> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.

> This is just incorrect.

"This" is rather unclear. There are many claims in the quote -- which are you saying are incorrect? Possibilities include:

1. "Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting."

Sometimes, yes. More often than not? Maybe. Categorically? I'm not sure. [1]

2. "When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects."

Yes, this can happen. Mitigations can reduce the chances.

3. "Instead, use modular methods like [...] adapters."

Your elision dropped some important context. Here's the full quote:

> Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.

This logic is sound, almost tautologically so: the original model is left unchanged.
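To make that concrete, here's a minimal sketch (plain PyTorch; the class and names are mine, not any library's API) of an un-merged adapter around a frozen base layer -- unplug the adapter and you get the original behaviour back exactly:

```python
import torch
import torch.nn as nn

class LinearWithAdapter(nn.Module):
    """A frozen base layer plus a removable low-rank adapter on the side."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False      # the base weights are never touched
        self.down = nn.Linear(base.in_features, r, bias=False)
        self.up = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)   # adapter starts as a no-op
        self.enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.up(self.down(x))
        return out

layer = LinearWithAdapter(nn.Linear(16, 16))
x = torch.randn(4, 16)
with torch.no_grad():
    layer.enabled = False    # unplug the adapter...
    y_without = layer(x)
    y_base = layer.base(x)
# ...and the base model's behaviour is back, bit for bit.
assert torch.equal(y_without, y_base)
```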

To get more specific: if one's bolted-on LoRA module destroyed some knowledge, one can take that into account and compensate. Perhaps use different LoRA modules for different subtasks, then delegate with a mixture of experts? (I haven't experimented with this particular architecture, so maybe it isn't a great example -- but even if it falls flat, it doesn't undermine the general shape of my argument.)
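For illustration only, here's a toy sketch of that delegation idea, with dummy functions standing in for per-subtask LoRA-augmented models (the routing rule and names are made up):

```python
from typing import Callable, Dict

# Dummy adapters standing in for per-subtask LoRA-augmented models.
adapters: Dict[str, Callable[[str], str]] = {
    "code":    lambda p: f"[code adapter] {p}",
    "general": lambda p: f"[general adapter] {p}",
}

def route(prompt: str) -> str:
    """Stand-in router; a real system might use a small classifier here."""
    return "code" if ("def " in prompt or "SELECT" in prompt.upper()) else "general"

def answer(prompt: str) -> str:
    # Each request only touches the adapter for its own subtask, so one
    # subtask's training can't clobber another subtask's behaviour.
    return adapters[route(prompt)](prompt)

print(answer("SELECT * FROM users;"))   # handled by the "code" adapter
```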

In summary, after going sentence by sentence, I see one sentence that is dubious, but I don't think it is the same one you would point to.

[1] I don't know if this is considered a "settled" matter. Even if it were considered "settled" in ML research, that wouldn't meet my bar -- I have a relatively low opinion of ML research in general (the writing quality, the reproducibility, the experimental setups, the quality of the thinking!, the care put into understanding previous work).