Comment by xpe

5 days ago

> LoRA does exactly the same thing as normal fine-tuning

You wrote "exactly", so I'm going to say "no". To clarify what I mean: LoRA seeks to accomplish a similar goal to "vanilla" fine-tuning but with a different method (freezing the existing model weights and training small adapter matrices whose low-rank product is added to the original weights). LoRA isn't exactly the same mathematically either; it is a low-rank approximation (as you know).
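For concreteness, here's a minimal sketch of what a LoRA layer does (my own PyTorch illustration, not code from either article; names and hyperparameters are arbitrary):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear layer plus a trainable low-rank update."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # freeze W_0; only A and B train
            out_f, in_f = base.weight.shape
            self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # rank x in
            self.B = nn.Parameter(torch.zeros(out_f, rank))  # out x rank, starts at zero
            self.scale = alpha / rank

        def forward(self, x):
            # y = x W_0^T + scale * x (B A)^T, where delta_W = B @ A has rank <= rank
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)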

> LoRA doesn't add "isolated subnetworks"

If you think charitably, the author is right. LoRA weights are isolated in the sense that they are separate from the base model. See e.g. https://www.vellum.ai/blog/how-we-reduced-cost-of-a-fine-tun... "The end result is we now have a small adapter that can be added to the base model to achieve high performance on the target task. Swapping only the LoRA weights instead of all parameters allows cheaper switching between tasks. Multiple customized models can be created on one GPU and swapped in and out easily."
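Here's a hypothetical sketch of that swapping (the helper and checkpoint file names are my invention): the base model is loaded once, and only the small adapter tensors change per task:

    import torch

    # One base model in memory; per-task adapter state dicts on disk
    # (hypothetical checkpoint files).
    adapters = {
        "summarize": torch.load("summarize_lora.pt"),
        "classify": torch.load("classify_lora.pt"),
    }

    def activate(model, task):
        # strict=False: only the adapter tensors are replaced;
        # the frozen base weights are untouched.
        model.load_state_dict(adapters[task], strict=False)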

> you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do

Yes, one can do that. But on what basis do you say that "most people do"? Without having collected a sample of usage myself, I would just say this: there are many good reasons not to merge (see the link above), e.g. less storage space if you have multiple adapters and easier swapping between tasks. On the other hand, if keeping the adapter separate slows inference unacceptably, then merge.
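To make the merge itself concrete, a sketch (again mine, building on the LoRALinear above): folding ∆W into W_0 removes the extra matmul at inference, at the cost of the cheap per-task swapping:

    @torch.no_grad()
    def merge_lora(layer: LoRALinear) -> nn.Linear:
        # W = W_0 + scale * (B @ A): after this, inference pays no adapter
        # overhead, but switching tasks means reloading full weights.
        layer.base.weight += layer.scale * (layer.B @ layer.A)
        return layer.base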

> This highlights to me that the author doesn't know what they're talking about.

It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.

> You wrote exactly so I'm going to say "no". [...] If you think charitably, the author is right.

No, the author is objectively wrong. Let me quote the article and clarify myself:

> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.

This is just incorrect. In this particular respect LoRA is exactly like normal fine-tuning. The author's argument is that you should use LoRA because it doesn't do any "destructive overwriting", but in that respect it's no different from normal fine-tuning: the model's effective weights still change.

In fact, there's evidence that LoRA can actually make the problem worse[1]:

> we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions [...] LoRA fine-tuned models with intruder dimensions are inferior to fully fine-tuned models outside the adaptation task’s distribution, despite matching accuracy in distribution.

[1] -- https://arxiv.org/pdf/2410.21228
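Roughly, the paper's measurement works like this (my paraphrase of the method, not the authors' code; the threshold here is illustrative): a top singular vector of the fine-tuned weight matrix that is nearly orthogonal to every singular vector of the base matrix counts as an "intruder":

    import torch

    def count_intruders(W0, W_tuned, k=10, threshold=0.5):
        U0, _, _ = torch.linalg.svd(W0)       # base left-singular vectors
        U1, _, _ = torch.linalg.svd(W_tuned)  # fine-tuned left-singular vectors
        # For each of the top-k tuned vectors, take the max cosine similarity
        # against all base vectors; low similarity => intruder dimension.
        sims = (U1[:, :k].T @ U0).abs().max(dim=1).values
        return int((sims < threshold).sum())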

To be fair, "if you don't know what you're doing, prefer LoRA over normal fine-tuning" is, in general, good advice in my opinion. But that's not what the article is saying.

> But on what basis do you say that "most people do"?

On the basis of seeing what the common practice is, at least in the open (in the local LLM community and in the research space).

> I would just say this: there are many good reasons to not merge

I never said that there aren't good reasons to not merge.

> It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.

No, I'm just tired of constantly seeing a torrent of misinformation from people who neither know much about how these models actually work nor have done any significant work on their internals, yet write about them with authority.

  • If we zoom out a bit to one point he’s trying to make there: while LoRA is fine-tuning, I think it’s fair to call it a more modular approach than base SFT.

    That said, I find the article as a whole off-putting. It doesn’t strengthen one’s claims to call things stupid or a total waste of time. It deals in absolutes, and rants in a way that misleads and forgoes nuance.

  • > No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.

    I get that. So what can we do?

    One option is when criticizing, write as clearly as possible. Err on the side of overexplaining. From my point of view, it took a back-and-forth for your criticism to become clear.

    I'll give an example where more charity and synthesis would have been welcome:

    >> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.

    > This is just incorrect.

    "This" is rather unclear. There are many claims in the quote -- which are you saying are incorrect? Possibilities include:

    1. "Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting."

    Sometimes, yes. More often than not? Maybe. Categorically? I'm not sure. [1]

    2. "When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects."

    Yes, this can happen. Mitigations (e.g. a lower learning rate, mixing pre-training data into the fine-tuning set, or regularizing toward the base weights) can reduce the chances.

    3. "Instead, use modular methods like [...] adapters."

    Your elision dropped some important context. Here's the full quote:

    > Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.

    This logic is sound, almost by tautology: the original model is unchanged.

    To get more specific: if one's bolted-on LoRA module destroyed some knowledge, one can take that into account and compensate. Perhaps use different LoRA modules for different subtasks, then delegate with a mixture-of-experts-style router? (I haven't experimented with this particular architecture, so maybe it isn't a great example -- but even if it falls flat, it doesn't undermine the general shape of my argument. A rough sketch follows at the end of this comment.)

    In summary, after going sentence by sentence, I see one sentence that is dubious, but I don't think it is the same one you would point to.

    [1] I don't know if this is considered a "settled" matter. Even if it was considered "settled" in ML research, that wouldn't meet my bar -- I have a relatively low opinion of ML research in general (the writing quality, the reproducibility, the experimental setups, the quality of the thinking!, the care put into understanding previous work).
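    The sketch promised above (hypothetical and untested; all names are made up): one adapter checkpoint per subtask, with a tiny router -- a small classifier or even a keyword heuristic -- picking which one to load before generating:

        import torch

        # Hypothetical: one adapter checkpoint per subtask.
        ROUTES = {"sql": "sql_lora.pt", "email": "email_lora.pt"}

        def run(model, route, prompt):
            task = route(prompt)  # router: classifier or heuristic -> subtask label
            model.load_state_dict(torch.load(ROUTES[task]), strict=False)
            return model.generate(prompt)  # placeholder generation call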

Sorry to be a downer but basically every statement you’ve made above is incorrect.

  • > Sorry to be a downer but basically every statement you’ve made above is incorrect.

    You don't need to apologize for being a "downer", but it would be better if you were specific in your criticisms.

    I welcome feedback, but it has to be specific and actionable. If I'm wrong, set me straight.

    This is a two-way street: if you were unfair or uncharitable or wrong, you have to own that too. It is incumbent upon an intellectually honest reader to first seek a plausible interpretation under which a statement is indeed correct. Some people have a tendency to stop at the first interpretation under which a statement is wrong; that is insufficient. Bickering over interpretations is less useful than understanding another's meaning -- that is how we grow.