Comment by kouteiheika
5 days ago
> Adapter Modules and LoRA (Low-Rank Adaptation) insert new knowledge through specialized, isolated subnetworks, leaving existing neurons untouched. This is best for stuff like formatting, specific chains, etc- all of which don’t require a complete neural network update.
This highlights to me that the author doesn't know what they're talking about. LoRA does exactly the same thing as normal fine-tuning, it's just a trick to make it faster and/or be able to do it on lower end hardware. LoRA doesn't add "isolated subnetworks" - LoRA parameters are added to the original weights!
Here's the equation for the forward pass from the original paper[1]:
h = W_{0} * x + ∆W * x = W_{0} * x + B * A * x
where "W_{0}" are the original weights and "B" and "A" (which give us "∆W_{x}" after they're multiplied) are the LoRA adapter. And if you've been paying attention it should also be obvious that, mathematically, you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do, or you could even create a LoRA adapter from a fully fine-tuned model by calculating "W - W_{0}" to get ∆W and then do SVD to recover B and A.
If you know what you're doing, anything you can do with LoRA you can also do with full fine-tuning, but better. It might be true that it's somewhat harder to "damage" a model by doing LoRA (because the parameter updates are fundamentally low rank, since the LoRA adapters themselves are low rank), but that's a skill issue and not a fundamental property.
> LoRA does exactly the same thing as normal fine-tuning
You wrote "exactly", so I'm going to say "no". To clarify what I mean: LoRA seeks to accomplish a similar goal as "vanilla" fine-tuning but with a different method (freezing the existing model weights while training low-rank adapter matrices whose product is added to the original weights). LoRA isn't exactly the same mathematically either; it is a low-rank approximation (as you know).
> LoRA doesn't add "isolated subnetworks"
If you think charitably, the author is right. LoRA weights are isolated in the sense that they are separate from the base model. See e.g. https://www.vellum.ai/blog/how-we-reduced-cost-of-a-fine-tun... "The end result is we now have a small adapter that can be added to the base model to achieve high performance on the target task. Swapping only the LoRA weights instead of all parameters allows cheaper switching between tasks. Multiple customized models can be created on one GPU and swapped in and out easily."
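For what it's worth, that workflow looks roughly like this with the Hugging Face peft library (the model name and adapter paths below are placeholders, not from the linked post):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Placeholder model/adapter identifiers, for illustration only.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = PeftModel.from_pretrained(base, "adapters/task-a", adapter_name="task_a")
    model.load_adapter("adapters/task-b", adapter_name="task_b")

    model.set_adapter("task_a")   # serve task A
    model.set_adapter("task_b")   # swap to task B without touching the base weights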
> you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do
Yes, one can do that. But on what basis do you say that "most people do"? Without having collected a sample of usage myself, I would just say this: there are many good reasons not to merge (e.g. see the link above): less storage space if you have multiple adapters, and easier swapping. On the other hand, if keeping the extra adapter separate slows inference unacceptably, then merge.
> This highlights to me that the author doesn't know what they're talking about.
It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.
> You wrote "exactly", so I'm going to say "no". [...] If you think charitably, the author is right.
No, the author is objectively wrong. Let me quote the article and clarify myself:
> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.
This is just incorrect. LoRA behaves exactly like normal fine-tuning in this particular context. The author's argument is that you should use LoRA because it doesn't do any "destructive overwriting", but in that respect it's no different from normal fine-tuning.
In fact, there's evidence that LoRA can actually make the problem worse[1]:
> we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions [...] LoRA fine-tuned models with intruder dimensions are inferior to fully fine-tuned models outside the adaptation task’s distribution, despite matching accuracy in distribution.
[1] -- https://arxiv.org/pdf/2410.21228
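The check the paper does is roughly: look at the top singular vectors of the fine-tuned weight matrix and flag the ones that don't align with any singular vector of the original matrix. A rough sketch of that idea (not the authors' code; the threshold here is arbitrary):

    import torch

    def count_intruders(W0, W_ft, top_k=10, threshold=0.5):
        U0, _, _ = torch.linalg.svd(W0, full_matrices=False)
        Uf, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
        # cosine similarity of each top fine-tuned singular vector vs. all original ones
        sims = (Uf[:, :top_k].T @ U0).abs()
        return int((sims.max(dim=1).values < threshold).sum())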
To be fair, "if you don't know what you're doing then doing LoRA over normal finetuning" is, in general, a good advice in my opinion. But that's not what the article is saying.
> But on what basis do you say that "most people do"?
On the basis of seeing what the common practice is, at least in the open (in the local LLM community and in the research space).
> I would just say this: there are many good reasons to not merge
I never said that there aren't good reasons to not merge.
> It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.
No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.
If we zoom out a bit to one point he's trying to make there: while LoRA is fine-tuning, I think it's fair to call it a more modular approach than base SFT.
That said, I find the article as a whole off-putting. It doesn’t strengthen one’s claims to call things stupid or a total waste of time. It deals in absolutes, and rants in a way that misleads and foregoes nuance.
I gained a lot of perspective on LoRA here. Thanks, folks.
> No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.
I get that. So what can we do?
One option is when criticizing, write as clearly as possible. Err on the side of overexplaining. From my point of view, it took a back-and-forth for your criticism to become clear.
I'll give an example where more charity and synthesis would be welcome:
>> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.
> This is just incorrect.
"This" is rather unclear. There are many claims in the quote -- which are you saying are incorrect? Possibilities include:
1. "Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting."
Sometimes, yes. More often than not? Maybe. Categorically? I'm not sure. [1]
2. "When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects."
Yes, this can happen. Mitigations can reduce the chances.
3. "Instead, use modular methods like [...] adapters."
Your elision dropped some important context. Here's the full quote:
> Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.
This logic is sound, almost tautologically so: the original model is unchanged.
To get more specific: if one's bolted-on LoRA module destroyed some knowledge, one can take that into account and compensate. Perhaps use different LoRA modules for different subtasks, then delegate with a mixture of experts? (I haven't experimented with this particular architecture, so maybe it isn't a great example -- but even if it falls flat, this example doesn't undermine the general shape of my argument.)
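A tiny sketch of what I mean, using the peft-style adapter swapping mentioned upthread; the model name, adapter paths, and routing rule are all placeholders, and I'm not claiming this is a proven recipe:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = PeftModel.from_pretrained(base, "adapters/sql", adapter_name="sql")
    model.load_adapter("adapters/summarize", adapter_name="summarize")

    def answer(prompt):
        task = "sql" if "select" in prompt.lower() else "summarize"  # stand-in for a real router
        model.set_adapter(task)                                      # base weights stay frozen
        out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=128)
        return tok.decode(out[0], skip_special_tokens=True)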
In summary, after going sentence by sentence, I see one sentence that is dubious, but I don't think it is the same one you would point to.
[1] I don't know if this is considered a "settled" matter. Even if it was considered "settled" in ML research, that wouldn't meet my bar -- I have a relatively low opinion of ML research in general (the writing quality, the reproducibility, the experimental setups, the quality of the thinking!, the care put into understanding previous work).
Sorry to be a downer but basically every statement you’ve made above is incorrect.
> Sorry to be a downer but basically every statement you’ve made above is incorrect.
You don't need to apologize for being a "downer", but it would be better if you were specific in your criticisms.
I welcome feedback, but it has to be specific and actionable. If I'm wrong, set me straight.
This is a two-way street: if you were unfair or uncharitable or wrong, you have to own that too. It is incumbent upon an intellectually honest reader to first seek a plausible interpretation under which a statement is indeed correct. Some people have a tendency to only find one possible interpretation under which a statement is wrong. This is insufficient. Bickering over interpretations is less useful; understanding another's meaning is how we grow.
> that's a skill issue and not a fundamental property
This made me laugh.
You seem like you may know something I've been curious about.
I'm a shader author these days, haven't been a data scientist for a while, so it's going to distort my vocab.
Say you've got a trained neural network living in a 512x512 structured buffer. It's doing great, but you get a new video card with more memory, so you can afford to migrate it to a 1024x1024. Is the state-of-the-art way to retrain with the same data but bigger initial parameters, or are there other methods that smear the old weights over a larger space to get a leg up? Does anything like this accelerate training time?
... can you up sample a language model like you can lowres anime profile pictures? I wonder what the made up words would be like.
In general this is of course an active area of research, but yes, you can do something like that, and people have done it successfully[1] by adding extra layers to an existing model and then continuing to train it (rough sketch after the references below).
You have to be careful about the "same data" part though; ideally you want to train once on unique data[2], as excessive duplication can harm the performance of the model[3], although if you have limited data a couple of training epochs might be safe and might actually improve the performance of the model[4].
[1] -- https://arxiv.org/abs/2312.15166
[2] -- https://arxiv.org/abs/1906.06669
[3] -- https://arxiv.org/abs/2205.10487
[4] -- https://galactica.org/static/paper.pdf
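Here's a rough sketch of the "add extra layers" part in PyTorch; which layers you duplicate and how much you then retrain is exactly what the papers above study, so treat this as illustrative only:

    import copy
    import torch.nn as nn

    def grow_depth(blocks: nn.ModuleList, n_extra: int) -> nn.ModuleList:
        # Duplicate existing transformer blocks so the grown model starts from
        # the old model's behaviour instead of random initialization, then
        # continue (pre-)training the result.
        grown = list(blocks)
        for i in range(n_extra):
            grown.append(copy.deepcopy(blocks[-(i % len(blocks)) - 1]))
        return nn.ModuleList(grown)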
In addition to increasing the number of layers, you can also grow the weight matrices and initialize by tiling them with the smaller model's weights https://neurips.cc/media/neurips-2023/Slides/83968_5GxuY2z.p...
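Roughly, the tiling part for a single weight matrix looks like this (a toy sketch; the real recipes also rescale things so the grown network preserves the original function, which this skips):

    import torch

    def grow_by_tiling(W_small, d_out, d_in):
        # Tile copies of the small matrix, then crop to the target shape.
        reps_r = -(-d_out // W_small.shape[0])   # ceil division
        reps_c = -(-d_in // W_small.shape[1])
        return W_small.repeat(reps_r, reps_c)[:d_out, :d_in]

    W_big = grow_by_tiling(torch.randn(512, 512), 1024, 1024)  # the 512 -> 1024 case above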
Thank you for taking the time to provide me all this reading.
This might be obvious, but just to state it explicitly for everyone: you can freeze the weights of the existing layers if you want to train only the new layers and leave the originals untouched.
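In PyTorch that's just the following, with `model` and `num_original_layers` standing in for whatever you actually have:

    import torch

    # Freeze the original blocks so only the newly added layers receive gradients.
    for layer in model.layers[:num_original_layers]:
        for p in layer.parameters():
            p.requires_grad = False

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)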