Comment by rybosome

6 days ago

I think the point the author misses is that many applications of fine-tuning are to get a model to do a single task. This is what I have done in my current role at my company.

We’ve fine-tuned open-weight models for knowledge injection, among other things, and gotten a model that’s better than OpenAI’s models at exactly one hyper-specific task for our use case, which is hardware verification. Alternatively, we’ve fine-tuned the OAI models themselves, gotten significantly better OAI models at this task, and then only use them for this task.

The point is that a network of hyper-specific fine-tuned models is how a lot of stuff is implemented. So I disagree from direct experience with the premise that fine-tuning is a waste of time because it is destructive.

I don’t care if I “damage” Llama so that it can’t write poetry, give me advice on cooking, or translate to German. In this instance I’m only ever going to prompt it with: “Does this design implement the AXI protocol? <list of ports and parameters>”

> I think the point the author misses...

It looked to me like the author did know that. The title only says "Fine-tuning", but immediately in the article he talks about fine-tuning for knowledge injection, in order to "ensure that their systems were always updated with new information".

Fine-tuning to help it not make the stupid mistake that it makes 10% of the time no matter what instructions you give it is a completely different use case.

Cost, latency, and performance are huge reasons why my company chooses to fine-tune models. We start with using a base model for a task and, as our traffic grows, we tune a smaller model, resulting in huge performance and cost savings.
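
One common way to do that is to distill from the larger model's logged traffic. A minimal sketch, where load_logged_traffic() is a hypothetical stand-in for whatever request/response logging you already have:

    import json

    def load_logged_traffic():
        # Hypothetical: yield (prompt, large_model_response) pairs from production logs.
        yield ("example prompt from traffic", "the large model's response")

    # Write OpenAI-style chat records; most SFT stacks accept something similar.
    with open("distill_train.jsonl", "w") as f:
        for prompt, response in load_logged_traffic():
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record) + "\n")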

The author makes it explicit that they're talking about finetuning "for Knowledge Injection". They give a quote that claims finetuning is still useful for things like following a specific style, formatting, etc. The title they chose could have been a bit more specific and less aphoristic.

Where finetuning makes less sense is doing it merely to get a model up to date with changes in some library, to teach it a new library it did not know, or, even worse, your codebase. I think this is what OP is talking about.

Exactly. I want the LLM to be able to respond to our customers’ questions accurately and/or generate proper syntax for our query language.

The whole point of base models is to be general purpose, and of fine-tuned models to be tuned for specific tasks on top of a base model.

  • Just to be clear, unless I'm misinterpreting this chain of comments, you do not want to fine-tune for information retrieval. FT is for skill enhancement. For information retrieval you want at least one of the over 100 implementations of RAG out there now.

    • Tool calling is a form of RAG, among others. This is where MCP is really starting to move things forward.

What is your (company's) motivation behind using non-deterministic tools for "verification" instead of actually verifying designs using formal methods?

Let me preface by saying I'm not skeptical about your answer or think you're full of crap. Can you give me an example or two about a single task that you fine-tune for? Just trying to familiarize myself with more AI engineering tasks.

  • Yep!

    So my use case currently is admittedly very specific. My company uses LLMs to automate hardware design, which is a skill that most LLMs are very poor at due to the dearth of training data.

    For tasks which involve generation of code or other non-natural language output, we’ve found that fine-tuning with the right dataset can lift performance rapidly and decisively.

    An example task is taking in potentially syntactically incorrect HDL (Hardware Description Language) code and fixing the syntax issues. Fine-tuning boosted corrective performance significantly.
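
    To make that concrete, one SFT record for the syntax-repair task might look roughly like the sketch below (the Verilog snippet and file name are illustrative, not our real data, and the chat-style record is just one common SFT layout):

        import json

        # Illustrative pair: the "broken" side is missing semicolons, the
        # "fixed" side is the corrected module.
        broken = ("module adder(input a, input b, output sum)\n"
                  "  assign sum = a + b\n"
                  "endmodule")
        fixed = ("module adder(input a, input b, output sum);\n"
                 "  assign sum = a + b;\n"
                 "endmodule")

        record = {"messages": [
            {"role": "user", "content": "Fix any syntax errors in this Verilog:\n\n" + broken},
            {"role": "assistant", "content": fixed},
        ]}

        # Append to a JSONL training file (file name is hypothetical).
        with open("hdl_syntax_sft.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")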

  • I used fine-tuning back in the day because GPT 3.5 struggled with the concept of determining if two sentences were equivalent or not. This was for grading language learning drills. It was a single skill for a specific task and I had lots of example data from thousands of spaced repetition quiz sessions. The base model struggled with the vague concept of “close enough” equivalence. Since that time, the state of the art has advanced to the point that I don’t need it anymore. I could probably do it to save some money but I’m pretty happy with GPT 4.1.
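
    For reference, the shape of that check as a single prompt today is roughly the sketch below (the wording and model choice are illustrative, not the exact grader I ran):

        from openai import OpenAI

        client = OpenAI()

        def close_enough(expected: str, student_answer: str) -> bool:
            # Ask the model for a strict yes/no judgement on "close enough" equivalence.
            resp = client.chat.completions.create(
                model="gpt-4.1",
                messages=[
                    {"role": "system", "content":
                        "You grade language-learning drills. Reply 'yes' if the "
                        "student's answer means the same as the expected answer, "
                        "otherwise reply 'no'."},
                    {"role": "user", "content":
                        f"Expected: {expected}\nStudent: {student_answer}"},
                ],
            )
            return resp.choices[0].message.content.strip().lower().startswith("yes")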

  • Any classification task. For example in search ranking, does a document contain the answer to this question?
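
    As a sketch, that kind of relevance check can be phrased so the model (base or fine-tuned) only ever emits "yes" or "no"; the prompt wording here is illustrative:

        def relevance_messages(question: str, document: str) -> list[dict]:
            # Binary classification framing: the same format doubles as SFT data
            # if you later decide to fine-tune for the task.
            return [
                {"role": "system", "content": "Answer only 'yes' or 'no'."},
                {"role": "user", "content":
                    f"Question: {question}\n\nDocument:\n{document}\n\n"
                    "Does the document contain the answer to the question?"},
            ]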

> hardware verification

Could you give any rough details? I'm in this world, and have only experienced rigid/deterministic bounds for hardware, ideally based on "guaranteed by design" based models. The need for determinism has always prevented AI from being a part of it.

In this case, for doing specific tasks, it makes much more sense to optimize the prompts and the whole flow with DSPy, instead of just fine tuning for each task.
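
A DSPy flow for one of these tasks looks roughly like the sketch below (API details vary across DSPy versions, and the metric, trainset, and model name are toy placeholders):

    import dspy
    from dspy.teleprompt import BootstrapFewShot

    # Point DSPy at whatever backend you use; the model name is a placeholder.
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    class FixHDL(dspy.Signature):
        """Fix the syntax errors in the given HDL snippet."""
        broken_hdl: str = dspy.InputField()
        fixed_hdl: str = dspy.OutputField()

    fixer = dspy.ChainOfThought(FixHDL)

    def metric(example, prediction, trace=None):
        # Toy check; in practice you would run a real parser or linter here.
        return "endmodule" in prediction.fixed_hdl

    trainset = [
        dspy.Example(
            broken_hdl="module m(input a, output y)\n  assign y = a\nendmodule",
            fixed_hdl="module m(input a, output y);\n  assign y = a;\nendmodule",
        ).with_inputs("broken_hdl"),
    ]

    # The optimizer searches over prompts and demonstrations rather than weights.
    optimized_fixer = BootstrapFewShot(metric=metric).compile(fixer, trainset=trainset)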

  • It's not either/or. Generally you finetune when optimized many-shot still doesn't hit your desired quality bar. And it turns out with RL, things like system prompts matter a lot, so searching over prompts is a good idea even when reinforcing the desirable circuits.

    • I am not an expert in fine-tuning, but at the company I work for, our fine-tuned model didn't make any noticeable difference.

  • A wonderful approach generally and something we also do to some extent, but not a substitute for fine-tuning in our case.

    We are working in a domain where there is very limited training data, so what we really want is continued pre-training over a larger dataset. Absent that, fine-tuning is highly effective for non-NLP tasks.

  • That's only viable if the quality of the outputs can be automatically graded, reliably. GP's case sounds like one where that's probably possible, but for lots of specific tasks that isn't feasible, including the other ones he names:

    > write poetry, give me advice on cooking, or translate to German

    • Certainly, in those cases one needs to be clever and design an evaluation framework that will grade based on soft criteria, or maybe use user feedback. Still, over time a good train-test database should be built, and leveraging DSPy will yield improvements even in those cases.
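
      As a sketch, a judge-based metric for soft criteria can be as simple as the snippet below (the rubric, model name, and 1-5 scale are placeholders, not a recommendation):

          from openai import OpenAI

          client = OpenAI()

          def judge_score(task_input: str, model_output: str) -> float:
              # Ask a second model to grade on soft criteria and return a 0-1 score
              # that an optimizer or eval harness can maximize.
              rubric = ("Rate the response from 1 (poor) to 5 (excellent) for "
                        "accuracy and tone. Reply with the number only.")
              resp = client.chat.completions.create(
                  model="gpt-4.1-mini",
                  messages=[
                      {"role": "system", "content": rubric},
                      {"role": "user", "content":
                          f"Task:\n{task_input}\n\nResponse:\n{model_output}"},
                  ],
              )
              try:
                  return float(resp.choices[0].message.content.strip()) / 5.0
              except ValueError:
                  return 0.0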

Interestingly, the author mentions LoRA as a "special" way of fine-tuning that is not destructive. Have you considered it, or did you opt for more direct fine-tuning?

  • It's not special, and fine-tuning a foundation model isn't destructive when you have checkpoints. LoRA allows you to approximate the end result of a fine-tune while saving memory.

  • Haven’t tried it personally, as this was a use case where classic SFT was effective for what we wanted and none of us had done LoRA before.

    Really interested in the idea though! The dream is that you have your big, general base model, then a bunch of LoRA weights for each task you’ve tuned on, where you can load/unload just the changed weights and swap the models out super fast on the fly for different tasks.
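
    That pattern is already workable with the peft library; a rough sketch, with hypothetical adapter paths, is:

        from transformers import AutoModelForCausalLM
        from peft import PeftModel

        # One shared base model, multiple task-specific LoRA adapters on disk
        # (the adapter directories below are hypothetical).
        base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
        model = PeftModel.from_pretrained(base, "adapters/syntax-fixer",
                                          adapter_name="syntax_fixer")
        model.load_adapter("adapters/protocol-checker", adapter_name="protocol_checker")

        model.set_adapter("syntax_fixer")       # route a syntax-repair request
        # ... generate ...
        model.set_adapter("protocol_checker")   # swap tasks without reloading the base

    Serving stacks like vLLM support the same idea, routing requests to different adapters over one copy of the base weights.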

You do you, and if it works I’m not going to argue with your results, but for others: finetuning is the wrong tool for knowledge injection compared to a well-designed RAG pipeline.

Finetuning is good for, like you said, doing things a particular way, but that’s not the same thing as being good at knowledge injection and shouldn’t be considered as such.

It’s also much easier to prevent a RAG pipeline from generating hallucinated responses. You cannot finetune that out of a model.
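
For anyone who hasn't built one, the basic shape of a RAG pipeline is roughly the sketch below (the embedding function is a toy stand-in for a real embedding model and vector store):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Toy stand-in: a real pipeline calls an embedding model here.
        vec = np.zeros(256)
        for tok in text.lower().split():
            vec[hash(tok) % 256] += 1.0
        return vec / (np.linalg.norm(vec) or 1.0)

    def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
        # Rank documents by similarity to the query and keep the top k.
        q = embed(query)
        return sorted(docs, key=lambda d: float(np.dot(embed(d), q)), reverse=True)[:k]

    def grounded_prompt(query: str, docs: list[str]) -> list[dict]:
        context = "\n\n".join(retrieve(query, docs))
        return [
            # This instruction is the hallucination control you can't fine-tune in.
            {"role": "system", "content":
                "Answer using ONLY the provided context. If the answer is not in "
                "the context, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ]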