
Comment by BenGosub

6 days ago

In this case, for specific tasks, it makes much more sense to optimize the prompts and the whole flow with DSPy instead of just fine-tuning for each task.
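
For anyone who hasn't used it, here is a minimal sketch of what that looks like. The task, model name, examples, and metric are placeholders, and the exact API can differ between DSPy versions; heavier optimizers like MIPROv2 also search over the instructions themselves, not just the demos.

```python
import dspy

# Placeholder model; anything dspy.LM supports works here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ExtractOrderId(dspy.Signature):
    """Extract the order ID mentioned in a support email."""
    email = dspy.InputField()
    order_id = dspy.OutputField()

program = dspy.ChainOfThought(ExtractOrderId)

# A couple of labeled examples stand in for a real trainset.
trainset = [
    dspy.Example(email="Hi, order #A1234 never arrived.",
                 order_id="A1234").with_inputs("email"),
    dspy.Example(email="Refund request for order B9876 please.",
                 order_id="B9876").with_inputs("email"),
]

def exact_match(example, pred, trace=None):
    # Simple automatic grader for this toy task.
    return example.order_id.strip() == pred.order_id.strip()

# Bootstraps few-shot demos that pass the metric and bakes them into the prompt.
optimizer = dspy.BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
optimized = optimizer.compile(program, trainset=trainset)
```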

It's not either/or. Generally you fine-tune when an optimized many-shot prompt still doesn't hit your desired quality bar. And it turns out that with RL, things like system prompts matter a lot, so searching over prompts is a good idea even when reinforcing the desirable circuits.

  • I am not an expert in fine-tuning, but at the company I work for, our fine-tuned model didn't make any noticeable difference.

A wonderful approach generally and something we also do to some extent, but not a substitute for fine-tuning in our case.

We are working in a domain where there is very limited training data, so what we really want is continued pre-training over a larger dataset. Absent that, fine-tuning is highly effective for non-NLP tasks.

That's only viable if the quality of the outputs can be graded automatically and reliably. GP's case sounds like one where that's probably possible, but for lots of specific tasks that isn't feasible, including the other ones he names:

> write poetry, give me advice on cooking, or translate to German

  • Certainly, in those cases one needs to be clever and design an evaluation framework that grades based on soft criteria, or maybe use user feedback. Still, over time a good train/test dataset should be built, and leveraging DSPy will yield improvements even in those cases.
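
One way to sketch that soft-criteria grading is an LLM-as-judge metric that plugs into DSPy's evaluation and optimizers. The signature, rating scale, and acceptance threshold below are made up for illustration, and the API may differ across versions.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder judge model

class JudgeCookingAdvice(dspy.Signature):
    """Rate the cooking advice for helpfulness, safety, and specificity on a 1-5 scale."""
    question = dspy.InputField()
    advice = dspy.InputField()
    rating = dspy.OutputField(desc="a single integer from 1 to 5")

judge = dspy.ChainOfThought(JudgeCookingAdvice)

def soft_metric(example, pred, trace=None):
    """Soft-criteria grading: accept outputs the judge rates 4 or higher."""
    verdict = judge(question=example.question, advice=pred.advice)
    try:
        return int(verdict.rating.strip()) >= 4
    except (ValueError, AttributeError):
        return False

# The same metric plugs into evaluation or any prompt optimizer, e.g.:
# dspy.Evaluate(devset=devset, metric=soft_metric)(program)
```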