Comment by jjani

5 days ago

That's only viable if the quality of the outputs can be automatically graded, reliably. GP's case sounds like one where that's probably possible, but for lots of specific tasks that isn't feasible, including the other ones he names:

> write poetry, give me advice on cooking, or translate to German

Certainly, in those cases one needs to be clever and design an evaluation framework that will grade based on soft criteria, or maybe use user feedback. Still, over time a good train-test database should be built and leveraging dspy will do improvements even in those cases.