Comment by sshine
2 hours ago
I second this; even when switching between minor versions of a model, you need to adjust prompts: the new model is better at inferring a bunch of things on its own, so when those things are still spelled out in the prompt, it will overdo them.
Assessing the quality of output is often not trivial, either. Typically, problems that are solved by offloading something to an LLM are super subjective, which leaves you vulnerable when customers “feel” that something is different.
We try to quantify output differences with many different similarity metrics. But a lot of energy still goes into subjectively evaluating whether something still works.
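A minimal sketch of what such a similarity check could look like, assuming plain-text outputs and only Python's standard library. The function names, the choice of two metrics, and the 0.6 threshold are illustrative assumptions, not the setup described above:

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level ratio in [0, 1] from difflib's matching blocks."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercased, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty outputs count as identical
    return len(ta & tb) / len(ta | tb)


def flag_drift(pairs, threshold=0.6):
    """Return (index, scores) for old/new output pairs where any metric is below threshold.

    `threshold` is a hypothetical cutoff; it would need tuning per task.
    """
    flagged = []
    for i, (old, new) in enumerate(pairs):
        scores = {
            "char": char_similarity(old, new),
            "jaccard": token_jaccard(old, new),
        }
        if min(scores.values()) < threshold:
            flagged.append((i, scores))
    return flagged
```

Running `flag_drift` over paired outputs from the old and new model version only narrows down which cases a person should look at; a low score says the text changed, not that it got worse.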