Comment by atherton33

4 months ago

I agree with you about what's described here.

There is engineering when this is done seriously, though.

Build a test set and design metrics for it. Measure rigorously after any change to the system, including the model, inference parameters, context, prompt text, etc. Use real statistical tests and adjust for multiple comparisons as appropriate. Monitor whether the assumptions you made during initial prompt design still hold, and alert on unexpected changes.
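
A minimal sketch of what the measurement step could look like, assuming each test case is graded pass/fail and every prompt variant is run against the same test set. The exact McNemar test and the Holm-Bonferroni correction are my choices for illustration, not something the article prescribes, and the function names and data below are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar test on paired pass/fail results.

    baseline[i] and candidate[i] are booleans for the same test case.
    """
    # Only discordant pairs (one variant passes, the other fails) are informative.
    b = sum(1 for x, y in zip(baseline, candidate) if x and not y)
    c = sum(1 for x, y in zip(baseline, candidate) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no disagreement, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(p_values, alpha=0.05):
    """Holm-Bonferroni step-down; returns reject decisions in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # this and every larger p-value is not rejected
        reject[i] = True
    return reject


# Illustrative data only: 20 test cases, one baseline and two candidate prompts.
baseline = [True] * 10 + [False] * 10
candidates = {
    "prompt_v2": [True] * 20,  # fixes all 10 baseline failures
    "prompt_v3": list(baseline),  # behaves identically to the baseline
}
p_values = [mcnemar_exact(baseline, results) for results in candidates.values()]
decisions = dict(zip(candidates, holm(p_values)))
```

Pairing the results per test case matters here: comparing two overall pass rates as if they were independent samples throws away the fact that both variants saw the same inputs.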

I'm surprised to see none of that advice in the article.

ryoshu 4 months ago

This article talks about prompt evals https://www.anthropic.com/engineering/writing-tools-for-agen.... There are plenty of approaches to provide some degree of rigor around the slot machine output.