Comment by atherton33

1 day ago

I agree with you about what's described here.

There is engineering when this is done seriously, though.

Build a test set and design metrics for it. Measure rigorously after every change to the system: the model, inference parameters, context, prompt text, and so on. Use real statistical tests and adjust for multiple comparisons as appropriate. Monitor whether the assumptions you made during initial prompt design still hold over time, and alert on unexpected changes.
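As a rough illustration of the "real statistical tests" and "multiple comparisons" part, here is a minimal sketch. It assumes each test case is scored pass/fail and that the baseline and the changed system are run on the same test set. The choice of an exact McNemar test and a Holm-Bonferroni correction is one reasonable option picked for this example, not the only valid one, and the function names `mcnemar_exact` and `holm_adjust` are made up for illustration.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar p-value for paired pass/fail results.

    `baseline` and `candidate` are equal-length sequences of booleans,
    one entry per test case, in the same order.
    """
    # Only cases where the two systems disagree carry information.
    b = sum(1 for x, y in zip(baseline, candidate) if x and not y)
    c = sum(1 for x, y in zip(baseline, candidate) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running max keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    # Hypothetical results: one baseline and two prompt variants on 10 cases.
    baseline = [False] * 8 + [True] * 2
    variant_a = [True] * 10
    variant_b = [False] * 7 + [True] * 3
    raw = [mcnemar_exact(baseline, v) for v in (variant_a, variant_b)]
    print("raw p-values:     ", raw)
    print("Holm-adjusted:    ", holm_adjust(raw))
```

The paired test matters because both systems see the same inputs, so comparing raw pass rates as if they were independent samples throws away information. The correction matters because trying many prompt variants against one test set will eventually produce a "significant" win by chance.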

I'm surprised to see none of that advice in the article.