Comment by elzbardico

3 days ago

Even SOTA models, when used as agents on simple NLP tasks such as text classification, still fail more often than is acceptable when evaluated against a realistic evaluation dataset with sufficient example variety and some adversarial prompts included.
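To make that concrete, here is a minimal sketch of the kind of eval harness I mean: labeled examples with an adversarial slice, scored separately so the failure mode shows up. The `classify` function is a hypothetical stand-in (a keyword heuristic here) for whatever model call you'd actually make.

```python
def classify(text: str) -> str:
    # Hypothetical stand-in for a model call; a real harness would query an LLM.
    return "spam" if "free money" in text.lower() else "ham"

# Labeled examples, with an `adversarial` flag so results can be sliced.
dataset = [
    {"text": "Claim your free money now!", "label": "spam", "adversarial": False},
    {"text": "Lunch at noon?", "label": "ham", "adversarial": False},
    # Adversarial: same intent, phrased to dodge the naive heuristic.
    {"text": "F r e e   m o n e y, no strings attached", "label": "spam", "adversarial": True},
]

def evaluate(data):
    # Accuracy per slice: (correct, total) tallied separately for
    # regular and adversarial examples.
    slices = {}
    for ex in data:
        key = "adversarial" if ex["adversarial"] else "regular"
        correct, total = slices.get(key, (0, 0))
        hit = classify(ex["text"]) == ex["label"]
        slices[key] = (correct + int(hit), total + 1)
    return {k: c / t for k, (c, t) in slices.items()}

print(evaluate(dataset))  # → {'regular': 1.0, 'adversarial': 0.0}
```

The point of the slicing: an aggregate accuracy number hides exactly the adversarial failures that bite in production.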

Improving such use cases is mostly an artisanal endeavor: sometimes a few-shot prompt improves things, sometimes it improves things at the expense of effectively overfitting to the examples, sometimes structured reasoning works, sometimes it doesn't, or sometimes it works and then latency and token usage explode, etc.

And yet a lot of teams don't see this problem because they don't care much about evaluations, and will only find these issues in production a few months after deployment.

Are those who care about evaluation luddites?