Comment by rishramanathan
2 years ago
Thanks! We’ve broken our evals down into three primary categories — integrity, consistency and performance.
Integrity tests tackle data quality issues (e.g. no PII in input data, no duplicate rows, schema checks on specific fields).
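(For concreteness, here is roughly what checks like these look like if you wrote them by hand in pandas. This is a generic sketch with made-up column names and data, not our API.)

```python
import re
import pandas as pd

# Toy input data; the column names are made up for illustration.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "email": ["a@example.com", None, None, "b@example.com"],
    "text": ["hello", "world", "world", "call me at 555-123-4567"],
})

# Duplicate-row check.
print("duplicate rows:", int(df.duplicated().sum()))

# Schema check on a specific field: ids must be non-null integers.
print("id is integer dtype:", pd.api.types.is_integer_dtype(df["id"]))
print("id has nulls:", bool(df["id"].isna().any()))

# Crude PII screen: flag free text that looks like an email address
# or a US-style phone number.
pii = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
print("rows with possible PII:", df.index[df["text"].str.contains(pii)].tolist())
```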
Consistency tests help ensure your fine-tuning & validation datasets are well constructed in relation to one another (e.g. don’t have overlap, are sized correctly), and your production data doesn’t drift from your reference data.
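(Again, a hand-rolled sketch with toy data rather than our API: overlap and sizing via pandas, drift via a two-sample Kolmogorov-Smirnov test from scipy.)

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical splits; in practice these would be your real datasets.
train = pd.DataFrame({"text": ["a", "b", "c", "d"], "length": [10, 12, 9, 11]})
val = pd.DataFrame({"text": ["c", "e"], "length": [9, 30]})
prod = pd.DataFrame({"length": [25, 28, 31, 27]})

# Overlap: rows that appear in both the fine-tuning and validation data.
overlap = pd.merge(train, val, how="inner")
print("overlapping rows:", len(overlap))

# Relative sizing: warn if the validation split is tiny or huge.
print("val/train size ratio:", len(val) / len(train))

# Drift: compare a production feature against the reference (training)
# distribution with a two-sample KS test.
res = ks_2samp(train["length"], prod["length"])
print("KS statistic:", res.statistic, "p-value:", round(res.pvalue, 4))
```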
Performance tests focus on your model outputs, and measure common metrics for each task (e.g. accuracy, F1, precision/recall for classification) as well as custom metrics designed to be evaluated by an LLM (e.g. “make sure these outputs don’t contain profanity”). You can apply these metrics to specific subpopulations of your data by setting filters on your input fields.
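(A generic scikit-learn sketch of the subpopulation idea, with made-up columns and a made-up filter; the mechanics are roughly the same.)

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up classification results; "region" stands in for any input field
# you might filter on.
results = pd.DataFrame({
    "region": ["us", "us", "eu", "eu", "eu"],
    "label":  [1, 0, 1, 1, 0],
    "pred":   [1, 0, 0, 1, 1],
})

def report(df):
    precision, recall, f1, _ = precision_recall_fscore_support(
        df["label"], df["pred"], average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(df["label"], df["pred"]),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

print("overall:", report(results))

# Same metrics on a subpopulation, selected by a filter on an input field.
print("eu only:", report(results[results["region"] == "eu"]))
```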
Re: adding your own evals — yes, you can! The evals are not statically defined — they are flexible structures that allow you to customize them to your needs.
Re: importing evaluations from other libraries — this is something we’re expanding support for. We’ve just added an integration with Great Expectations, and could add one for OpenAI’s evals if that’s something the community is interested in.
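(For anyone who hasn’t used Great Expectations, here is a tiny standalone example of the kind of expectations it provides, using its older pandas-dataset API; newer releases structure this differently. This is independent of our integration.)

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so it exposes expect_* methods.
df = ge.from_pandas(pd.DataFrame({
    "id": [1, 2, 3],
    "label": ["spam", "ham", "ham"],
}))

# A couple of expectations that map naturally onto integrity checks.
print(df.expect_column_values_to_be_unique("id"))
print(df.expect_column_values_to_be_in_set("label", ["spam", "ham"]))
```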