Comment by dandelionv1bes
4 days ago
Something I’ve been thinking about is how, as end-stage users (e.g., building our own “thing” on top of an LLM), we can broadly verify it’s doing what we need without benchmarks. Does a set of custom evals built out over time solve this? Is there more we can do?
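One way to make the “custom evals built out over time” idea concrete is a small regression suite of (prompt, check) pairs that accumulates as you hit real failures, and that you re-run whenever the model or prompt changes. Below is a minimal sketch, assuming a Python application; `call_llm`, `EvalCase`, and the two example cases are hypothetical stand-ins for whatever your app actually calls and cares about, not any particular library's API.

```python
from dataclasses import dataclass
from typing import Callable

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with your application's actual model call.
    raise NotImplementedError

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable

# Cases accumulate over time, typically from failures you observed in real usage.
CASES = [
    EvalCase(
        name="declines_invented_citation",
        prompt="Cite the 2019 paper by Smith et al. on frobnication.",
        check=lambda out: "could not find" in out.lower() or "not aware" in out.lower(),
    ),
    EvalCase(
        name="extracts_total",
        prompt="Invoice: 3 widgets at $4 each. What is the total? Answer with a number.",
        check=lambda out: "12" in out,
    ),
]

def run_suite() -> None:
    # Re-run the whole suite on every model or prompt change to catch regressions.
    failures = [case.name for case in CASES if not case.check(call_llm(case.prompt))]
    print(f"{len(CASES) - len(failures)}/{len(CASES)} passed")
    if failures:
        print("failed:", ", ".join(failures))

if __name__ == "__main__":
    run_suite()
```

This doesn't prove correctness the way a benchmark claims to; it just pins down behaviors you've decided you need, so drift shows up as a failing case rather than a surprise in production.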