Comment by onlyrealcuzzo

22 days ago

> We are working on agent-level evals, but those are unfortunately much harder to get right.

It's unfortunately a nearly impossible task: the models change regularly (without letting you know), so you have a moving (invisible) target that's 1) hard to test exhaustively, and 2) very expensive to test to a low margin of error.
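To put a rough number on the "very expensive" part, here is a back-of-the-envelope sketch. It assumes each agent run is an independent pass/fail trial, uses the standard normal-approximation sample size formula for a proportion, and assumes the model stays fixed for the whole test (which, per the above, you can't actually count on). The function name `runs_needed` and the defaults are illustrative, not from any eval framework.

```python
import math

def runs_needed(margin, pass_rate=0.5, z=1.96):
    """Approximate number of independent pass/fail runs needed to estimate
    a pass rate to within +/- margin (normal approximation).

    z=1.96 corresponds to roughly 95% confidence; pass_rate=0.5 is the
    worst case (largest variance). Both are assumptions for illustration.
    """
    return math.ceil(z**2 * pass_rate * (1 - pass_rate) / margin**2)

if __name__ == "__main__":
    for margin in (0.05, 0.025, 0.01):
        print(f"+/-{margin:.3f}: {runs_needed(margin)} runs")
```

Under these assumptions you need a few hundred runs for ±5 points and on the order of ten thousand for ±1 point. The count scales with 1/margin², so halving the margin of error roughly quadruples the number of runs, and every one of those is a full multi-step agent trajectory you pay for, per model version.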

This is why no one does it, and people just make broad, sweeping, unverified claims instead.

If you figure out how to do it... You should probably just get a job at Anthropic or OpenAI and make $2M+ per year...