Comment by zachdotai
15 hours ago
I wrote about this recently here: https://fabraix.com/blog/adversarial-cost-to-exploit
I think the core issue is static benchmarks. The community needs to move beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) to dynamic evals that more closely simulate how we evaluate humans.