Comment by stingraycharles
7 days ago
That’s probably more work than the entire repo itself. Would need to be something like SWE-bench with and without “adamsreview”.
You’re right, though; evals are actually fairly tricky to write and maintain.