Comment by w-m
18 hours ago
A clearly defined/testable long-horizon task: demonstrating the capability of planning and executing projects that overrun current llm's context windows by several orders of magnitude.
Single-issue coding benchmarks are getting saturated, and I'm wondering when we'll get to a point where coding agents will be able to tackle some long-running projects. Greenfield projects are hard to benchmark. So creating code or porting code from one language to another for an established project with a good test suite should make for an interesting benchmark, no?
No comments yet
Contribute on Hacker News ↗