Comment by w-m

11 days ago

A clearly defined, testable long-horizon task: demonstrating the capability to plan and execute projects that overrun current LLMs' context windows by several orders of magnitude.

Single-issue coding benchmarks are getting saturated, and I'm wondering when we'll get to a point where coding agents will be able to tackle long-running projects. Greenfield projects are hard to benchmark. So writing new code for an established project with a good test suite, or porting it from one language to another, should make for an interesting benchmark, no?
