Comment by w-m
19 hours ago
At the current rate of progress I'm wondering how long it will take for llm agents to be able to rewrite/translate complete projects into another language. SQLite may not be the best candidate, due to the hidden test suite. But CPython or Clang or binutils or...
The RIIR-benchmark: rewrite CPython in Rust, pass the complete test suite, no performance regressions, $100 budget. How far away are we there, a couple months? A few years? Or is it a completely ill-posed problem, due to the test suite being tied to the implementation language?
What’s the point?
A clearly defined/testable long-horizon task: demonstrating the capability of planning and executing projects that overrun current llm's context windows by several orders of magnitude.
Single-issue coding benchmarks are getting saturated, and I'm wondering when we'll get to a point where coding agents will be able to tackle some long-running projects. Greenfield projects are hard to benchmark. So creating code or porting code from one language to another for an established project with a good test suite should make for an interesting benchmark, no?