Comment by minimaxir

5 hours ago

The focus on the speed of the agent-generated code as a measure of model quality is unusual and interesting. I've been intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality") and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast code up to 6x faster while still passing all tests.
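
For anyone who wants to try this, here is a minimal sketch of the kind of check that prompt implies: confirm the output is unchanged, then compare timings against the 1.4x target. Everything in it is a made-up illustration rather than my actual harness: `median_time`, `speedup`, `meets_target`, and the toy sum-of-squares workloads are hypothetical names, and a real project would more likely use a dedicated benchmarking crate.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Median wall-clock time of `f` over `runs` executions (hypothetical helper).
fn median_time<T>(runs: usize, mut f: impl FnMut() -> T) -> Duration {
    let mut samples: Vec<Duration> = (0..runs.max(1))
        .map(|_| {
            let start = Instant::now();
            black_box(f()); // keep the result alive so the call is not optimized away
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[samples.len() / 2]
}

/// Speedup factor: baseline time divided by candidate time.
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}

/// True if the candidate is at least `target` times faster than the baseline.
fn meets_target(baseline: Duration, candidate: Duration, target: f64) -> bool {
    speedup(baseline, candidate) >= target
}

/// Toy baseline workload (assumption for illustration): sum of squares by iteration.
fn sum_squares_baseline(n: u64) -> u64 {
    (1..=n).map(|i| i * i).sum()
}

/// Toy "optimized" workload: closed-form sum of squares.
fn sum_squares_optimized(n: u64) -> u64 {
    n * (n + 1) * (2 * n + 1) / 6
}

fn main() {
    let n: u64 = 1_000_000;

    // Regression guard first: a faster answer only counts if it is the same answer.
    assert_eq!(sum_squares_baseline(n), sum_squares_optimized(n));

    let base = median_time(11, || sum_squares_baseline(black_box(n)));
    let opt = median_time(11, || sum_squares_optimized(black_box(n)));

    println!("baseline: {:?}, optimized: {:?}", base, opt);
    println!("meets 1.4x target: {}", meets_target(base, opt, 1.4));
}
```

Putting the equality check before the timing is deliberate: it is the "no regression in output quality" half of the prompt, and it is the part an agent is most tempted to skip.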

It's a fun way to compare real-world performance across models, and one that's more practical and actionable.