Comment by deaux

16 hours ago

1. Good point, I didn't know about the timeouts; that's rough for the benchmarks. Though IMO they don't necessarily need to be "SWE-official" to have value, if the only difference is disabling those.

3. Maybe you could instead provide a measure of output tokens used (including thinking), as that's a reasonable proxy for speed. I guess input tokens would be similar, unless the AST usage, hashes, etc. increase them a lot? Seems unlikely.