
Comment by GodelNumbering

21 hours ago

Good points.

1. I have been trying to benchmark open-weights models but keep running into timeouts due to slow inference (terminal-bench tasks have strict timeouts that you are not allowed to modify). I posted my frustration here: https://www.reddit.com/r/LocalLLaMA/comments/1stgt39/the_fru...

2. Done (updated github readme)

3. Yes, on average the times were shorter, but I did not benchmark it rigorously because the model outputs slow down at random times, so the numbers would be noisy

4. Added info on this too

1. Good point, I didn't know about the timeouts; that's rough for the benchmarks. Though IMO they don't necessarily need to be "SWE-official" to have value, if the only difference is disabling those timeouts.

3. Maybe you could instead report the number of output tokens used (including thinking tokens), as that's a reasonable proxy for speed. I'd guess input tokens would be similar unless the AST usage and hashes etc. increase them a lot, which seems unlikely.
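The token-counting idea above could be sketched roughly like this. This is a hypothetical example, not from any actual harness: the field names (`output_tokens`, `reasoning_tokens`) and the per-task result shape are assumptions about what a benchmark run might record.

```python
# Hypothetical sketch: compare models by output-token usage per task,
# a rough speed proxy when wall-clock timing is too noisy to trust.
# Field names ("output_tokens", "reasoning_tokens") are assumed, not
# taken from any specific benchmark harness.

def total_output_tokens(task_results):
    """Sum completion tokens across tasks, counting thinking tokens too."""
    return sum(
        r.get("output_tokens", 0) + r.get("reasoning_tokens", 0)
        for r in task_results
    )

def tokens_per_task(task_results):
    """Average output tokens per task (guards against empty input)."""
    return total_output_tokens(task_results) / max(len(task_results), 1)

# Example per-task records for one model run
results = [
    {"output_tokens": 850, "reasoning_tokens": 2400},
    {"output_tokens": 1200, "reasoning_tokens": 0},
]
print(total_output_tokens(results))  # 4450
print(tokens_per_task(results))      # 2225.0
```

Comparing these totals across models sidesteps the random slowdowns, since token counts don't depend on inference speed at all.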