Comment by falcor84
8 hours ago
I just saw that Terminal Bench introduced a new evaluation approach based on the new Harbor tool [0], and was surprised to see that it completely reshuffled the leaderboard: the top 4 places are now held by variants of gpt-5, whereas on terminal-bench@1.0 you had to scroll down to 7th place to find gpt-5.
Does anyone here have insight into whether this genuinely reflects capabilities better? I'm asking because, last I checked, Codex+gpt-5 significantly underperformed Claude Code for my use case.