Comment by falcor84

7 hours ago

I just saw that Terminal Bench introduced a new evaluation approach based on the Harbor tool [0], and I was surprised to see that it completely reshuffled the leaderboard: the top 4 places are now held by variants of gpt-5, whereas in terminal-bench@1.0 you had to scroll down to 7th place to find gpt-5.

Does anyone here have any insight into whether this genuinely reflects capabilities better? I'm asking because, last I checked, Codex+gpt-5 significantly underperformed Claude Code for my use case.

[0] https://github.com/laude-institute/harbor