Comment by mdasen
21 hours ago
It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.
Is there a leaderboard out there comparing harness results using the same models?
We probably want to compare the Cartesian product of model+harness.
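A minimal sketch of what that grid could look like (everything below is a placeholder: the model names, harness names, and run_benchmark are all hypothetical):

    from itertools import product

    # Placeholder names; swap in real models and harnesses.
    models = ["model-a", "model-b"]
    harnesses = ["harness-x", "harness-y"]

    def run_benchmark(model: str, harness: str) -> float:
        """Hypothetical: run a fixed task suite with this
        model+harness pair and return its pass rate."""
        return 0.0  # stub

    # Score every (model, harness) pair on the same task suite,
    # so harness effects can be separated from model effects.
    scores = {(m, h): run_benchmark(m, h)
              for m, h in product(models, harnesses)}

    # A leaderboard would then render this as a models-by-harnesses grid.
    for (m, h), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{m} + {h}: {s:.1%}")

Even a small grid like this would show how much of a headline score comes from the model and how much from the harness.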
The most-cited one is Terminal-Bench 2.0 [0], but it's also plagued by cheating accusations and benchmaxxing.
Somewhat remarkably, Claude Code ranks last for Opus 4.6, which may say something about Claude Code, or say something about the benchmark.
[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0
Maybe the future isn't a human-like centralized intelligence but an octopus-like decentralized one, where more of the focus goes into making the harness itself "smart".
That would be counter to AI company goals. They want the harness to be dumb and the models to be smart, so they can sell models.
Not really. Anthropic, for example, sells both the harness and the models as a unified kit via Claude Code. It's in their best interest to make sure both parts work as well as possible, for instance by using reinforcement learning on previous usage to improve new models' performance.
https://en.wikipedia.org/wiki/Bitter_lesson
History indicates you can't tool-and-harness your way to competing effectively against a smarter model with more compute.
Isn't that what terminal-bench does?
I really wish there were! I even thought of creating one, but it would be a conflict of interest.
In my local tests over the past few months on the same local model, I've found Claude Code to be way better than OpenCode, and OpenCode to be better than Codex.