Comment by Palmik

3 months ago

It also does not beat GPT-5.1 Codex on Terminal Bench (54.2% vs. Codex's 57.8%): https://www.tbench.ai/

I did not bother verifying the other claims.

6 comments

Palmik

Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to have been run on a standard eval harness.

It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.

Palmik 3 months ago
All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results use Gemini CLI.
What do you mean by "standard eval harness"?
- lucassz 3 months ago
  
  I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI.
enraged_camel 3 months ago
Do you mean that Gemini 3 Pro is "vanilla" like GPT 5.1 (non-Codex)?
- HereBePandas 3 months ago
  
  Yes, two things:
  1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1.
  2. More importantly, GPT-5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for it. But when labs evaluate models, they have to use a standard tool to make the comparisons apples-to-apples.
  Will be interesting to see what Google releases that's coding-specific to follow Gemini 3.
  