Comment by mynti

4 hours ago

It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding

Their scores on SWE Bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on Terminal Bench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of Python).

I think Anthropic is reading the room and is just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full-blown multimodality at the highest level.

It's probably pretty liberating, because you can make a "spiky" intelligence with only one spike to really focus on.

  • More playing to their strengths: a giant chunk of their usage data is basically code gen.

From my personal experience using the CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages but is expensive (at least for some, if not for all).

My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.

I do, however, use Gemini CLI for the most part just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D

  • Gemini CLI is moving really fast. Noticeable improvements in features and functionality every week.

It also does not beat GPT-5.1 Codex on Terminal Bench (57.8% vs. 54.2%): https://www.tbench.ai/

I did not bother verifying the other claims.

  • Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to be on a standard eval harness.

    It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.

    • All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results use Gemini CLI.

      What do you mean by "standard eval harness"?

IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.

  • This seems preferable to wasting tokens on tools, when a standardized, reliable interface to those tools should be all that's required.

    The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.

This might also hint at SWE-Bench struggling to capture what “being good at coding” means.

Evals are hard.

  • > This might also hint at SWE-Bench struggling to capture what “being good at coding” means.

    My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.

I think Google probably cares more about a strong generalist model rather than solely optimizing for coding.

[comment removed]

  • The reported results where GPT 5.1 beats Gemini 3 are on SWE Bench Verified, and GPT 5.1 Codex also beats Gemini 3 on Terminal Bench.

    • You're right on SWE Bench Verified, I missed that and I'll delete my comment.

      GPT 5.1 Codex beats Gemini 3 on Terminal Bench specifically on Codex CLI, but that's apples-to-oranges (hard to tell how much of that is a Codex-specific harness vs model). Look forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins given how close it comes in these benchmarks.


Never got good code out of Sonnet. It's been Gemini 2.5 for me followed by GPT-5.x.

Gemini is very good at pointing out flaws that are subtle and not noticeable at first or second glance.

It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.

  • I find Gemini 2.5 Pro to be as good as, or in some cases better than, GPT 5.1 for SQL. It's aging otherwise, but they must have some good SQL datasets in there for training.