Comment by falcor84

9 hours ago

That looks impressive, but some of these numbers are a bit out of date.

On Terminal-Bench 2, for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.

What's more impressive is that I find Gemini 2.5 still relevant in day-to-day usage, despite it scoring so much lower on those benchmarks than Claude 4.5 and GPT-5.1. There's something Gemini has that makes it a great model in real cases; I'd call it generalisation over its context, or something like that. If you give it the proper context (or it digs through the files in its own agent mode), it comes up with great solutions, even if their own coding tool is hit and miss sometimes.

I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much, you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.

  • I've noticed that too. I suspect it has broader general knowledge than the others, because Google presumably has the broadest training set.

  • That's a different model, not in the chart. They're not going to include hundreds of fine-tunes in a chart like this.

  • It's also worth pointing out that comparing a fine-tune to a base model is not apples-to-apples. For example, I have to imagine that the Codex fine-tune of 5.1 is measurably worse at non-coding tasks than the 5.1 base model.

    This chart (comparing base models to base models) probably gives a better idea of the total strength of each model.

  • It's not just one of many fine-tunes; it's the default model used by OpenAI's official tools.