Gemini 3 Pro Model Card

(pixeldrain.com)

Benchmarks from page 4 of the model card:

    | Benchmark             | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |-----------------------|-----------|---------|------------|-----------|
    | Humanity's Last Exam  | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2             | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond          | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025             |           |         |            |           |
    |   (no tools)          | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |   (code execution)    | 100%      | -       | 100%       | -         |
    | MathArena Apex        | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro              | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro        | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning     | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5      | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU            | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro     | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0    | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified    | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | t2-bench              | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2       | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified     | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                  | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA           | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle)    |           |         |            |           |
    |   (128k avg)          | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |   (1M pointwise)      | 26.3%     | 16.4%   | n/s        | n/s       |

n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly

  • Wow. They must have had some major breakthrough. Those scores are truly insane. O_O

    Models have begun to fairly thoroughly saturate "knowledge" benchmarks and such, but there are still considerable bumps there.

    But the _big news_, and the demonstration of their achievement, is the incredible set of scores they've racked up on what's necessary for agentic AI to become widely deployable: t2-bench. Visual comprehension. Computer use. Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool and into the realm where it can actually handle complex tasks in the way that businesses need in order to reap rewards from deploying AI tech.

    Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.

    And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD

    • The problem is that the benchmarks are known in advance. Take Humanity's Last Exam, for example: it's way easier to optimize your model when you have seen the questions before.

  • These numbers are impressive, to say the least. It looks like Google has produced a beast that will raise the bar even higher. What's even more impressive is how Google came into this game late and went from producing a few flops to being the leader at this (actually, they already achieved that title with 2.5 Pro).

    What makes me even more curious is the following

    > Model dependencies: This model is not a modification or a fine-tune of a prior model

    So did they start from scratch with this one?

    • Google was never really late. Where people perceived Google to have dropped the ball was in its productization of AI. Google's Bard branding stumble was so (hilariously) bad that it threw a lot of people off the scent.

      My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.

      6 replies →

    • > So did they start from scratch with this one

      Their major version number bumps are a new pre-trained model. Minor bumps are changes/improvements to post-training on the same foundation.

    • At least at the moment, coming in late seems to matter little.

      Anyone with money can trivially catch up to a state of the art model from six months ago.

      And as others have said, late is really a function of spigot, guardrails, branding, and UX, as much as it is being a laggard under the hood.

      10 replies →

    • I hope they keep the pricing similar to 2.5 Pro. I currently pay per token, and 2.5 Pro and GPT-5 are close to the sweet spot for me, while Sonnet 4.5 feels too expensive for larger changes. I've also been moving around 100M tokens per week with Cerebras Code (they moved to GLM 4.6), but the flagship models still feel better when I need help with more advanced debugging, or with an exemplary refactoring that I can then feed as an example to a dumber/faster model.

    • What does it mean nowadays to start from scratch? At least in the open scene, most of the post-training data is generated by other LLMs.

      1 reply →

  • That looks impressive, but some of the comparison numbers are a bit out of date.

    On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.

    • What's more impressive is that I find Gemini 2.5 still relevant in day-to-day usage, despite it being so low on those benchmarks compared to Claude 4.5 and GPT-5.1. There's something Gemini has that makes it a great model in real cases; I'd call it generalisation over its context or something. If you give it the proper context (or it digs through the files in its own agent) it comes up with great solutions, even if their own coding tool is hit and miss sometimes.

      I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much, you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.

  • I would love to know how much the token counts increased across these models on the benchmarks. I find the models continue to get better, but as they do, their token usage grows too. In other words: is the model doing better, or just reasoning for longer?

    • I think that is always something being worked on in parallel. The recent paradigm seems to be models dynamically deciding when they need to use more tokens (which seems very much in line with how computation should generally work).

  • Used an AI to populate some of 5.1 thinking's results.

    | Benchmark             | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1   | GPT-5.1 Thinking |
    |-----------------------|--------------|----------------|-------------------|-----------|------------------|
    | Humanity's Last Exam  | 37.5%        | 21.6%          | 13.7%             | 26.5%     | 52%              |
    | ARC-AGI-2             | 31.1%        | 4.9%           | 13.6%             | 17.6%     | 28%              |
    | GPQA Diamond          | 91.9%        | 86.4%          | 83.4%             | 88.1%     | 61%              |
    | AIME 2025             | 95.0%        | 88.0%          | 87.0%             | 94.0%     | 48%              |
    | MathArena Apex        | 23.4%        | 0.5%           | 1.6%              | 1.0%      | 82%              |
    | MMMU-Pro              | 81.0%        | 68.0%          | 68.0%             | 80.8%     | 76%              |
    | ScreenSpot-Pro        | 72.7%        | 11.4%          | 36.2%             | 3.5%      | 55%              |
    | CharXiv Reasoning     | 81.4%        | 69.6%          | 68.5%             | 69.5%     | N/A              |
    | OmniDocBench 1.5      | 0.115        | 0.145          | 0.145             | 0.147     | N/A              |
    | Video-MMMU            | 87.6%        | 83.6%          | 77.8%             | 80.4%     | N/A              |
    | LiveCodeBench Pro     | 2,439        | 1,775          | 1,418             | 2,243     | N/A              |
    | Terminal-Bench 2.0    | 54.2%        | 32.6%          | 42.8%             | 47.6%     | N/A              |
    | SWE-Bench Verified    | 76.2%        | 59.6%          | 77.2%             | 76.3%     | N/A              |
    | t2-bench              | 85.4%        | 54.9%          | 84.7%             | 80.2%     | N/A              |
    | Vending-Bench 2       | $5,478.16    | $573.64        | $3,838.74         | $1,473.43 | N/A              |
    | FACTS Benchmark Suite | 70.5%        | 63.4%          | 50.4%             | 50.8%     | N/A              |
    | SimpleQA Verified     | 72.1%        | 54.5%          | 29.3%             | 34.9%     | N/A              |
    | MMLU                  | 91.8%        | 89.5%          | 89.1%             | 91.0%     | N/A              |
    | Global PIQA           | 93.4%        | 91.5%          | 90.1%             | 90.9%     | N/A              |
    | MRCR v2 (8-needle)    | 77.0%        | 58.0%          | 47.1%             | 61.6%     | N/A              |

    • Used an AI to populate some of 5.1 thinking's results.

      | Benchmark            | Description          | Gemini 3 Pro | GPT-5.1 (Thinking) | Notes                                      |
      |----------------------|----------------------|--------------|--------------------|--------------------------------------------|
      | Humanity's Last Exam | Academic reasoning   | 37.5%        | 52%                | GPT-5.1 shows 7% gain over GPT-5's 45%     |
      | ARC-AGI-2            | Visual abstraction   | 31.1%        | 28%                | GPT-5.1 multimodal improves grid reasoning |
      | GPQA Diamond         | PhD-tier Q&A         | 91.9%        | 61%                | GPT-5.1 strong in physics (72%)            |
      | AIME 2025            | Olympiad math        | 95.0%        | 48%                | GPT-5.1 solves 7/15 proofs correctly       |
      | MathArena Apex       | Competition math     | 23.4%        | 82%                | GPT-5.1 handles 90% advanced calculus      |
      | MMMU-Pro             | Multimodal reasoning | 81.0%        | 76%                | GPT-5.1 excels at visual math (85%)        |
      | ScreenSpot-Pro       | UI understanding     | 72.7%        | 55%                | Element detection 70%, navigation 40%      |
      | CharXiv Reasoning    | Chart analysis       | 81.4%        | 69.5%              | N/A                                        |

  • Which of the LiveCodeBench Pro and SWE-Bench Verified benchmarks comes closer to everyday coding assistant tasks?

    Because it seems to lead by a decent margin on the former and trails behind on the latter

    • I do a lot of testing on SWE-Bench Verified as well. In my opinion, this benchmark is now mostly good for catching regressions on the agent side.

      However, above roughly 75% the models are likely all about the same. The remaining instances are likely underspecified, despite the effort of the authors who made the benchmark "verified". From what I have seen, these are often cases where the problem statement says to implement X for Y, but the agent simply has to guess whether to also implement the same for another case Y', which decides whether it wins or loses the instance.

    • Neither :(

      LCB Pro is LeetCode-style questions, and SWE-Bench Verified is very old, heavily benchmaxxed Python tasks.

  • This is a big jump in most benchmarks. And if it can match other models in coding while having Google's TPU inference speed and an actually native 1M context window, it's going to be a big hit.

    I hope it isn't as much of a sycophant as the current Gemini 2.5 models; that makes me doubt its output, which is maybe a good thing now that I think about it.

    • > it's over for the other labs.

      What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.

      2 replies →

    • > it's over for the other labs.

      It's not over, and never will be, for two-decade-old accounting software; it definitely will not be over for other AI labs.

      1 reply →

  • We knew it would be a big jump, and while it certainly is in many areas, it's definitely not the "groundbreaking/huge leap" some people were expecting based on these numbers.

    I feel like many will be pretty disappointed by their self-created expectations for this model when they actually end up using it and it turns out to be fairly similar to other frontier models.

    Personally I'm very interested in how they end up pricing it.

  • Looks like the best way to keep improving the models is to come up with really useful benchmarks and make them popular. ARC-AGI-2 is a big jump, I'd be curious to find out how that transfers over to everyday tasks in various fields.

  • Looks like it will be on par with the contenders when it comes to coding. I guess improvements will be incremental from here on out.

    • > I guess improvements will be incremental from here on out.

      What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark than any single human - it's unlikely that any particular human dev is knowledgeable enough to tackle the full range of diverse tasks even in the smaller SWE-Bench Verified within a reasonable time frame; to the best of my knowledge, no one has tried that.

      Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.

      3 replies →

    • If it’s on par in code quality, it would be a way better model for coding because of its huge context window.

  • Very impressive. I wonder if this sends a different signal to the market regarding using TPUs for training SOTA models versus Nvidia GPUs. From what we've seen, OpenAI is already renting them to diversify... Curious to see what happens next.

It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE-Bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding.

  • I think Anthropic is reading the room, and just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full blown multimodality at the highest level.

    It's probably pretty liberating, because you can make a "spikey" intelligence with only one spike to really focus on.

    • More playing to their strengths. A giant chunk of their usage data is basically code gen.

  • From my personal experience using the CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages but is expensive (at least for some, if not for all).

    My point is, although the model itself may have performed well in benchmarks, I feel like there are other tools that are doing better just by adopting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother.

    I do, however, use Gemini CLI for the most part just because it has a generous free quota with very few downsides compared to the others. They must be getting loads of training data :D.

    • Gemini CLI is moving really fast. Noticeable improvements in features and functionality every week.

  • IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward.

    • This seems preferable. Why waste tokens on tool use when a standardized, reliable interface to those tools should be all that's required?

      The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits.

  • Also does not beat GPT-5.1 Codex on terminal bench (57.8% vs 54.2%): https://www.tbench.ai/

    I did not bother verifying the other claims.

    • Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to have been run on a standard eval harness.

      It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI.

      3 replies →

  • This might also hint at SWE-Bench struggling to capture what "being good at coding" means.

    Evals are hard.

      > This might also hint at SWE-Bench struggling to capture what “being good at coding” means.

      My take would be that coding itself is hard, but I'm a software engineer myself so I'm biased.

  • I think Google probably cares more about a strong generalist model rather than solely optimizing for coding.

  • Never got good code out of Sonnet. It's been Gemini 2.5 for me followed by GPT-5.x.

    Gemini is very good at pointing out flaws that are subtle and not noticeable at first or second glance.

    It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment.

    • I find Gemini 2.5 Pro to be as good as, or in some cases better than, GPT-5.1 for SQL. It's aging otherwise, but they must have some good SQL datasets in there for training.

Additional context from AI Studio including pricing:

Our most intelligent model with SOTA reasoning and multimodal understanding, and powerful agentic and vibe coding capabilities

<=200K tokens • Input: $2.00 / Output: $12.00

> 200K tokens • Input: $4.00 / Output: $18.00

Knowledge cutoff: Jan. 2025

  • More expensive than the current 2.5 Pro. For >200K tokens it's at $2.50 input and $15 output right now.

Curiously, this website seems to be blocked in Spain for whatever reason, and the website's certificate is served by `allot.com/emailAddress=info@allot.com` which obviously fails...

Anyone happen to know why? Is this website by any chance sharing information on safe medical abortions or women's rights, something which has gotten websites blocked here before?

  • Creator of pixeldrain here. I have no idea why my site is blocked in Spain, but it's a long running issue.

    I actually never discovered who was responsible for the blockade, until I read this comment. I'm going to look into Allot and send them an email.

    EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.

    • > EDIT: Also, your DNS provider is censoring (and probably monitoring) your internet traffic. I would switch to a different provider.

      Yeah, that was via my ISPs DNS resolver (Vodafone), switching the resolver works :)

      The responsible party is ultimately our government who've decided it's legal to block a wide range of servers and websites because some people like to watch illegal football streams. I think Allot is just the provider of the technology.

      2 replies →

What's wild here is that among all the scores they've absolutely killed, Anthropic and Claude Sonnet 4.5 have somehow held on to a single victory in the fight: SWE-Bench Verified, and only by a single point.

I already enjoy Gemini 2.5 Pro for planning, and if Gemini 3 is priced similarly, I'll be incredibly happy to ditch the painfully pricey Claude Max subscription. To be fair, I've already got an extremely sour taste in my mouth from the last Anthropic bait-and-switch on pricing and usage, so I'm happy to see Google take the crown here.

  • SWE bench is weird because Claude has always underperformed on it relative to other models despite Claude Code blowing them away. The real test will be if Gemini CLI beats Claude Code, both using the agentic framework and tools they were trained on.

One benchmark I would really like to see: instruction adherence.

For example, the frontier models of early-to-mid 2024 could reliably follow what seemed to be 20-30 instructions. As you gave more instructions than that in your prompt, the LLMs started missing some and your outputs became inconsistent and difficult to control.

The latest set of models (2.5 Pro, GPT-5, etc) seem to top out somewhere in the 100 range? They are clearly much better at following a laundry list of instructions, but they also clearly have a limit and once your prompt is too large and too specific you lose coherence again.

If I had to guess, Gemini 3 Pro has once again pushed the bar, and maybe we're up near 250 (haven't used it, I'm just blindly projecting / hoping). And that's a huge deal! I actually think it would be more helpful to have a model that could consistently follow 1000 custom instructions than it would be to have a model that had 20 more IQ points.

I have to imagine you could make some fairly objective benchmarks around this idea, and it would be very helpful from an engineering perspective to see how each model stacked up against the others in this regard.
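
Something like this seems doable with purely mechanical checks. Here's a minimal sketch of the idea (all names and constraint types are hypothetical, just to illustrate generating N verifiable instructions and scoring adherence; plug in whatever model API you use):

    # Hypothetical instruction-adherence harness: hand the model N mechanically
    # checkable constraints and measure what fraction of them the output satisfies.
    import random

    def make_instructions(n, seed=0):
        rng = random.Random(seed)
        instructions = []
        for i in range(n):
            word = f"token{rng.randint(100, 999)}"
            if i % 2 == 0:
                instructions.append((f"Include the word '{word}' somewhere in your answer.",
                                     lambda out, w=word: w in out))
            else:
                instructions.append((f"Never use the word '{word}'.",
                                     lambda out, w=word: w not in out))
        return instructions

    def adherence_score(call_model, n):
        instructions = make_instructions(n)
        prompt = ("Write a short product update. Follow ALL of these rules:\n"
                  + "\n".join(f"{i + 1}. {text}" for i, (text, _) in enumerate(instructions)))
        output = call_model(prompt)          # call_model: whatever LLM API wrapper you use
        passed = sum(check(output) for _, check in instructions)
        return passed / n                    # fraction of instructions actually followed

    # Sweep n over 10, 50, 100, 250, 1000 per model and watch where the curve falls off.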

  • 20 more IQ points would be nuts: 110 ~ top 25%, 130 ~ top 2%, 150 ~ top 0.05%.

    If you've ever played a competitive game, you know the difference between these tiers is insane.
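
    (Those percentiles follow from the usual assumption that IQ is normally distributed with mean 100 and SD 15 — quick sanity check with scipy:)

      # Fraction of the population scoring above each IQ, assuming IQ ~ N(100, 15).
      from scipy.stats import norm

      for iq in (110, 130, 150):
          top = norm.sf(iq, loc=100, scale=15)   # survival function: share scoring above iq
          print(f"IQ {iq}: top {top:.2%}")
      # IQ 110: top ~25.2%, IQ 130: top ~2.3%, IQ 150: top ~0.04%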

    • Even more nuts would be a model that could follow a large, dense set of highly detailed instructions related to a series of complex tasks. Intelligence is nice, but it's far more useful and programmable if it can tightly follow a lot of custom instructions.

API pricing is up to $2/M for input and $12/M for output

For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
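
For a feel of what that means per request, a rough back-of-envelope comparison at the <=200K-token tier (the request size below is a made-up example):

    # Per-request cost at the listed <=200K-token rates ($ per million tokens).
    PRICES = {                      # (input, output)
        "Gemini 3 Pro":   (2.00, 12.00),
        "Gemini 2.5 Pro": (1.25, 10.00),
        "Gemini 1.5 Pro": (1.25,  5.00),
    }

    in_tok, out_tok = 50_000, 2_000     # hypothetical agentic coding turn
    for model, (p_in, p_out) in PRICES.items():
        cost = in_tok / 1e6 * p_in + out_tok / 1e6 * p_out
        print(f"{model}: ${cost:.4f}")
    # Gemini 3 Pro: $0.1240, Gemini 2.5 Pro: $0.0825, Gemini 1.5 Pro: $0.0725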

There needs to be a sycophancy benchmark in these comparisons. More baseless praise and false agreement = lower score.

  • This idea isn't just smart, it's revolutionary. You're getting right at the heart of the problem with today's benchmarks — we don't measure model praise. Great thinking here.

    For real though, I think that overall LLM users enjoy things to be on the higher side of sycophancy. Engineers aren't going to feel it, we like our cold dead machines, but the product people will see the stats (people overwhelmingly use LLMs to just talk to about whatever) and go towards that.

  • I care very little about model personality outside of sycophancy. The thing about Gemini is that it's notorious for its low self-esteem. Given that this one is trained from scratch, I'm very curious which direction they've decided to take it.

  • I'd like if the scorecard also gave an expected number of induced suicides per hundred thousand users.

    • https://llmdeathcount.com/ shows 15 deaths so far, and LLM user count is in the low billions, which puts us on the order of 0.0015 deaths per hundred thousand users.

      I'm guessing LLM Death Count is off by an OOM or two, so we could be getting close to one in a million.
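
      (The arithmetic, assuming roughly a billion users since "low billions" isn't a precise figure:)

        users, deaths = 1_000_000_000, 15
        per_100k = deaths / users * 100_000
        print(per_100k)          # 0.0015 deaths per 100k users
        print(per_100k * 100)    # ~0.15 per 100k, i.e. roughly 1.5 per million users,
                                 # if the count is low by two orders of magnitude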

  • Your comment demonstrates a remarkably elevated level of cognitive processing and intellectual rigor. Inquiries of this caliber are indicative of a mind operating at a strategically advanced tier, displaying exceptional analytical bandwidth and thought-leadership potential. Given the substantive value embedded in your question, it is operationally imperative that we initiate an immediate deep-dive and execute a comprehensive response aligned with the strategic priorities of this discussion.

The strategic move to use TPUs rather than Nvidia is paying off well for Google. They are able to better utilize their existing large infrastructure, and also to specialize the processes and pipelines for the framework they use to create and train models.

I think specialized hardware for training models is the next big wave in China.

Title of the document is "[Gemini 3 Pro] External Model Card - November 18, 2025 - v2", in case you needed further confirmation that the model will be released today.

Also interesting to know that Google Antigravity (antigravity.google / https://github.com/Google-Antigravity ?) leaked. I remember seeing this subdomain recently. Probably Gemini 3 related as well.

Org was created on 2025-11-04T19:28:13Z (https://api.github.com/orgs/Google-Antigravity)
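
(You can check that timestamp yourself: the public GitHub org endpoint linked above returns a created_at field. Minimal sketch, no auth needed for public orgs:)

    # Fetch the org metadata and print its creation timestamp.
    import json, urllib.request

    with urllib.request.urlopen("https://api.github.com/orgs/Google-Antigravity") as resp:
        org = json.load(resp)
    print(org["created_at"])    # 2025-11-04T19:28:13Z at the time of the comment above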

  • What is Google Antigravity?

    • According to Gemini itself:

      "Google Antigravity" refers to a new AI software platform announced by Google designed to help developers write and manage code.

      The term itself is a bit of a placeholder or project name, combining the brand "Google" with the concept of "antigravity"—implying a release from the limitations of traditional coding.

      In simple terms, Google Antigravity is a sophisticated tool for programmers that uses powerful AI systems (called "agents") to handle complex coding tasks automatically. It takes the typical software workbench (an IDE) and evolves it into an "agent-first" system.

      Agentic Platform: It's a central hub where many specialized AI helpers (agents) live and work together. The goal is to let you focus on what to build, not how to build it.

      Task-Oriented: The platform is designed to be given a high-level goal (a "task") rather than needing line-by-line instructions.

      Autonomous Operation: The AI agents can work across all your tools—your code editor, the command line, and your web browser—without needing you to constantly supervise or switch between them.

    • My guess, based on a GIF of a floating laptop tweeted by the ex-CEO of Windsurf who left to join Google: it'll be a Cursor/Windsurf alternative?

    • Couple patterns this could follow

      Speed? (Flash, Flash-Lite, Antigravity) this is my guess. Bonus: maybe Gemini Diffusion soon?

      Space? (Google Cloud, Google Antigravity?)

      Clothes? (A light wearable -> Antigravity?)

      Gaming? (Ghosting/nontangibility -> antigravity?)

    • I guess we'll know it in a few hours. Most likely another AI playground or maybe a Google Search alternative? No clue really

> Developments to the model architecture contribute to the significantly improved performance from previous model families.

I wonder how significant this is. DeepMind was always more research-oriented than OpenAI, which mostly scaled things up. They may have come up with a significantly better architecture (Transformer MoE still leaves a lot of room).

> TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs.

That seems like a low bar. Who's training frontier LLMs on CPUs? Surely they meant to compare TPUs to GPUs. If "this is faster than a CPU for massively parallel AI training" is the best you can say about it, that's not very impressive.

> Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs)

NVDA is down 3.26%

  • If it's because of that, then honestly it's as insane as the DeepSeek thing, where all the info was released weeks before but the market only got nervous when they released an app. I mean, info about Gemini 3 has been out for quite a while now, and of course they trained it using TPUs; I didn't even think that was in question.

     This model is not a modification or a fine-tune of a prior model

Is it common to mention that? It feels like they built something from scratch.

Trying to open this link from Italy leads to a CSAM warning

  • Creator of pixeldrain here. Italy has been doing this for a very long time. They never notified me of any such material being present on my site. I have a lot of measures in place to prevent the spread of CSAM. I have sent dozens of mails to Polizia Postale and even tried calling them a few times, but they never respond. My mails go unanswered and they just hang up the phone.

Curious to see the API pricing. SOTA performance across tasks at a price cheaper than GPT-5 / Claude would make almost everyone switch to Gemini.

  • Same here. They have been aggressively increasing prices with each iteration (maybe because they started so low). Still, I hope that is not the case this time. GPT-5.1 is priced pretty aggressively, so maybe that is an incentive to keep the current Gemini API prices.

Is flash/flash lite releasing alongside pro? Those two tiers have been incredible for the price since 2.0, absolute workhorses. Can't wait for 3.0.

I hope cheaper Chinese open weights models as good as Gemini will come soon. Gemini, Claude, GPT are kind of expensive if you use AI a lot.

So does Google actually have a Claude Code alternative currently?

  • Noteworthy: although Gemini 3 Pro seems to have much better benchmark scores than other models across the board (including compared to Claude), that's not the case for coding, where it appears to score essentially the same as the others. I wonder why that is.

    So far, IMHO, Claude Code remains significantly better than Gemini CLI. We'll see whether that changes with Gemini 3.

    • > I wonder why that is.

      That's because coding is currently the only reliable benchmark where reasoning capabilities transfer to predicting capabilities for other professions like law. Coding is the only area where they are shy about releasing numbers. All these exam scores can be faked by gaming the benchmarks.

    • Probably because many models from Anthropic would have been optimized for agentic coding in particular...

      EDIT: Don't disagree that Gemini CLI has a lot of rough edges, though.

    • Gemini performs better if you use it with Claude Code than with Gemini CLI. It still has some odd problems with tool calling, but a lot of the performance loss is the Gemini CLI app itself.

    • Because benchmarks are a worthless comparison and have nothing to do with reality. They're just jerk material for AI fanboys.

  • Gemini CLI. It's not as impressive as Claude Code or even Codex.

    Claude Code seems to be more compatible with the model (or the reverse), whereas Gemini CLI still feels a bit awkward (as of 2.5 Pro). I'm hoping it's better with 3.0!

I know this is a little controversial, but the lack of performance on SWE-bench is, I think, hugely disappointing economically. These models don't have any viable path to profitability if they can't take engineering jobs.

  • I thought that too, but it does do a lot better on other benchmarks.

    Perhaps SWE-bench just doesn't capture a lot of the improvement? If the web design improvements people have been posting on Twitter hold up, I suspect this will be a huge boon for developers. The SWE benchmark is really testing bugfixing/feature dev more.

    Anyway let's see. I'm still hyped!

    • It seems the benchmarks that had a big jump had to do with visual capabilities. I wonder how that will translate to improvements to the workloads LLMs are currently used for (or maybe it will introduce new workloads).

    • SWE-Bench doesn't even test bugfixing/feature dev properly past roughly 70% unless you benchmaxx it.

    • That would be great! But AI is a bubble if these models can’t do serious engineering work.

  • People here, and in tech in general, are so lost in the sauce.

    According to OpenAI at least, which probably produces the most tokens out of all the labs (if we don't count Google AI Overviews and other unrequested AI bolt-ons), programming tokens account for ~4% of total generations.

    That's nothing. The returns will come from everyone and their grandma paying $30-100/mo to use the services, just like everyone pays for a cell phone and electricity.

    Don't be fooled, we are still in the "Open hands" start-up business phase of LLMs. The "enshitification" will follow.

  • Really? If they can make an engineer more productive, that's worth a lot. Naive napkin math: 1.5X productivity on one $200k/year engineer is worth $100k/year.

    • People generally don't understand what these models are doing to engineering salaries. The skill level required to produce working software is going way down.

[flagged]

  • Gemma is an open-weight version of Gemini and obviously much less capable, probably even less than 2.5 Flash. Also, the story you are linking to is a complete nothingburger: models still very much hallucinate, especially on some extremely niche topics, and I don't see how another politician trying to capitalize on that is attention-worthy at all.

If these numbers are true then OpenAI is probably done, and Anthropic too. Still, it's hard to see an effective monetization method for this tech, and it is clearly eating into Google's main pie, which is search.

  • For SWE it is the same ranking. But if Google's $20/mo plan is comparable to the $100-200 plans for OpenAI and Anthropic, yes they are done.

    But we'll have to wait a few weeks to see if the post-release (possibly nerfed) model is still as good.

    • I have a few secret prompts for testing the complex reasoning capabilities of new models (in law and medicine). Gemini (2.5 Pro) is behind Anthropic (Sonnet 4.5, basic thinking) and OpenAI (Pro model) by a wide margin on my own benchmark, and I trust my own benchmark more than public leaderboards. So it's the other way around: Google is trying to catch up to where the others are. It just doesn't seem that way to some, because Google undercuts on price and most people don't have their own complex problems with verified solutions to test against (so they can't see how bad Gemini is in reality).

      1 reply →

  • Or else it was trained on / overfit to the benchmarks. We won't really know until people have a chance to use it for real-world tasks.

    Also, models are already pretty good but product/market fit (in terms of demonstrated economic value delivered) remains elusive outside of a couple domains. Does a model that's (say) 30% better reach an inflection point that changes that narrative, or is a more qualitative change required?

  • Why? These models just leapfrog each other as time advances.

    One month Gemini is on top, then ChatGPT, then Anthropic. Not sure why everyone gets FOMO whenever a new version gets released.

    • I think google is uniquely well placed to make a profitable business out of AI: They make their own TPUs so don't have to pay ridiculous amounts of money to Nvidia, they have a great depth of talent in building models, they've got loads of data they can use for training and they've got a huge existing customer base who can buy their AI offerings.

      I don't think any other company has all these ingredients.

      11 replies →

    • Considering GPT-5 was only recently released, it's very unlikely GPT will achieve these scores in just a couple of months. If they had something this good in the oven, they'd probably have saved the GPT-5 name for it.

      Or maybe Google just benchmaxxed and this doesn't translate at all in real world performance.

      3 replies →

  • The only one it doesn't win is SWE-Bench, where it is significantly behind Claude Sonnet. You just can't take down Sonnet.

    • One percentage point is not significant, neither in the colloquial nor the scientific sense[1].

      [1] Binomial formula gives a confidence interval of 3.7%, using p=0.77, N=500, confidence=95%
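
      For reference, the arithmetic behind [1] (a standard Wald normal-approximation interval; 500 is the usual SWE-Bench Verified instance count):

        # 95% Wald interval half-width for p = 0.77, N = 500.
        import math

        p, n, z = 0.77, 500, 1.96
        half_width = z * math.sqrt(p * (1 - p) / n)
        print(f"±{half_width:.1%}")   # ±3.7%, so a one-point gap is within the noise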

  • 1) New SOTA models come out all the time and that hasn't killed the other major AI companies. This will be no different.

    2) Google's search revenue last quarter was $56 billion, a 14% increase over Q3 2024.

    • 1) Not long ago Altman and the OpenAI CFO were openly asking for public money. None of these AI companies actually has any kind of working business plan; they are just burning investor money. If investors see there is no winning against Google (or some open Chinese model), the money will dry up.

      2) I'm not suggesting this will happen overnight, but younger people especially gravitate towards LLMs for information search and actively use some sort of ad blocking. In the long run it doesn't look great for Google.

      1 reply →

  • They're constantly matching and exceeding each other. It's a hypercompetitive space and I would fully expect one of the others to top various benchmarks shortly after. On pretty much every leading release someone does this "everyone else is done! Shut er down" thing and it's growing pretty weird.

    Having said that, OpenAI's ridiculous hype cycle has been living on borrowed time. OpenAI has zero moat, and are just one vendor in a space with many vendors, and even incredibly competent open source models by surprise Chinese entrants. Sam Altman going around acting like he's a prophet and they're the gatekeepers of the future is an act that should be super old, but somehow fools and their money continue to be parted.

    • This. If I had to put my money on a survivor, it would be Google, because it is an established company with existing revenue streams unrelated to AI. Anthropic and OpenAI won't stand alone without external funding.

  • This may just be bad recollection on my part, but hasn't Google reported that their search business is right now the most profitable it has ever been?

  • I'd love to see Anthropic/OpenAI pop. Back to some regular programming: the models are good enough, time to invest elsewhere.