Comment by himata4113
9 hours ago
I already felt that Gemini 3 proved what is possible if you train a model for efficiency. If I had to guess, the Pro and Flash variants are 5x to 10x smaller than Opus and GPT-5 class models.
They produce drastically fewer tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution: they produce broken tool calls and generally struggle with 'agentic' tasks. For raw problem solving without tools or search, though, they match Opus and GPT while presumably being a fraction of the size.
I feel like google will surprise everyone with a model that will be an entire generation beyond SOTA at some point in time once they go from prototyping to making a model that's not a preview model anymore. All models up till now feel like they're just prototypes that were pushed to GA just so they have something to show to investors and to integrate into their suite as a proof of concept.
> If I had to guess, the Pro and Flash variants are 5x to 10x smaller than Opus and GPT-5 class models.
I really doubt it, especially Pro. If anything, I wouldn't be surprised if their hardware lets them run bigger models more cheaply and quickly than the others. Pro is probably smaller than GPT 5.4 and Opus 4.6 (looks like 4.7 decreased in size), but 5x seems way too much. IMO Gemini 3 Pro is the most "intelligent" in an all-round human way, especially in the humanities. It's highly knowledgeable and undeniably the number one model at producing natural text in a large number of (human!) languages. The difference becomes especially large for more niche languages. That does not suggest a smaller model; more the opposite. The top 4 models at multilinguality are all Google: 1. 3 Pro, 2. 3 Flash, 3. 2.5 Pro, 4. 2.5 Flash. Even the biggest OpenAI and Anthropic models can't compete in that dimension.
It's definitely weaker at math and much worse at agentic things. Gemini chat as an app is also light-years behind; it's barely different from ChatGPT at release over 3 years ago. These things make it feel much weaker than it is.
Regarding Anthropic, they used to make the best multilingual and generalist models; it's a policy thing on their end, not a capability issue. Claude 3 was the best at this, including dead and low-resource languages. Neither modern Claude nor Gemini is remotely close to what Claude 3 was capable of (e.g. zero-shot writing styles). Anthropic basically reversed their "character training" policy and started optimizing their models for code generation at the cost of everything else, starting with Sonnet 3.5. Claude 4 took a huge hit in multilingual ability.
GPT, on the other hand, was always terrible at languages, except for the short-lived gpt-4.5-preview.
All modern models including Gemini have bugs in basic language coherency: random language switching, self-correction attempts resulting in hallucinations, etc. I speculate it's a problem with heavy RL, with rewards and policies not optimized for creative writing.
The benchmarks don’t seem to say that language ability has gotten worse?
AI Studio should be their default app
generally speaking
ultra ~ mythos ~ gpt-4.5 ~ 4x behemoth
pro ~ opus ~ 2x maverick
flash ~ sonnet ~ scout ~ other 20-30b active Chinese models
> They produce drastically fewer tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution: they produce broken tool calls and generally struggle with 'agentic' tasks. For raw problem solving without tools or search, though, they match Opus and GPT while presumably being a fraction of the size.
Agreed, Gemini-cli is terrible compared to CC and even Codex.
But Google is clearly prioritizing having the best AI to augment and/or replace traditional search. That's their bread and butter, and they'll be in a far better place to monetize it than anyone else. They've got a 1B+ user lead on anyone - and even adding all LLMs together, they still probably have more query volume than everyone else put together.
I hope they start prioritizing Gemini-cli, as I think they'd force a lot more competition into the space.
> Agreed, Gemini-cli is terrible compared to CC and even Codex.
Using it with opencode I don't find the actual model to cause worse results with tool calling versus Opus/GPT. This could be a harness problem more than a model problem?
I do prefer the overall results with GPT 5.4, which seems to catch more bugs in reviews that Gemini misses and produce cleaner code overall.
(And no, I can't quantify any of that, just "vibes" based)
I wonder what I am missing, because I can use gemini-cli with English descriptions of features or entire projects and it just cranks out the code. Built a bunch of stuff with it. Can't think of anything it's currently lacking.
>> Can't think of anything it's currently lacking.
Speed? The pro models are slow for me
The 3.1 Pro model is good and I don't recognise the GP's complaint of broken tool calls, but I'm only using it via the Gemini CLI harness; sounds like they might be hosting their own agentic loop?
Same. I've built dozens of small tools and scripts and never felt the need to try something else.
I thought the same for a long time; it was borderline unusable, with loops and bizarre decisions, compared to Claude Code and later Codex.
But I picked it up again about a month ago and I have been quite impressed. Haven't yet hit any of those frustrating QoL issues it was famous for, and I've been using it a few hours a day.
Maybe it will let me down sooner or later but so far it has been working really well for me and is pretty snappy with the auto model selection.
After cancelling my Claude Pro plan months ago due to Anthropic enshittification I’ve been nervous relying solely on Codex in case they do the same, so I’ve been glad to have it available on my Google One plan.
also, for incorporating into gsuite, youtube, maps, gcp and their other winning apps and behind-the-scenes infra...
Google doesn't need to give a shit, because so much of the internet is infested with Google ad trackers and AdWords, and everybody uses Chrome, that they will continue to make billions even without AI. Facebook did the same with their pixel so they could soak up data.
Gemini will be dead in 2 years and there'll be something else, but the ad and search company will remain given that they basically own the world wide web.
Except now, so much of the WWW is filled with AI slop that it breaks the system.
Not only that, Google has an advantage because they don't need to always generate a response.
When a lot of people ask the same thing, they can just index the questions, like results on the search engine, and recalculate them only so often.
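The "recalculate only so often" idea could be sketched as a TTL cache keyed on the normalized query. A toy sketch (all names here are hypothetical; a real system would presumably cluster paraphrased questions via embeddings rather than exact string matching):

```python
import hashlib
import time


class AnswerCache:
    """Toy TTL cache: serve a stored answer for repeated queries,
    regenerate only after the entry expires."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (answer, timestamp)

    def _key(self, query):
        # Trivial normalization; real systems would match
        # paraphrases semantically, not just exact text.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get_or_generate(self, query, generate):
        key = self._key(query)
        hit = self.store.get(key)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0], True  # served from cache, no model call
        answer = generate(query)  # the expensive model call
        self.store[key] = (answer, time.time())
        return answer, False
```

Two users asking the same thing within the TTL trigger only one generation; the second gets the cached answer back.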
IIRC when Gemini 3 Pro came out it was considered to be just about on par with whatever version of Claude was out then (4?). Now Gemini 3 is looking long in the tooth. Considering how many Chinese models have been released since then, and at least 2 or 3 versions of Claude, it's starting to look like Google is kind of sitting still here. Maybe you're right and they'll surprise us soon with a large step improvement over what they currently have. Note: I do realize that there's been a Gemini 3.1 release, but it didn't seem like a noticeable change from 3.
As other people are saying here: the Gemini models are mostly terrible at tool use and long context management. And maybe not quite as good with finicky "detail" parts of coding generally.
Where they excel is just total holistic _knowledge_ about the world. I don't like "talking" to it, because I kind of hate its tone, but I find Gemini generally extremely useful for research and analysis tasks and looking up information.
People who say Gemini is bad at long contexts are so wrong.
You can put a whole 50,000-70,000 LOC codebase into Gemini 3.1 Pro's context, making it 800,000+ tokens, give it a detailed task, and ask for whole changed files back, and it will execute it, sometimes in one shot, sometimes in two. E.g. depending on whatever stack you work with, it can see all the errors at once and fix everything in a single reply.
Yes, it will give you back 5-15 files, up to 4000 LOC total, with only the relevant parts changed.
This is a terribly inefficient way to burn $10 of tokens in 20 minutes, but the attention and 1:1 context retention are truly amazing.
PS: At the same time it is bad at tool use, but that has nothing to do with context.
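The workflow described above - dumping a whole repo into the context window with path headers so the model can hand back whole changed files by name - could be sketched like this (a toy packer; the ~4 chars/token estimate and the file filter are rough assumptions, not anything Gemini-specific):

```python
from pathlib import Path


def pack_codebase(root, extensions=(".py", ".ts", ".go")):
    """Concatenate source files into one prompt blob, with a
    header line per file so the model can name files it changes."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"=== FILE: {path.relative_to(root)} ===\n{text}")
    blob = "\n\n".join(parts)
    # Very rough heuristic: ~4 characters per token for source code.
    est_tokens = len(blob) // 4
    return blob, est_tokens
```

You would prepend the task description to the blob and send the whole thing as one request; the headers let you split the response back into files afterwards.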
Gemini had the best long context support for the longest time, and even now at >400k tokens it's still got the best long context recall.
Gemini is just not trained for autonomy/tool use/agentic behavior to the same degree as the other frontier models. Goog seems to emphasize video/images/scientific+world knowledge.
Their "preview" naming is pretty arbitrary. It's just their way to avoid making any availability or persistence promises, let alone guarantees. It's also a PR tactic to mask any failures by pretending it's beta quality.
I really wonder what I’m missing with Gemini. It’s a second rate model for me at best. I find it okay (not great) at collecting information and completely useless at agentic tasks. It’s like it’s always drunk. When the Claude credits expire in Antigravity, I’m done for the day.
> They produce drastically fewer tokens to solve a problem
I LOLed at this because of the constant death loops that don't even solve the problem at all.
Am I tripping or is this an AI reply? Like it barely has anything to do with the article other than both are related to AI
An AI reply would be more relevant to the headline / article. Humans often write something tangential, since we have more going on in our heads than just the context at hand, while AI can't ignore context.
> a model that will be an entire generation beyond SOTA
That model would then be SOTA.
Tautologically, you can't be better than SOTA.
Interesting mix of words: "I felt" -> "proved" -> "guess". One of those is not like the others!
[flagged]
Is your friend on the JAX team?
I'm really struggling with terrible bloating today, but I deemed it too dangerous to release.
Thank you for your sacrifice. Could you speak to my dog please? You may wish to yell from a distance, actually.