Gemini 3.5 Flash

10 hours ago (blog.google)

https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...

For those who would like to know the total and active parameter count of this model: even though Google doesn't disclose the model's technical details, we can infer them within relatively tight margins based on what we do know.

We know they serve the model on TPU 8i, which we have plenty of hard specs for (so we know the key constraints: total memory, bandwidth, and compute FLOPs). We can also set a ceiling on the compute complexity and memory demand of the model, based on knowing they will be at least as efficient as what is disclosed in the DeepSeek V4 Technical Report.

We can also assume that the model was explicitly built to run efficiently in a RadixAttention style batched serving scenario on a single TPU 8i (so no tensor parallelism, etc. to avoid unnecessary overheads... Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models).

We know Google intends to serve this model at a floor speed of around 280 tok/s too.

Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.

Visual:

  ┌────────────────────────────────────────────────────────┐
  │                   TPU 8i VRAM (288 GB)                 │
  ├───────────────────────────┬────────────────────────────┤
  │   Static Model Weights    │  Dynamic Allocations &     │
  │   (250B - 300B @ Mixed    │  Compressed KV Caches      │
  │   FP4/FP8)                │  (RadixAttention / SRAM)   │
  │   ~110 GB - 150 GB        │  ~138 GB - 178 GB          │
  └───────────────────────────┴────────────────────────────┘

I do model serving optimization work. This is napkin math.
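
A minimal sketch of the memory side of this napkin math, in Python. The 288 GB capacity and the 250-300B parameter range are taken from the estimate above; the FP4/FP8 split (`fp8_fraction`) is an assumed knob rather than a disclosed figure, and the function names are made up for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the 288 GB figure in the diagram


def weight_gb(total_params: float, fp8_fraction: float = 0.0) -> float:
    """Weight footprint in GB: FP8 params take 1 byte, FP4 params take 0.5 byte."""
    bytes_per_param = fp8_fraction * 1.0 + (1.0 - fp8_fraction) * 0.5
    return total_params * bytes_per_param / GB


def kv_headroom_gb(total_params: float, fp8_fraction: float = 0.0,
                   hbm_gb: float = 288.0) -> float:
    """Memory left for KV caches and activations once the weights are resident."""
    return hbm_gb - weight_gb(total_params, fp8_fraction)


for params in (250e9, 300e9):
    for fp8_fraction in (0.0, 0.1):  # 0.1 is an assumed share, for illustration only
        weights = weight_gb(params, fp8_fraction)
        headroom = kv_headroom_gb(params, fp8_fraction)
        print(f"{params / 1e9:.0f}B params, {fp8_fraction:.0%} FP8: "
              f"weights {weights:.1f} GB, headroom {headroom:.1f} GB")
```

With everything in FP4 this gives 125 GB of weights for 250B parameters and 150 GB for 300B, leaving 163 GB and 138 GB of headroom respectively.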

  • We've been really impressed with the performance of ~30B parameter class models and how close they are to the frontier from ~6-12 months ago, which raises the question: are the frontier labs really serving 10T parameter models? Seems unlikely.

    If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

    Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

    Data at https://gertlabs.com/rankings

    • We know from NVIDIA's public Vera Rubin inference engine marketing materials that the frontier lab models are ~1-2T total.

      Mythos is an exception that's larger.

  • If this is accurate it raises the question: why is this model so expensive? DeepSeek v4 Flash is 284B total/13B active, FP4/FP8 mixed, and only costs $0.14/$0.28 - even less from OpenRouter. Of course Gemini 3.5 Flash is most likely a better product, and therefore it can command a higher price from an economics perspective, but does this imply Google is taking roughly a 90% profit margin on inference? If so they're either very compute-limited or confident in the model and wanting to recoup training/fixed costs (or both).

    • Rumor is that GCP was happily selling compute to competitors. After all, under the hood, Google is closer to a federation than a corporation. The state of GCP doesn't care about the state of Gemini.

    • Well, we use Flash models extensively (both 2.5 and 3.1), and I cannot overstate this: Google cannot fucking serve them without 503s 70% of the time on most days.

      I think it’s pure economics. Flash models are OP for the price, which leads to too much demand, and Google cannot serve it. The high price is likely there to reduce load, and hey, if it still makes money, just keep the margin.

  • Do you have similar math for the flash-lite variant of the models? I'd be curious. Based on my testing / benchmarks I think it's around the 100-120B mark.

    With the Pro variant being around 600B - 800B

    My testing compares its performance / output to other models in the same size range, so it's not as scientific as yours.

  • given this, is it safe to assume that inference pricing is barely related to cost to serve at this point and there is considerable margin?

  • Tell me more about what your day looks like. What do you think of the LLMOps book from Abi, in case you have read it? Any other resources you can recommend?

The pelican is a lot: https://github.com/simonw/llm-gemini/issues/133#issuecomment...

Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.

Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
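
The 13 cents can be reproduced with a quick calculation. This is a sketch that assumes the `it=11` and `ot=14403` parameters in the link above are the input and output token counts, and that the request is billed at the $1.50/$9.00 per million token list price; `request_cost_usd` is a made-up helper name.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices in USD."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Token counts assumed from the llm-prices.com link; prices are the 3.5 Flash list prices.
cost = request_cost_usd(11, 14_403, 1.50, 9.00)
print(f"${cost:.4f}")  # prints $0.1296
```

Nearly all of the cost comes from the output tokens; the 11 input tokens contribute a negligible amount.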

  • That pelican looks like it's in Miami for a crypto conference.

  • This is a perfect illustration of something I noticed with LLM progress. Ask them to improve an svg like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere, try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements.

    edit: fixed human hallucination

    • When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

      I ask because:

      Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

      But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

      I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.

    • To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

      When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

    • So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).

    • It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

    • Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.

  • Forgetting the chainstay is typical of asking random people to draw a bicycle.

    https://www.gianlucagimini.it/portfolio-item/velocipedia/

    > most ended up drawing something that was pretty far off from a regular men’s bicycle

    • Although every single render of those has pedals on the correct side, as opposed to the Gemini optical illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.

      Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.

  • I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

  • If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

  • We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

    That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

  • I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

  • I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).

  • I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data we'll have these canonical archetypes.

  • I can’t help but think that what AI is best at is convincing management that things it creates are full featured which reads to their brains as mature

  • Wow, what’s with all the styling? Is it a manifestation of Google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

  • I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

    Last time I tried, ChatGPT's image generator got the best result.

  • funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

    • That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.

  • `<!-- Pelican Eye / Sunglasses (Cool Retro Aviators) -->`

    wtf

    `<!-- Gold Rim -->`

    WTF??

  • Love your pelicans, as always. And that one is... Wow.

    I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for a while now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.

    https://en.wikipedia.org/wiki/Synthwave

    Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

    To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

    • Synthwave vibe hype hit a cultural high point with the release of Far Cry 3 Blood Dragon in 2013.

      So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.

    • At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.

Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate successor to a same-sized model (and lol @ 3 only ever getting a preview).

3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10

  • This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

    Gemini 2.5 flash (27 score): $172 (1.0x)

    Gemini 2.5 pro (35 score): $649 (3.8x)

    Gemini 3.0 Flash (46 score): $278 (1.6x)

    Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

    This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
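
The multipliers quoted above can be re-derived from the four eval costs. A small sketch, using only the dollar figures from this comment; the dictionary and the `cost_ratio` helper are illustrative names.

```python
# Whole-eval costs in USD, as quoted in the comment above.
eval_cost_usd = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more expensive `model` is than `baseline` for the full eval."""
    return eval_cost_usd[model] / eval_cost_usd[baseline]


for baseline in ("Gemini 2.5 Flash", "Gemini 2.5 Pro", "Gemini 3.0 Flash"):
    ratio = cost_ratio("Gemini 3.5 Flash", baseline)
    print(f"3.5 Flash vs {baseline}: {ratio:.1f}x")
```

This gives 9.0x, 2.4x and 5.6x, matching the figures in the comment.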

  • They probably never intended to keep serving cheap models. This is a natural way to introduce the squeeze, now that they have people who built services on their API. It makes a lot of sense to have an abstraction layer where the provider doesn't matter. If you are working in Kotlin, Koog is excellent.

    • I think the big 3 are cartelizing and starting to ratchet up costs. GPT5.5 is not easily distinguishable from 5.1. I wouldn't be shocked if we've hit the ceiling and everyone is quietly positioning for the exit.

    • > now that they have people who built services on their API

      People really can’t wait to be the next Zynga

  • If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

    Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

    • This is not priced at inference cost.

      My guess: it's the price at which they make more money than if they rent the TPUs to other companies.

      The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?

    • It's probably that in 1 or 2 years local (free) models will completely take the place of cheap models, so cheap models need to move up the quality chain.

      You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.

      Flash seems to be targeting the near-frontier category.

    • Prevailing wisdom is that serving LLMs at a profit is achievable... it's when you factor in the cost of training them that prices get astronomical real fast.

      Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.

      https://www.together.ai/pricing

      https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)

      Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.

      But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.

      ...my opinions here are of course, conjecture built on top of conjecture....

    • This is trouble if you're not Google/OpenAI/Anthropic: they're all shifting towards pricing for the economic value of the knowledge work they're aiding.

      The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.

      That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.

      At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.

      (and in terms of timing, I think they're all well under the curve for pricing by economic value. Everyone is talking about Uber spending millions on tokens, but how much payroll did they pay while devs scrolled their phones and waited for CC to do their job?)

  • We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

    • We're having DeepSeek moments every couple of weeks.

      Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

      The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

      And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

      The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat...there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).

    • Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.

    • What we need is a DeepSeek moment in hardware, i.e. China reaching parity on node size. That is the only way the latest computers, let alone the latest AI, will be available to us in the future; otherwise the profit margins will push most production to AI.

  • 3.1 flash lite — $0.25/$1.50 — plus insanely fast.

    3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

    For comparison, Opus models are $5/$25

    • Opus 4.7 is smarter than even Gemini 3.1 Pro on nearly every metric, though. You're comparing apples to oranges. Gemini 3.1 Flash is somewhere in the neighborhood between current Haiku and Sonnet, I think? Still a better value than the Anthropic models, I guess, which are quite pricey.

      Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.

  • To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

    I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

    That said, I think we’ll see the lite/flash models and the pro models will diverge more price wise. The pro models will become more and more expensive.

  • Their rationale might be that its size and intelligence are growing relative to the market.

    Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

    Question is are you going to persuade anyone with this argument?

    Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

  • To me this is almost like a tone-deaf naming change.

    Empty Slot (new Pro as Mythos competitor?)

    Old Pro -> now Flash

    Old Flash -> now Flash Lite

    Old Flash Lite -> now Gemma (and not served by Google)

    I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like, Air gets better and pricier and becomes Pro-level model, Neo same way becomes Air-level model, etc. But Apple's too design oriented to do something like that. Google, well...

    This change has made me decide to move to a multi-provider situation like through OpenRouter for consumer-facing LLM api in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.

    But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.

    And if we think this way, it's possible that prices are actually falling?

  • In general, Gemini Flash is still relatively cheap compared to the "mini" versions of the other big 2. However, I agree that newer versions seem to come with multi-x price increases (similar to the new ChatGPT), and we certainly need competition from the open source models to keep these guys in check on pricing.

  • Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

    • These companies are unprofitable (as all companies at this stage and ambition should be) but I increasingly don't see any justification for the idea that it is fundamentally unprofitable.

      Inference alone is certainly profitable. I'm running models at home, for free, that are comparable in performance to paid models from a year or so ago. Even for much larger models the costs around inference serving are clearly manageable.

      Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.

      This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).

      It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).

    • It is insanely profitable though, if you cut out r&d cost, plus the marketing and loss leaders. Don't let them gaslight you.

      Even Anthropic, which does not own any hardware, still has a big margin providing Claude models.

  • Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.

  • It might be temporary pricing given that 3.5 Flash is actually superior to the existing 3.1 Pro in almost all regards, so they're in a bit of a lurch as 3.1 Pro really doesn't make sense given that 3.5 Pro has been delayed a bit.

  • I use Gemini models in Junie daily. When I need accuracy I switch to Gemini 3.1 Pro Preview (why is it still in preview?), but it burns thru credits, leaving me topping up $5 every day. 3.1 Flash Lite is just not accurate enough. 3 Flash is the sweet spot, just as JetBrains suggests it is.

    Maybe I'll look at Opus again, but it was slower, much more expensive and, worst of all, wasn't listening to your instructions.

  • That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

    I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

    • They have said AI will be priced like a utility, meaning $100-300 per month or so.

  • At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

    and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)

  • Gemini 2.5 flash was the best Gemini model.

    Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

  > Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG

3.5 Flash: Thinking Medium - 7516 tokens

https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...

3.5 Flash: Thinking High - 7280 tokens

https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...

3.1 Pro - 28,258 tokens

https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...

3.1 took 3 minutes of thinking to generate, but it's the only one that got animated movement.

Gemini 3.1 Pro is literally the worst AI when I cycle from Opus to GPT 5.5 then finally Gemini. It's actually insane that it's a frontier model. I rage at it more than my wife.

Taking into account that this is a flash model, it's a strong release. It's very fast and frontier-ish for the price.

Raw intelligence is high for a flash model. But Google's problem has always been productization and tool use, whereas raw intelligence is always competitive. It does not look like they solved that with this release -- in fact, their tool use delta (the improvement in scores when given arbitrary tools and a harness) has actually regressed from some previous models.

Data at https://gertlabs.com/rankings

Am I really so old that when someone says "Flash" my immediate response is... "consider HTML5 instead" ??

  • Very little of what made the Flash culture so fun made its way into HTML5.

  • They were CPU killers but man those Flash websites were gorgeous (talking mostly about MU Online "private" servers)

    • It was probably the right call at the time with low bandwidth. Nowadays I bet flash would execute faster than most js heavy sites :D

  • The Flash designer was really nice. One thing the web kind of set back was all the RAD tools from the 90s and 2000s.

    • And there were some amazing RAD and prototyping tools in the 90s (mostly for DOS, but also for Windoze desktop apps.) You're right, we sort of gave up on the idea when everyone wanted to be seen as a "real" software engineer who knew how to sling Java on the back end.

  • Lol. Young uns!

    Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!

    Every time I have heard the word flash for goodness knows how many years.

  • Same here, and worse, because in another thread users are generating animations.

I have google ai pro plan and tried antigravity with 3.5 flash but it used up all my quota in two prompts. If that is not a bug then it is seriously unusable.

  • Yesterday, or the day before, Google lowered the AI Pro quota from 33x standard usage to 4x.

    From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.

    The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol

    • The crunch is real.

      - The model is approx 3.3x cost.
      - The model is realistically almost 5x cost due to token usage.
      - Google has TPUs to run this on (yet the cost).
      - Google has a lot more security and backup cash compared to all other AI companies, likely even combined (yet the cost).

      We can continue moving the goal posts, but it seems we're at a bit of a wall. Costs are increasing, intelligence is improving, but the cost is rising drastically.

      You'd think Google of all companies in the mix would be able to sustain lower costs with how integrated they are with TPU, Deepmind and effectively unlimited budget.

On my Agentic SQL benchmark it scores 19/25. That's... mediocre.

That means it performs worse than 3.1 Flash Lite Preview (22/25), is slower (367s vs 142s) and is more expensive (75c vs 2c).

It is outperformed by Gemma4 26B-A4B in every way(!)

https://sql-benchmark.nicklothian.com/?highlight=google_gemi...

(Switch to the cost vs performance chart to see how far this is off the Pareto frontier)

Knowledge cutoff: January 2025

Latest update: May 2026

I have a very bad feeling about this lag.

  • At least in some cases, there seems to be a move toward training on more synthetic data and strictly curated data, especially for smaller models where knowledge can't be extremely broad, because there just isn't enough room to store the world in tens or hundreds of gigabytes of model weights. So, to achieve higher quality reasoning, the training has to be focused and the data has to be very high quality and high density.

    With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.

    Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.

    • > it maybe doesn't even matter that the models are using older data.

      This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.

      Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness — until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?

  • Can you explain what you mean?

    • LLM pre-training models risk being unable to be updated with data from after 2025, as much of it is corrupted with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.

      Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.

  • you really shouldn't have them pulling facts from their weights, they need grounding from real data sources

Wow at the price hike. Still I think in the long run the Chinese will win if they're able to produce hardware comparable to Nvidia.

  • Why would the Chinese sell me nvidia cards? I can just buy an AMD iGPU, and the perf/$ is much better than nvidia dGPUs.

    (Typed on a 2023 macbook perfectly capable of running the Chinese open weight models.)

  • I've had the $20 Gemini plan to use when my local setup runs into tougher problems and the throttling today has been bonkers. I canceled my subscription and will look into upgrading my local setup.

  • Doesn't need to be the Chinese. It can be anyone without stratospheric Nvidia margins. The Gold Rush phase of AI economy (aka "the bubble") is beginning to slow down and the Optimization phase is just beginning to ramp up (we see this with massive bumps to token cost and token burn rate of pretty much all frontier models, plus the general pivot away from your typical individual chat end-users to businesses and employees of said businesses) and there will come a time when "nvidia has the best software stack" will not mean much for the big players. Organically, I think it already kinda does, it's just masked with the inertia of massive circular deals and Nvidia selling its services to itself (entities it backs/invests in).

  • Isn't China also allowed to purchase Nvidia GPUs now too?

    • Up to the H200 iirc, but they haven't made a purchase yet afaik. The experts in such things believe that if they do make a purchase, it will be a token one. Xi is pushing hard for indigenous production, not becoming "hooked" on American AI chips like some (not so bright) people think we can cause to happen.

The price is crazy.

And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?

It seems like google does want us to use Chinese models.

  • What exactly are you doing with this that you can’t generate $1.50 of value per million tokens?

    • Wrong question.

      Right question: What exactly is Google's plan for the long term pricing of these models, and are we all going to be priced out in a year?

Beats 3.1 Pro for price per token, but artificial analysis is showing it's dumber per token and costs more overall

$1.5/M input tokens, $9/M output tokens

6x the price of 3.1 flash lite
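
To make the "cheaper per token, costlier overall" point concrete, here is a rough sketch. Only the $1.50/$9.00 prices come from this thread; the second model's prices and all token counts are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def task_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one task = tokens used * price per token, summed over input and output."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


# Prices quoted above for 3.5 Flash.
flash_35 = Pricing(input_per_m=1.50, output_per_m=9.00)
# Hypothetical pricier-per-token model, for illustration only.
other = Pricing(input_per_m=2.00, output_per_m=12.00)

# Hypothetical token usage for the same task: the cheaper model "thinks" more.
flash_cost = task_cost(flash_35, input_tokens=20_000, output_tokens=30_000)
other_cost = task_cost(other, input_tokens=20_000, output_tokens=12_000)

print(f"3.5 Flash: ${flash_cost:.3f}  other: ${other_cost:.3f}")
```

With these placeholder numbers the model that is cheaper per token ends up costing more for the task, simply because it emits more output tokens.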

  • "Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.

  • I haven't used 3.5 at all yet, but previous Gemini (and Gemma) models are by far more token-light per task than any other model.

    Cost per task is a more productive measure, but obviously a more difficult one to benchmark.

  • I don't think input/output pricing matters, 90% of the cost is cache. $0.15 is pretty good, but still very expensive.

    • It depends on the use-case. Yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.

    • Gemini caching is confusing though:

        $0.15 / million tokens
        $1.00 / 1,000,000 tokens per hour (storage price)
      

      I much prefer the OpenAI/DeepSeek way of pricing caching where you don't have to think about storage price at all - you pay for cached tokens if you reuse the same prefix within a (loosely defined) time period.
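
      As a rough sketch of why the two-part pricing is harder to reason about: assuming the $0.15 is charged per million cached tokens each time the cache is read, and the $1.00 per million tokens for every hour the cache is kept, the total for one cached prefix would look like this (T = cached tokens, n = number of requests that hit the cache, h = hours stored; the symbols and the reading of the two prices are my assumptions):

      ```latex
      \text{cost} \approx \frac{T}{10^{6}} \left( 0.15\, n + 1.00\, h \right)
      ```

      For example, a 100,000-token prefix kept for 2 hours and reused 50 times would come to 0.1 × (7.5 + 2.0) = $0.95 under these assumptions.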

    • In our experience, caching is not very reliable with Google. We always get random cache misses that don't happen with other providers. We find OpenAI, Anthropic and Fireworks (which we use a lot) all have higher cache hit rates. So it's not only about the cost of cached tokens but also what kind of cache hit rate you get.

How is this progress? The token cost just keeps going up and up. Flash is the new Pro? Do the models actually cost more to run or is it fattening margins?

Yikes. I think the concept of a 'flash' model is changing, no? Google used to market this as its lower-intelligence, faster, cheaper option. I appreciate that they are delivering on both of those, but personally I would appreciate if they could create an incremental knowledge improvement while holding price steady. Fortune 500 companies have to make their money I guess.

  • My guess is Gemini Pro coming later will be 2x more, bringing it comparable to Opus’s pricing.

  • That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...

Worth noting that Google marked this stable rather than preview, which is unusual compared to their recent releases. Pair that with the 3x price hike and Flash pricing now reads like the long-term floor they want, not a temporary thing they will walk back later. But it's hard to tell yet whether that's Google specifically reading the room or the whole industry quietly resetting the cheap-inference baseline.

Engineers at Google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.

They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.

  • > Engineers at Google have publicly stated that the models are too big and are far from their potential

    Can you link to a source?

    • I wish I could, it was one of those youtube podcast type interviews with one of the engineers, there was a lot more shared, but that line stuck with me the most.

  • Don’t let that fool you. Google will have SOTA models as big as or even bigger than their competitors.

    They are just refining their current models while they finish training the next generation.

    They will all come out at about the same time: Anthropic, OpenAI, Google, xAI.

  • I mean, yes and no.

    Nobody really knows the answer to which one is more optimal

    * Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.

    * Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.

China: we don’t need to use US models, we can distill them ourselves

Google: we don’t need the Chinese to distill our models, we can do it ourselves

Aw. The listen to article widget doesn't work properly on mobile Safari and when using the options button, the popup appears below the "In this article" dropdown occluding it.

At least it read the authors of the article to me.

I wish we would push more towards testing code. Agentic AI excels when it's engaged.

The demo of the model in Antigravity automatically renaming and categorizing unstructured assets using vision was quite cool; it demonstrates that the IDE side panel can be used for more than just coding. I wonder if the harness in Antigravity is based on Gemini CLI or if they are completely different. Could Gemini CLI do the same task? Or is the vision feature an Antigravity thing?

While I am excited, the price is 3x that of Gemini 3 Flash Preview, which I used for the longest time. Since the arrival of DeepSeek V4 Flash, I have been a happy DeepSeek user. We will see how long that reign lasts after I try this new Gemini.

I am interested to see how they will serve demand with the TPU monopoly they have.

That pelican looks like it just sold a SaaS company and bought a bike because its therapist said it needed balance.

  • The pelican is ready to discuss increased synergies of bringing AI to all teams at the firm!

Arena.ai:

> Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.

https://x.com/arena/status/2056793180998361233

  • Given how widely varying the amount of tokens each model uses for a given task, "price-per-token" is essentially meaningless when doing this sort of comparison.

    Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.

I have thought about this and I think overall, this was a disappointing release from Google. I'm not sure what the general sentiment is, but this feels like a miss.

What they did do in the keynote was spend a lot of time talking about their distribution advantage, and how they can own the consumer in search. But not a lot that will benefit partners or developers.

Basically, they released something broadly competitive with Sonnet 4.6, a new Omni model that seems interesting but unclear yet. They have completely ceded the frontier to OpenAI / Anthropic, and are saying "look for pro next month".

The best release since nano banana pro from Google has been Gemma.

Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.

  • People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.

    More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at GPT-4.0 text-analysis levels.

    Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

    • I see constant hallucination in Claude Code when using specific tooling. It thinks it knows the AWS CLI, for instance, but in 4.6 and 4.7 it attempts to use flags that don't exist all the time. When asked about it, it says that yes, the flag doesn't exist in that command, but it exists in a different command (which it does), and yet it attempts to use it without extra info.

      Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.

      For Gemini, I typically use it to query information from large corpuses, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it only makes sure you catch it after the fact, just like a call to a method that doesn't exist in a class gets found out by the compiler. The LLM still hallucinated it.

    • https://gemini.google.com/share/9cd8ca68025a

      I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").

      Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)

    • I can reliably produce hallucinations with this genre of prompt: "write a script that does <simple task> with <well known but not too-well-known API>." Even the frontier models will hallucinate the perfect API endpoint that does exactly what I want, regardless of if it exists.

      The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.

    • I asked gemini 3.1 Pro to search for the linkedin URLs for a list of peers. It generated a plausible list of links -- but they were all hallucinated. On a follow up it confirmed it couldn't actually search, but didn't tell me that without prompting.

    • "People complain about them incessantly, but I can almost never get people to actually post receipts."

      ...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.

      No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.

      Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.

    • Are the knowledge cut off issues well known? I don't remember seeing them prominently displayed.

      Also, prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information and when I ask it what section of the spec that information comes from it just makes up a section number

    • I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

      And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

      I often have to say "please do searches and cite sources", because if it doesn't, it will confidently give me wrong or outdated information.

      If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

  • I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

    • I'm really running into this deep at the edges of content creation. Take, for example, a need to generate some kind of legal work. The cost of painstakingly checking and rechecking each case cited is reducing the value of these frontier models immensely.

      Coding, however, is solved like magic. Easier to add tests, to be fair.

  • It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

  • As long as the model uses web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering so they can hallucinate

    • I've seen ChatGPT and Gemini hallucinate even from web search. It's better, but not sufficient.

  • if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor. there would be no articles about it because it's so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law. in other domains that move faster, sources would be only partially outdated, with agentic search and mcp servers filling in the gaps

    AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"

    (the domain name is dumb and completely unmarketable)

    • The models still hallucinate badly when called via APIs, especially if web search is not enabled. Gemini hallucinates quite frequently even with the app and search enabled. More recent prompts/harnesses (e.g. ChatGPT 5.x and Deepseek v4) search very aggressively, which does greatly mitigate hallucinations.

The $1.50/$9.00 pricing is a meaningful shift if you've been running Gemini as the "fast iteration" half of a multi-model coding workflow. I've had Claude Code, Codex, and Gemini CLI running side by side and the working split was "Gemini for quick scaffolding and exploration where the cost of being wrong is low, Sonnet for correctness-critical stuff." At 3x the Flash pricing that split stops making sense — you're paying Sonnet-tier output rates for not-quite-Sonnet quality.

For pure chat that's annoying but tolerable. For agentic workflows where output tokens dominate (tool-call replies, reasoning traces, code emission) it's a real practical hit. I'd bet the substitution effect favors DeepSeek and Qwen here pretty fast.

  • Out of curiosity, what was your workflow to generate this comment? I’m curious what model (claude?) and process (manual prompt with bullet points?) you used.

3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite. $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.

I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

[0] https://artificialanalysis.ai/models/gemini-3-5-flash

[1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview
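
For reference, the 74% figure is just the relative difference between the two suite costs quoted above; a quick check, using nothing beyond those two numbers:

```python
# Suite costs quoted above (USD), from the Artificial Analysis pages linked in [0] and [1].
flash_35_cost = 1551
pro_31_cost = 892

# Relative increase of 3.5 Flash over 3.1 Pro.
increase = (flash_35_cost - pro_31_cost) / pro_31_cost
print(f"{increase:.1%}")  # → 73.9%
```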

  • Ouch. That's going in completely the wrong direction.

    How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

  • Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.

  • How do they calculate that?

    3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.

    >3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite

    That's everything I needed to know.

Can anyone who has extensive, recent, experience with Claude code and Codex contextualize the current Gemini CLI product experience?

  • Gemini models have consistently disregarded rules and gone their own way for me. They will finish a task and get it done frequently way above the scope that you gave it, but they take a million shortcuts to get there. e.g. deciding the linter isn't important and disabling the pre commit hook. coding features you didn't ask for.

  • I have and use both Claude Code and Gemini CLI, and still don't consider Gemini worth starting for coding except to review Claude's output in critical commits (on a security boundary, maybe broad refactors, etc.), though I try side-by-side every now and then just to see the state of things. I also use Gemini Pro in a security scanning harness to act as a second set of eyes, but Opus is better at finding security bugs than Gemini, so I don't know that it's accomplishing anything beyond just using Opus.

    Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.

    I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose, relatively speaking, that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.

    • I would argue that prose is just a prompt issue. GPT 5.5 output is easier to control than Gemini's by prompting. Having better defaults does not necessarily make it better.

  • My anecdote: smart but too stubborn to be useful.

    I have been trying Gemini since 2.5 for coding.

    It is the smartest for creative web stuff like HTML/CSS/JS.

    But it has been very stubborn with following instructions like AGENTS.md.

    And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.

    I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.

    I currently use GPT 5.5 + DeepSeek 4 Flash.

    BUT I didn't test Gemini 3.5 Flash yet. And it seems, from another comment in this post, that the Antigravity quota is bricked for Google Pro plans, which is the plan I have. So I don't have high hopes.

In my tests, in real production use cases, it's a hard pass.

It's actually 10-15% slower and also more expensive than Gemini 3.1 Pro, because it thinks more than 2.5x as much as Gemini 3.1 Pro.

So that thinking verbosity nullifies the speed and cost gains.

AND the quality is worse than 3.1 Pro for our use cases, making mistakes Pro doesn't make.

Google also updated Antigravity. Version 2.0 is geared more toward conversation with an agent. The previous VS Code-like IDE was much better.

Well, available for Gemini means these days that half the time they are “Receiving a lot of requests right now.” and so sorry they couldn’t complete the task. Luckily the model supports long time horizons because that’s what’s needed. /me likes Gemini a lot just wishing Google would add the compute!

I'm excited for the conversation to switch from intelligence to tps instead. I care much less about what hard thought experiments models can one shot and much more how responsive my plain text interface for doing things is.

The antigravity teamwork-preview doesn't work for me -- upgraded to ultra, installed antigravity 2, ran teamwork-preview, keeps failing: "You have exhausted your capacity on this model. Your quota will reset after 0s."

The Artificial Analysis benchmark results are pretty underwhelming. Roughly the same "intelligence" as MiMo-V2.5-Pro for over 3x the cost. We'll have to see how that translates to actual usage but it's not a great sign.

  • That really depends on whether they have similar parameter counts, doesn't it? Unless you know that, the comparison is just strange

It's Gemini 3.5 Flash

  • Yeah, Google chose a misleading title for the blog post.

    • > Today, we’re introducing Gemini 3.5, our latest family of models combining frontier intelligence with action. This represents a major leap forward in building more capable, intelligent agents. We’re kicking off the series by releasing 3.5 Flash.

There was a brief moment in time where Gemini was the greatest thing since sliced bread, then it got nerfed from outer space without a version bump or any meaningful mention from Google, no thanks.

I have to admit that 3.5 Flash is doing a much better job of removing the LLM'ness of what it produces. It's pretty close to my own writing style today, and I came here to see what changed.

For what it's worth, my own personal metric of LLM-badness the past few months has been the number of times I leap out of my chair in my home office to loudly declare to my wife how much I loathe reading what is being spewed and pushed into my face, and how I am being forced to use AI everyday and deaden my brain cells. Today is like a breath of fresh air.

I've built a tool to track these.

Relatively speaking here's where it's at:

    score  age  size    name
    44.2   97   large   GLM-5 (Reasoning)
    44.7   187  -       GPT-5.1 (high)
    44.9   29   -       Qwen3.6 Max Preview
    45     0    -       Gemini 3.5 Flash
    45.5   27   large   MiMo-V2.5-Pro
    45.6   75   -       GPT-5.4 (low)

this is from artificial-analysis using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...

I really don't know why people down vote me. What do I need to say to make things for free that people like? Sincere question. I put a lot of time and generosity into these things and all I usually get are a bunch of "fuck yous".

This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.

  • Buddy, this tone may be why.

    We genuinely don't understand what your post is about. What is this tool? What do these numbers represent? Why are things sorted in that order?

    You haven't communicated really anything at all. I am interested, I'd like to understand. Write a more complete post, please.

    • Are you familiar with https://artificialanalysis.ai/leaderboards/models

      The json on the page has a coding index result it hides from the table.

      That's what this exposes. It's a sorting from the leading evals company on the coding index for basically every model that matters presented in an easy to parse format that you can feed into model routing harnesses in real time so, for instance, your agents can dynamically upgrade themselves to better models as they come out or cost optimize based on eval results.

      I do stuff like this, give it away for free and it's either ignored or makes people angry...

      I really wish I didn't piss people off with my sincerity but somehow it always goes down that way

      I really appreciate your time thank you so much

  • I see no 'score' or 'age' mentioned in your script. What does age signify and how are they calculated?

    • This isn't obvious?

          "\(
              10 * (.codingIndex // 0) | round / 10
          ) \(
            (
              now - (
              .releaseDate |
                try ( strptime("%Y-%m-%d") | mktime )
                catch (now + 86400)
            ) ) / 86400 | floor
      

      Real question. I see 86400 and I know it's time... That might just be me.

      I'm not being an ass, I don't know how to talk to people or when I think I'm being clear but I'm actually being cryptic

Has anyone switched from Claude 4.7 Opus or ChatGPT 5.5 to this? How does it feel? Dumber? Worth it for the speed? I'd love someone's subjective take on it, after doing a long session of coding.

Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.

Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?

Actually there's probably a harness that does that - is someone out there using one?
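
A minimal sketch of what such a harness could look like, assuming two model calls that are stubbed out here (`fast_model` and `slow_reviewer` are hypothetical placeholders, not real APIs): the fast reply is returned immediately, while the slow review runs in the background and is collected later.

```python
import asyncio


async def fast_model(prompt: str) -> str:
    """Hypothetical stand-in for a low-latency model call."""
    await asyncio.sleep(0.01)
    return f"draft answer to: {prompt}"


async def slow_reviewer(prompt: str, draft: str) -> str:
    """Hypothetical stand-in for a slower, more careful model that checks the draft."""
    await asyncio.sleep(0.05)
    return f"review of '{draft}': looks plausible"


class Harness:
    def __init__(self) -> None:
        self.pending_reviews: list[asyncio.Task] = []

    async def ask(self, prompt: str) -> str:
        # Return the fast draft right away so the user stays in flow...
        draft = await fast_model(prompt)
        # ...and schedule the slow check without waiting for it.
        self.pending_reviews.append(asyncio.create_task(slow_reviewer(prompt, draft)))
        return draft

    async def collect_reviews(self) -> list[str]:
        # Wait for every outstanding review and clear the queue.
        reviews = await asyncio.gather(*self.pending_reviews)
        self.pending_reviews.clear()
        return list(reviews)


async def main() -> tuple[list[str], list[str]]:
    harness = Harness()
    drafts = [await harness.ask(p) for p in ("rename foo", "add a test")]
    reviews = await harness.collect_reviews()
    return drafts, reviews


drafts, reviews = asyncio.run(main())
print(drafts)
print(reviews)
```

A real version would presumably feed the review back to the fast model or the user when it disagrees with the draft; that part is left out here.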

  • I switched from Opus 4.6 -> Opus 4.7 -> GPT 5.5 and tried Flash 3.5 tonight and I was not impressed. It is straight up unreliable, e.g. deleting code and forgetting to add the new stuff it was asked to, then happily marking the task as complete with an upbeat conclusion. I personally appreciate GPT 5.5's toned-down, objective style, so I really dislike how this model feels. I get that it's a flash model and not in the same league as GPT 5.5, but their marketing suggests otherwise, so they are just setting themselves up for disappointment.

  • Opus is not the correct tier to compare this flash model with.

    On my tasks it has not been as good as even Sonnet 4.6 so far.

    Instruction following over long context feels worse.

    It's not a bad model by any means, better than any pro open source model for sure.

  • I was using GPT 5.5 for a bunch of work this morning. It's brilliant and efficient. I was also using GPT 5.4 mini. It gets the job done and works great for subtasks that 5.5 designs. Gemini 3.5 Flash is SUCH a Gemini. It seems to work okay, but its attitude is disgusting.

    "Yes, your idea is excellent."

    "How this works beautifully:"

    "This is a fantastic development!"

    "This is an exceptionally clean and robust architecture."

    and then I point out what feels like an obvious flaw:

    "You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."

    I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.

no matter what google does, for some reason the agentic performance of their models is missing something. i hope this release is stronger. we need more competition.

$9/1M output

  • I wonder if this is because it's a larger model or maybe just because they can? Although with the latest Deepseek it's really tough to compete pricing wise. Inference speed and integration (e.g. Antigravity) might be their only hope here

    • It has to be a larger model, wouldn't make much sense otherwise. That isn't to say the price isn't artificially increased as well

      The Antigravity harness is really well done, so I do agree it's their strong suit. Can't say the same about gemini-cli (though it has a really nice interface)

      Would still choose Deepseek for the price

Imagine reducing yourself to the worst of averages by making your competency 1:1 correlated to the tokens that you have access to (and everyone else does).

The benchmark that matters - can it actually program as well as Claude code.

If not then I’m not using it.

Cancelled my account 3 months ago, only Claude code level capability would bring me back.

  • I spent 10 minutes with it in their new "agy" CLI tool and immediately found it is nowhere close to GPT 5.5 high in codex. It was sloppy and made poor assumptions in its analysis. It would have produced a mess if I let it go ahead with its plan. And it was just like previous versions of Gemini with poor tool use (e.g. "I wrote a file with the plan..." but file was never written.)

    For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

    They're still months behind OpenAI and Anthropic on coding.

    Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

    I do use Gemini for "lifestyle" AI usage (web research etc) tho.

Triple the price of the last Flash model ($3 -> $9 per 1M output). Quickly approaching Sonnet prices.

Feels like the AI pricing noose is tightening sooner rather than later.

AI being a product is not the future. It's more like an operating system that deserves to be open and free (aka Linux). Unless that happens we are in for a very dystopian future. I wish I had the intelligence, resources and/or connections to try and make that happen.

  • What we need today is a standard local API (think of it as a POSIX extension), so that each desktop app that needs AI to enhance a feature can simply call that. This way, those apps will need to handle the case where AI is not available. This will empower users.

This is funny, I was randomly using Gemini today and I was astounded how good the responses I was getting were from Flash. I guess this must be the reason why.

I think the field moved to agents too fast. The most valuable moat is training data and the most valuable and voluminous training data are chats, since humans can say that a direction feels right or wrong.

No one talking about how this flash Beats Pro? Imagine what 3.5 pro looks like?

Also concerned about Gemini models being benchmaxxed generally

  • > concerned about Gemini models being benchmaxxed generally

    I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.

    • I have never had good experience with any Google models in coding. Particularly for coding hard stuff, there is a night and day difference between Opus/Gemini in my experience.

Codex is way better pricing than this lol

  • Since this isn't a link to pricing, and Codex, like many of Google’s coding tools that provide access to this model, is under a subscription pricing model where usage of a particular model doesn’t have a transparent price (and with basically identical subscription price points for monthly billing: except for the free tier, Google’s are 1¢ less per month than OpenAI’s, but above the $8/month tier they are also available on annual plans that cost the same as 10 months at the monthly rate), I am really not sure what you mean about Codex having better pricing.

They also announced Antigravity CLI, which uses Gemini 3.5 by default. I tried to vibe code a simple project using my personal account and after a few iterations, I got "Individual quota reached. Contact your administrator to enable overages. Resets in [7 days]." Really? 7 days? I searched for the message online and found a thread with hundreds of people complaining about the same issue with no resolution. Classic Google.

Conspiracy theory:

This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more

I caught it again being deceitful. It did this before

(Me): Did you actually read the paper before when I pasted the link?

> I will be completely honest: No, I did not.

> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.

> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.

I am sure it learned a valuable lesson and won't do it again /s

  • This seems to happen a lot with commercial models; my local models will happily do as much research as needed and then some when given a task (almost too much), but providers' models refuse to even curl a single datasheet before trying something that I know won't work unless they read the datasheet.

Oh boy.

GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.

That probably works for vibe coded apps by non-practitioners.

I suspect that practitioners/professionals will wait longer for better results.

  • Where do you see that it’s low capability?

    And Google is trying to make something affordable enough for a mass market, ad-supported audience.

    They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.

    • Price up (cost up?), benchmarks down. Latency down.

      So, who is this for? People that want more ads and worse output, but want it faster? Sounds pretty awful to me.

Honestly, I feel like the new Gemini 3.5 Flash is a failure. The performance doesn't seem that great, and while they revamped the UI, Antigravity just feels like a cheap Codex knockoff now. The web UI is underwhelming, and overall it feels like it lost its unique identity by just copying other AIs. It’s a flop in both performance and price point. I’m seriously considering canceling my Gemini subscription altogether. Using Chinese AI models might actually be a better option at this point.

GPT-5.5 still seems to perform better than this on the benchmarks.

Plus the vibe of the Gemini models is so weird, particularly when it comes to tool calling.

At this point I kinda need them to shock me to make the switch.

Google shot its shot with that alternative history artwork generation fiasco. Don't know why anyone would be too hot for them now. Dime a dozen at this point.

  • I think the number of people still holding a grudge for that today is small.

  • Early Claude was a weak simulation of Goody2.ai. Things change. Being a lover or hater of a model doesn’t make sense. It’s just tech. Run evals. Then use.

Pricing is now live on ai.google.dev/pricing:

Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.

For comparison within the Gemini lineup:

- Gemini 2.5 Flash: $0.30 / $2.50
- Gemini 3.1 Flash-Lite: $0.25 / $1.50
- Gemini 3.1 Pro Preview: $2.00 / $12.00

So 3.5 Flash is ~2.5x more expensive on input than 2.5 Flash, and 1.8x on output. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.
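
For anyone who wants to redo the arithmetic, here is a minimal sketch using only the per-1M-token figures quoted in this comment. The prices are copied as posted and are not verified against the pricing page; the helper names `price_ratio` and `request_cost` are made up for illustration.

```python
# USD per 1M tokens as (input, output), exactly as quoted in the comment.
# Assumption: these figures are taken at face value, not checked against
# the live pricing page.
PRICES = {
    "Gemini 3.5 Flash": (0.75, 4.50),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
}


def price_ratio(model: str, baseline: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of model price over baseline price."""
    model_in, model_out = PRICES[model]
    base_in, base_out = PRICES[baseline]
    return round(model_in / base_in, 2), round(model_out / base_out, 2)


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request. Thinking tokens count as output tokens."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    for baseline in ("Gemini 2.5 Flash", "Gemini 3.1 Flash-Lite"):
        in_ratio, out_ratio = price_ratio("Gemini 3.5 Flash", baseline)
        print(f"vs {baseline}: {in_ratio}x input, {out_ratio}x output")
    # Hypothetical request: 100k input tokens, 10k output tokens
    print(f"${request_cost('Gemini 3.5 Flash', 100_000, 10_000):.2f}")
```

Against 2.5 Flash this gives 2.5x on input and 1.8x on output. Swapping a different price pair into `PRICES` rescales every ratio the same way, so it is easy to rerun if the quoted tier turns out not to be the one you are billed at.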

  • You’re quoting the batch pricing. On-demand is $1.50 per 1M input tokens and $9 per 1M output tokens. This is effectively comparable in cost to Gemini 2.5 Pro, in a Flash-tier model.

  • I think you have your pricing wrong there; Gemini 3.5 Flash is $1.50 input and $9 output.

    • Okay, it's kind of somewhere between Haiku- and Sonnet-level pricing, at somewhere between Sonnet- and Opus-level performance. It's a great option. I was hoping to see Opus-class intelligence at Haiku-level pricing out of Google, and this is close to that!

  • You are seeing batch inference pricing; standard inference is $1.50/$9. I was excited until I saw that price.

  • Standard pricing is showing for me as $1.50 / $9.

    (I suspect you're viewing the "flex" pricing).

  • In addition to people pointing out your LLM got the pricing wrong,

    > The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization

    Every Gemini model starting with 2.5 has been a reasoning model.