I find it interesting how OpenAI came out with a $200 plan, Anthropic did $100 and $200, then Gemini upped it to $250, and now Grok is at $300.
OpenAI is the only one that says "practically unlimited" and I have never hit any limit on my ChatGPT Pro plan. I hit limits on Claude Max (both plans) several times.
Why are these companies not upfront about what the limits are?
Because they want to have their cake and eat it too.
A fair pricing model would be token-based, so that a user can see how much each query costs, and only pay for what they actually used. But AI companies want a steady stream of income, and they want users to pay as much as possible while using as little as possible. Therefore they ask for a monthly or even yearly price with an unknown number of tokens included, such that you will always pay more than with token-based payments.
Personally, I prefer having a fixed, predictable price rather than paying for usage. There is something psychologically nicer about it to me, and I find myself rationing my usage more when I am using the API (which is effectively what you describe already, just minus the UI).
I don't think it's that; I think they just want people to onboard onto these things before understanding what the actual cost might be once they're no longer subsidized by megacorps. Similar to loss-leading endeavors like Uber and Lyft in the 2010s, I suspect that showing the actual cost of inference would raise questions about the cost-effectiveness of these things for a lot of applications. Internally, Google's data query surfaces tell you the cost in terms of SWE time (e.g. this query cost 1 SWE-hour), since the incentives are different.
> Why are these companies not upfront about what the limits are?
Most likely because they reserve the right to dynamically alter the limits in response to market demands or infrastructure changes.
See, for instance, the Ghibli craze that dominated ChatGPT a few months ago. At the time OpenAI had no choice but to severely limit image generation quotas, yet today there are fewer constraints.
Per-usage pricing discourages use which limits how critical a service can be to your life or workflow. These companies want you to rely on the service such that you’ll pay the price. One customer might use it once a day and find the price reasonable; another may use it 10 times a day and still find the price reasonable. This kind of broad pricing allows for this variation.
Because if you are transparent about the limits, more people will start to game the limits, which leads to lower limits for everyone – which is a worse outcome for almost everyone.
tldr: We can't have nice things, because we are assholes.
Been using Gemini for a few months, somehow it's gotten much, much worse in that time. Hallucinations are very common, and it will argue with you when you point it out. So, don't have much confidence.
In my experience with chat, Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.
Pro is frustrating because it too often won't search to find current information, and just gives stale results from before its training cutoff. Flash doesn't do this much anymore.
For coding I use Pro in Gemini CLI. It is amazing at coding, but I'm actually using it more to write design docs, decomp multi-week assignments down to daily and hourly tasks, and then feed those docs back to Gemini CLI to have it work through each task sequentially.
With a little structure like this, it can basically write its own context.
I like flash because when it's wrong it's wrong very quickly. You can either change the prompt or just solve the problem yourself. It works well for people who can spot the answer as being "wrong"
interesting
out of all "thinking models," I struggle with Gemini the most for coding. Just can't make it perform. I feel like they silently nerfed it over the last months.
Yeah, it's hard to measure. Not sure about our expectations, though I recall way better output when I first started using Gemini 2.5 vs now. It seems to be stupider and more headstrong somehow?
my recent experience with flash and using it to prototype a c++ header i was developing:
- it was great to brainstorm with but it routinely introduced edits and dramatic code changes, often unnecessary and many times causing regressions to existing, tested code.
- numerous times recursion got introduced to revisions without being prompted or without any justified or good reason
- hallucinated a few times regarding c++ type deduction semantics
i eventually had to explicitly tell it to not introduce edits in any working code being iterated on without first discussing the changes, and then being prompted by me to introduce the edits.
all in all i found base chatgpt a lot more productive and accurate and ergonomic for iterating (on the same problem just working it in parallel with gemini).
- code changes were not always arbitrarily introduced or dramatic
- it attempted to always work with the given code rather than extrapolate and mind read
- hallucinated on some things but quickly corrected and moved forward
- was a lot more interactive and documenting
- almost always prompted me first before introducing a change (after providing annotated snippets and documentation as the basis for a proposed change or fix)
however, both were great tools to work with when it came to cleaning up or debugging existing code, especially unit testing or anything related to TDD
Same here. I stopped using Gemini Pro because, on top of its hard-to-follow verbosity, it was giving contradicting answers. Things that Claude Sonnet 4 could answer.
Speaking of Sonnet, I feel like it's closing the gap to Opus. After the new quotas I started to try it before Opus and now it gets complex things right more often than not. This wasn't my experience just a couple of months ago.
Via the chat prompt mostly, and sometimes via Copilot. It was quoting me sources and links that didn't exist, and when I told it the links were wrong it doubled down forever, no matter how hard I tried to tell it otherwise. Even sent screenshots, etc.
Kinda just got stuck in a self-confident loop that time. Other times the output is just far worse than Claude for similar use cases, where a couple months back it was stronger, at least in my subjective experience.
I can't even convince Gemini CLI, while planning things, not to go off and make a bunch of random changes on its own. Even after being very clear not to do so, and intercepting to tell it to stop doing that, it just continues on, fucking everything up.
Claude Code gets the most out of Anthropic’s models, that’s why people love it.
Conversely, Gemini CLI makes Gemini Pro 2.5 less capable than the model itself actually is.
It's such a stark difference that I've given up on Gemini CLI even with it being free, but I still use the model on a regular basis for situations amenable to a prompt interface. It's a very strong model.
That's my experience too, when I give Gemini CLI a big, general task and just let it run.
But if I give it structure so it can write its own context, it is truly astonishing.
I'll describe my big, general task and tell it to first read the codebase and then write a detailed requirements document, and not to change any code.
Then I'll tell it to read the codebase and the detailed requirements document it just wrote, and then write a detailed technical spec with API endpoints, params, pseudocode for tricky logic, etc.
Then I'll tell it to read the codebase, and the requirements document it just wrote, and the tech spec it just wrote, and decomp the whole development effort into weekly, daily and hourly tasks to assign to developers and save that in a dev plan document.
Only then is it ready to write code.
And I tell it to read the code base, requirements, tech spec and dev plan, all of which it authored, and implement Phase 1 of the dev plan.
It's not all mechanical and deterministic, or I could just script the whole process. Just like with a team of junior devs, I still need to review each document it writes, tweak things I don't like, or give it a better prompt to reflect my priorities that I forgot to tell it the first time, and have it redo a document from scratch.
But it produces 90% or more of its own context. It ingests all that context that it mostly authored, and then just chugs along for a long time, rarely going off the rails anymore.
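To make that staged workflow concrete, here is a minimal sketch of the loop, assuming a hypothetical run_agent helper that stands in for however you drive your agent (the commenter uses Gemini CLI interactively); the file names and prompt wording are illustrative only.

```python
# Sketch of the staged "write your own context" workflow described above.
# run_agent is a hypothetical stand-in: wire it to Gemini CLI, the llm tool,
# or whatever agent you use. File names and prompt wording are illustrative.
from pathlib import Path

def run_agent(prompt: str) -> str:
    """Hypothetical: send `prompt` to your coding agent and return its reply."""
    raise NotImplementedError("connect this to your agent of choice")

STAGES = [
    ("requirements.md",
     "Read the codebase and write a detailed requirements document for: {goal}. "
     "Do not change any code."),
    ("tech_spec.md",
     "Read the codebase and requirements.md, then write a detailed technical "
     "spec with API endpoints, params, and pseudocode for tricky logic."),
    ("dev_plan.md",
     "Read the codebase, requirements.md, and tech_spec.md, then decompose the "
     "work into weekly, daily, and hourly tasks and save them as a dev plan."),
]

def plan_then_build(goal: str) -> None:
    context = ""
    for filename, template in STAGES:
        doc = run_agent(context + template.format(goal=goal))
        Path(filename).write_text(doc)               # review/tweak each doc before moving on
        context += f"\n--- {filename} ---\n{doc}\n"  # the model builds its own context
    run_agent(context + "Implement Phase 1 of the dev plan.")  # only now write code
```

The point of the structure is that each stage's prompt carries all the documents the model itself produced earlier, which is what keeps long runs on the rails.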
> If you’re a Google AI Ultra subscriber, you can use Deep Think in the Gemini app today with a fixed set of prompts a day by toggling “Deep Think” in the prompt bar when selecting 2.5 Pro in the model drop down.
If "fixed set" means a fixed number, it would be nice to know how many.
Otherwise I would like to know what "fixed set" means here.
Apparently the model will think for 30+ minutes on a given prompt. So it seems it's more for research or dense multi-faceted problems than for general coding or writing fan fic.
Asking AI to create 3D scenes like the example on the page seems like asking someone to hammer something with a screwdriver. We would need AI-compatible 3D software that either has easier-to-use voxels built in, so it can create something similar to pixel art, or easier math-defined curves that can be meshed. Either way, AI just does not currently have the right tools to generate 3D scenes.
One of the most alluring things about LLMs though is that it is like having a screwdriver that can just about work like a hammer, and draft an email to your landlord and so on
I'm wondering if 'slow AI' like this is a temporary bridge, or a whole new category we need to get used to. Is the future really about having these specialized 'deep thinkers' alongside our fast, everyday models? Or is this just a clunky V1 until the main models get this powerful on their own in seconds?
It's not unreasonable to think that with improvements on the software side - a Saturn-like model based on diffusion could be this powerful within a decade - with 1s responses.
I'd highly doubt in 10 years, people are waiting 30m for answers of this quality - either due to the software side, the hardware side, and/or scaling.
It's possible in 10 years, the cost you pay is still comparable, but I doubt the time will be 30m.
It's also possible that there's still top-tier models like this that use absurd amounts of resources (by today's standards) and take 30m - but they'd likely be at a much higher quality than today's.
The pressure in the other direction is tool use. The more a model wants to call out to a series of tools, the longer the delay will be, just because the serial process isn't part of the model.
We're optimizing for quality over performance right now; at some point the pendulum will swing the other way, but there might be problems that require deep thinking, just like we have a need for supercomputers to run jobs for days today.
This comes at a time where my experience with Gemini is lacking, it seems to get worse. It's not picking up on my intention, sometimes replies in the wrong language, etc. Either that or I am just transparent that it's a tool and its feelings are hurt. I've had to call it a moron several times, and it was funny when it started reprimanding me for my foul language once. But it was wrong. This behavior seems new. I could never trust it to not do random edits everywhere in a document, so nowadays I use it to check Claude, which can be trusted with a document.
I had a good experience with gemini-cli (which I think uses pro initially?). It's not very good, but it's very fast. So when it's wrong it's wrong very quickly and you can either solve it yourself or pivot your prompt. For a professional software engineer this actually works out OK.
I would be interested in reading about how people who are paying for access to Google's top AI plan are intending to use this. Do you have any examples of immediate use-cases that might benefit?
Is Google using this tool internally? One would expect them to give some examples of how it's helping internal teams accelerate or solve more challenging problems, if they were eating their own dogfood.
I've been using Gemini 2.5 Pro for a few months now and have found my experience to be very positive. I primarily use it for coding and through the API, and I feel it's consistently improving. I haven't yet tried Deep Think, though.
Upgraded and quickly hit my limit. I find that they do have limits; I just wish they were more transparent about them. Even if it's just a vague statement about limited usage. I assumed it would be similar to regular Gemini 2.5 on the Pro plan, but it's not.
Is this a joke? I am paying $250/month because I want to use Deep Think and I literally just burned through my daily quota in 5 PROMPTS. That is insane!
would have made possible a class of jokes:
- i wonder how many iterations we need with it before succeeding
- i ran it 5 minutes and the deep thought model started to hallucinate, i hope not because oxygen deprivation
...
I’m never the one to defend AI, but what do you mean? Is it the “AI overview” that pops up on Google? Other than that, I would say Gemini is definitely less in your face than ChatGPT for example
My company uses google workspace and every google doc, spreadsheet, calendar, online meeting and search puts nonstop callouts and messages about using Gemini. It's gotten so bad that I'm about to try building a browser extension to block that bullshit. It clutters the UI and nags. If I wanted that crap, I'd turn it on.
The latest Samsung updates (and Pixel too I imagine?) baked it into the OS and made it difficult/impossible to disable. It’s probably that. I also agree, aside from that I haven’t seen anything about Gemini at all, I think their marketing is quite poor for something so important.
I started doing some experimentation with this new Deep Think agent, and after five prompts I reached my daily usage limit. For $250 USD/mo, that's what you'll be getting, folks.
It's just bizarrely uncompetitive with o3-pro and Grok 4 Heavy. Anecdotally, this was the one feature that enthusiasts in the AI community were interested in to justify the exorbitant price of Google's Ultra subscription. I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
Performance-wise, so far I couldn't even tell. I provided it with a challenging organizational problem that my business was facing, with the relevant context, and it proposed a lucid and well-thought-out solution that was consistent with our internal discussions on the matter. But o3 came to an equally effective conclusion for a fraction of the cost, even if the report was less "cohesive". I guess I'll have to wait until tomorrow to learn more.
They might not have been ready/optimized for production, but they still wanted to release it before the Aug 2 EU AI Act deadline; this way they have 2 years for compliance. So the strategy of aggressive rate limits for a few users makes sense.
wheee, great way to lock in incumbents even more or lock out the EU from startups
Several years ago I thought a good litmus test for mastery of coding was not being able to find a solution using internet search, nor to get well-written questions about esoteric coding problems answered on StackOverflow. For a while, I would post a question and answer my own question after I solved the problem, for posterity (or AI bots). I always loved getting the "I've been working on this for 3 days and you saved my life" comments.
I've been working on a challenging problem all this week and all the AI copilot models are worthless at helping me. Mastery in coding is being alone, when nobody else nor any AI copilot can help you and you have to dig deep into generalization, synthesis, and creativity.
(I thought to myself, at least it will be a little while longer before I'm replaced with AI coding agents.)
Your post misses the fact that 99% of programming is repetitive plumbing and that the overwhelming majority of developers, even Ivy League graduates, suck at coding and problem solving.
Thus, AI is a great productivity tool for the overwhelming majority of problems out there, if you know how to use it. And it's a boost even for those who are not good at the craft.
This whole narrative of "okay, but it can't replace me in this or that situation" is honestly somewhere between an obvious touché (why would you think AI would replace rather than empower those who know their craft?) and stale Luddism.
They're remarkably useless on stuff they've seen but not had up-weighted in the training set. Even the best ones (Opus 4 running hot, Qwen and K2 will surprise you fairly often) are a net liability in some obscure thing.
Probably the starkest example of this is build system stuff: it's really obvious which ones have seen a bunch of `nixpkgs`, and even the best ones seem to really struggle with Bazel and sometimes CMake!
The absolute prestige high-end ones, running flat out and burning 100+ dollars a day, are a lift on pre-SEO Google/SO I think... but it's not like a blowout vs. a working search index. Back when all the source, all the docs, and all the troubleshooting for any topic on the whole Internet were above the fold on Google? It was kinda like this: type a question in the magic box and working-ish code pops out. Same at a glory-days FAANG with the internal mega-grep.
I think there's a whole cohort or two who think that "type in the magic box and code comes out" is new. It's not new, we just didn't have it for 5-10 years.
I have similar issues with support from companies that heavily push AI and self-serve models and make human support hard. I'm very accomplished and highly capable. If I feel the need to turn to support, the chances the solution is in a KB are very slim; same with AI. It'll be a very specific situation with a very specific need.
This has been my thought for a long time - unless there is some breakthrough in AI algorithms, I feel like we are going to hit a "creativity wall" for coding (and some other tasks).
Curious to know what those challenging programming problems are. Can you share some examples?
> It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy.
In my experience Grok 4 and 4 Heavy have been crap. Who cares how many requests you get with it when the response is terrible. Worst LLM money I’ve spent this year and I’ve spent a lot.
It's interesting how multi-dimensional LLM capabilities have proven to be.
OpenAI reasoning models (o1-pro, o3, o3-pro) have been the strongest, in my experience, at harder problems, like finding race conditions in intricate concurrency code, yet they still lag behind even the initial sonnet 3.5 release for writing basic usable code.
The OpenAI models are kind of like CS grads who can solve complex math problems but can't write a decent React component without yadda-yadda-ing half of it, while the Anthropic models will crank out many files of decent, reasonably usable code while frequently missing subtleties and forgetting the bigger picture.
It's just wildly inconsistent to me. Some times it'll produce a work of genius. Other times, total garbage.
It's not particularly interesting if Deep Think comes to the same (correct) conclusion on a single problem as o3 but costs more. You could ask GPT-3.5 and GPT-4 what 1+1 equals and get the same response, with GPT-4 costing more, but this doesn't tell us much about model capability or value.
It would be more interesting to know if it can handle problems that o3 can't, or if it is 'correct' more often than o3-pro on these sorts of problems.
i.e. if o3 is correct 90% of the time, but Deep Think is correct 91% of the time on challenging organisational problems, it will be worth paying $250 for an extra 1% certainty (assuming the problem is high-value / high-risk enough).
> It would be more interesting to know if it can handle problems that o3 can't do
Suppose it can't. How will you know? All the datapoints will be "not particularly interesting".
> I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
I agree that’s not a good posture, but it is entirely unsurprising.
Google is probably not profiting from AI Ultra customers either, and grabbing all that sweet usage data from the free tier of AI Studio is what matters most to improve their models.
Giving free access to the best models allows Google to capture market share among the most demanding users, which are precisely the ones that will be charged more in the future. In a certain sense, it’s a great way for Google to use its huge idle server capacity nowadays.
I'm burning well over 10 million tokens a day on the free tier. 99% of the input is freely available data; the rest is useless. I never provided any feedback. Sure, there is some telemetry; they can have it.
I doubt I'm an isolated case. This Gemini gig will cost Google a lot; they pushed it onto all Android phones around the globe. I can't wait to see what happens when they have to admit that not many people will pay over 20 bucks for "AI". And I would pay well over 20 bucks just to see the face of the C-suite next year when one of them dares to explain in simple terms that there is absolutely no way to recoup the DC investment, and that powering the whole thing will cost the company 10 times that.
Similar complaints are happening all over reddit with the Claude Code $200/mo plan and Cursor. The companies with deep VC funding have been subsidizing usage for a year now, but we're starting to see that bleed off.
I think the primary concern of this industry right now is how, relative to the current latest generation models, we simultaneously need intelligence to increase, cost to decrease, effective context windows to increase, and token bandwidths to increase. All four of these things are real bottlenecks to unlocking the "next level" of these tools for software engineering usage.
Google isn't going to make billions on solving advanced math exams.
Agreed, and big context windows are key to mass adoption in wider use cases beyond chatbots (random ex: in knowledge management apps, being able to parse the entire note library/section and hook it into global AI search), but those use cases are decidedly not areas where $200 per month subscriptions can work.
I'll hazard to say that cost and context windows are the two key metrics to bridge that chasm with acceptable results... As for software engineering, though, that cohort will be demanding on all fronts for the foreseeable future, especially because there's a bit of a competitive element. Nobody wants to be the vibecoder using sub-par tools compared to everyone else showing off their GitHub results and making sexy blog posts about it on HN.
> Similar complaints are happening all over reddit with the Claude Code $200/mo
I would imagine 95% of people never get anywhere near hitting their CC usage limits. The people who are getting rate-limited have ten windows open, are auto-accepting edits, and are YOLO'ing any kind of coherent code quality in their codebase.
It could be that your problem was too simple to justify the use of Deep Think.
But yes, Google should have figured that out and used a less expensive mode of reasoning.
Model routing is deceptively hard though. It has halting problem characteristics: often only the smartest model is smart enough to accurately determine a task's difficulty. And if you need the smartest model to reliably classify the prompt, it's cheaper to just let it handle the prompt directly.
This is why model pickers persist despite no one liking them.
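To make the circularity concrete, a naive two-tier router might look like the sketch below; cheap_model and smart_model are hypothetical stand-ins for model calls, not any particular provider's API.

```python
# Naive two-tier router, just to show where the circularity bites: the cheap
# grader has to be smart enough to recognize a hard prompt in the first place.
# cheap_model and smart_model are hypothetical stand-ins for model calls.
from typing import Callable

def route(prompt: str,
          cheap_model: Callable[[str], str],
          smart_model: Callable[[str], str]) -> str:
    grade = cheap_model(
        "Rate the difficulty of the following request from 1 (trivial) to 5 "
        "(needs deep reasoning). Reply with a single digit.\n\n" + prompt
    ).strip()
    # If the grader misjudges a hard prompt as easy, the user silently gets a
    # worse answer -- hence the persistence of the manual model picker.
    return smart_model(prompt) if grade in {"4", "5"} else cheap_model(prompt)
```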
"I'm sorry but that wasn't a very interesting question you just asked. I'll spare you the credit and have a cheaper model answer that for you for free. Come back when you have something actually challenging."
Interestingly Gemini CLI has a very generous free quota. Is Google's strategy just overpricing some stuff and subsidizing the underpriced stuff?
It doesn't. It's not "1000 Gemini Pro" requests for free; Google misled everyone. It's 1000 Gemini requests, Flash included. You get like 5-7 Gemini Pro requests before you get limited.
This is the fundamental pricing strategy of all modern software in fact.
Underpriced for consumers, overpriced for businesses.
I've found the free version swaps away from Pro incredibly fast. Our company has Gemini but can't even get that - we were being asked to do everything by API key.
I suspect that the main goal here was to grab the top spot in a bunch of benchmarks, and being counted as an "available" model.
They're using it as a major inducement to upgrade to AI Ultra. I mean, the image and video stuff is neat, but adds no value for the vast majority of AI subscribers, so right now this is the most notable benefit of paying 12x more.
FWIW, Google seems to be having some severe issues with oddball, perhaps malfunctioning quota systems. I'm regularly finding that extraordinarily little use of gemini-cli hits the purported 1000-request limit, when in reality I've made fewer than 10 requests.
I'm not in the AI sceptic camp (LLMs can be useful for some tasks, and I use them often), but this is the big issue at the moment.
In order for agentic AI to replace (for example) a software engineer, we need a big step up in capability, around an order of magnitude. These chain of thought models do get a bit closer to that, although in my opinion we're still a way away.
However, at the same time we need about an order of magnitude decrease in price. These models are expensive even at the current prices tokens are sold at, which seem to be below the actual cost. And these massive CoT models are taking us in completely the wrong direction in terms of cost.
The part I cannot understand is why, for many AI offerings, I cannot make out at a glance what each pricing tier does.
What happened to the simplicity of Steve Jobs' 2x2 (consumer vs. pro, laptop vs. desktop)?
The rate limits are not because of compute performance or the lack of. It's to stop people from training their own models on the very cutting edge.
What was the experimentation? Can you share with us so we can see how "bizarrely uncompetitive" it is?
"Bizarrely uncompetitive" is referencing the 5 uses per day, not the performance itself.
I'd be interested in tests involving tasks with large amounts of context. Parallel thinking could conceivably be useful for a variety of specific problem types. Having more context than any specific chain of thought can reasonably attend to might be one of them.
I have Ultra. Will not be renewing it. Useless; at least have global limits and let people decide how they want to use them. If I have tokens left, why can't I use them for code assist?
it turns out that AI at this level is very expensive to run (capex, energy). my bet is that AI itself won't figure out how to overcome these constraints and reach escape velocity.
Perhaps this will be the incentive to finally get fusion working. Big tech megacorps are flush with cash and could fund this research many times over at current rates. E.g. NIF is several billion dollars; Google alone has almost $100B in the bank.
Mainframes are the only viable way to build computers. Micro processors will never figure out how to get small and fast enough for personal computers to reach escape velocity.
> it turns out that AI at this level is very expensive to run (capex, energy)
If it's CapEx it's -- by definition -- not a cost to run. Energy costs will trend to zero.
Our minds are incredibly energy efficient, which leads me to believe it is possible to figure out, but it might be a human rather than an AI that gives us something more akin to a biological solution.
Uncompetitive how? On what task and eval?
Gemini is consistently the only model that can reason over long context in dynamic domains for me. Deep Think just did that reviewing an insane amount of Claude Code logs - for a meta analysis task of the underlying implementation. Laughable to think Grok could do that.
Ladies and Gentlemen,
Here's Gemini Deep Think when prompted with:
"Create a svg of a pelican riding on a bicycle"
https://www.svgviewer.dev/s/5R5iTexQ
Beat Simon Willison to it :)
If it's on HN and is a meme at this point, it will end up in the training set.
It's kind of fun to imagine that there is an intern in every AI company furiously trying to get nice looking svg pelicans on bicycles.
OK that is recognizably a pelican, pretty great!
This feels like the best pelicanbike yet. The singularity might be closer than we imagine.
Time for a leaderboard?
Meme benchmarks like this and Strawberry are funny but very easy to game; I bet they're all over training sets nowadays.
If you train a model on the SVGs of pelicans on a bicycle that are out there already you're going to get a VERY weird looking pelican on a bicycle: https://simonwillison.net/tags/pelican-riding-a-bicycle/
Truly worth the price. We live in the future.
Honestly the first one where I would have guessed "this is a pelican riding a bicycle" if presented with just the image and 0 other context. This and the voxel tower are fairly impressive - we're seeing some semblance of visual / spatial understanding with this model.
Interestingly it seems to draw the bike's seat too (around line 34) which then gets covered by the pelican.
Easily the best one yet!
First I've seen of human quality. Maybe we are reaching API. (Artificial Pelican Intelligence)
Saw one today from gpt5 (via some api trick someone found) that was better than this, let me see if I can find it.
Pelican:
https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
Longer thread re gpt5:
https://old.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_alr...
Can it do circuit diagrams? Because that's one practical area where I think the AI models are lacking.
Not yet, or schemas. It can do netlists, though! But it's much harder to go from "Netlist -> Diagram/Schema" than the other way around :(
It was an expensive SVG, but it did a good job.
The bike is an actual bike with a diamond frame.
Now add irrelevant facts about cats and see if it can still draw it.
How much would it cost at list-price API pricing?
I don't have access but I wonder whether a dog on a jetski would be nearly as good
You can spin up a version of this at home using simonw's LLM cli with the llm-consortium plugin.
Bonus 1: Use any combination of models. Mix n match models from any lab.
Bonus 2: Serve your custom consortium on a local API from a single command using the llm-model-gateway plugin and use it in your apps and coding assistants.
https://x.com/karpathy/status/1870692546969735361
You can also build a consortium of consortiums like so:
Or even make the arbiter a consortium:
or go openweights only:
https://GitHub.com/irthomasthomas/llm-consortium
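For a rough idea of what the fan-out-and-arbitrate pattern does, here is a minimal sketch written directly against the llm Python API rather than the llm-consortium plugin; the model IDs are placeholders and assume you have the corresponding plugins and API keys configured.

```python
# Fan out the same question to several models, then have an arbiter compare
# the answers and produce a final one. Uses simonw's llm Python API directly;
# the model IDs below are placeholders -- substitute whatever you have set up
# (non-OpenAI models need their respective llm plugins and API keys).
import llm

MEMBERS = ["gpt-4o-mini", "claude-3.5-haiku", "gemini-2.5-flash"]
ARBITER = "gpt-4o-mini"

def consortium(question: str) -> str:
    # 1. Fan the same question out to every member model.
    answers = [llm.get_model(m).prompt(question).text() for m in MEMBERS]
    # 2. Ask an arbiter model to compare the candidates and synthesize a final answer.
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    verdict = llm.get_model(ARBITER).prompt(
        f"Question: {question}\n\n{numbered}\n\n"
        "Compare these answers and give the single best final answer."
    )
    return verdict.text()

print(consortium("What are the trade-offs of parallel 'deep think' sampling?"))
```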
1. Why do you say this is a version of Gemini Deep Think? It seems like there could be multiple ways to build a multi-agent model to explore a space.
2. The covariance between models leads to correlated errors, lowering the individual effectiveness of each contributing model. It would seem to me that you'd want to find a set of model architectures/prompt_configs that minimizes covariance while maintaining individual accuracy, on a benchmark set of problems that have multiple provable solutions (i.e. not one path to a solution that is objectively correct).
I didn't mean to suggest it's a clone of Deep Think, which is proprietary. I meant that it's a version of parallel reasoning. Got the idea from Karpathy's tweet in December and built it. Then DeepMind published the "Evolving Deeper LLM Thinking" paper in January with similar concepts. Great minds, I guess? https://arxiv.org/html/2501.09891v1
2. The correlated errors thing is real, though I'd argue it's not always a dealbreaker. Sometimes you want similar models for consistency, sometimes you want diversity for coverage. The plugin lets you do either - mix Claude with kimi and Qwen if you want, or run 5 instances of the same model. The "right" approach probably depends on your use case.
Is the European Union a consortium of consortiums?
Thanks! Do you happen to know if there any OpenWebUI plugins similar to this?
You can use this with OpenWebUI already. Just `llm install llm-model-gateway`. Then, after you save a consortium, run `llm serve --host 0.0.0.0`. This will give you an OpenAI-compatible endpoint which you add to your chat client.
I am not seeing this llm serve command
it's a separate plugin rn. llm install llm-model-gateway
Approach is analogous to Grok 4 Heavy: use multiple "reasoning" agents in parallel and then compare answers before coming back with a single response, taking ~30 minutes. Great results, though it would be more fair for the benchmark comparisons to be against Grok 4 Heavy rather than Grok 4 (the fast, single-agent model).
Yeah the general “discovery” is that using the same reasoning compute effort, but spreading them over multiple different agents generally leads to better results.
It solves the “longer thinking leads to worse results” problem by exploring multiple paths of thinking in parallel, but just not thinking as long on each.
> Yeah the general “discovery” is that using the same reasoning compute effort, but spreading them over multiple different agents generally leads to better results.
Isn’t the compute effort N times as expensive, where N is the number of agents? Unless you meant in terms of time (and even then, I guess it’d be the slowest of the N agents).
What makes you sure of that? From the article,
> Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques. This approach lets Gemini generate many ideas at once and consider them simultaneously, even revising or combining different ideas over time, before arriving at the best answer.
This doesn't exclude the possibility of using multiple agents in parallel, but to me it doesn't necessarily mean that this is what's happening, either.
What could “parallel thinking techniques” entail if not “using multiple agents in parallel”?
How can it not be exactly what’s happening?
That this kind of approach works is good news for local LLM enthusiasts, as it makes cloud LLMs using it more expensive, while a local LLM can do the same for free up to a point (because LLM inference is limited by memory bandwidth, not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one - until you become compute-bound, of course).
> because LLM inference is limited by memory bandwidth not compute, you can run multiple queries in parallel on your graphic card at the same speed as the single one
I don't think this is correct, especially given MoE. You can save some memory bandwidth by reusing model parameters, but that's about it. It's not giving you the same speed as a single query.
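As a back-of-envelope illustration of the disputed claim, here is a toy decode-throughput model for a hypothetical dense 40 GiB model with made-up hardware numbers; it ignores KV-cache traffic and MoE routing, which is exactly the caveat raised above.

```python
# Toy decode-throughput model for a hypothetical dense 40 GiB model on a card
# with ~1000 GiB/s of memory bandwidth and ~100 TFLOPS of usable compute.
# Ignores KV-cache reads and MoE (the caveat raised in the reply above).
def tokens_per_sec(model_gib: float, bw_gib_s: float, batch: int,
                   flops_per_token: float, peak_flops: float) -> float:
    weight_read_s = model_gib / bw_gib_s              # weights streamed once per step
    compute_s = batch * flops_per_token / peak_flops  # grows linearly with batch
    step_s = max(weight_read_s, compute_s)            # whichever bound dominates
    return batch / step_s                             # total tokens/sec across batch

for b in (1, 8, 64, 256):
    total = tokens_per_sec(40, 1000, b, 40e9, 100e12)
    print(f"batch={b:3d}  ~{total:7.1f} tok/s total, ~{total / b:5.1f} tok/s per query")
```

With these made-up numbers, per-query speed stays roughly flat until the batch is large enough to hit the compute bound, at which point per-query throughput starts to drop.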
Wait, how does this work? If you load in one LLM of 40 GB, then to load in four more LLMs of 40 GB still takes up an extra 160 GB of memory right?
Grok-4 heavy benchmarks used tools, which trivializes a lot of problems.
Dumb (?) question but how is Google's approach here different than Mixture of Experts? Where instead of training different experts to have different model weights you just count on temperature to provide diversity of thought. How much benefit is there in getting the diversity of thought in different runs of the same model versus running a consortium of different model weights and architectures? Is there a paper contrasting results given fixed computation between spending that compute on multiple runs of the same model vs different models?
MoE is just a way to add more parameters/capacity to a model without making it less efficient to run, since it's done in a way that not all parameters are used for each token passing through the model. The name MoE is a bit misleading, since the "experts" are just alternate paths through part of the model, not having any distinct expertise in the way the name might suggest.
Just running the model multiple times on the same input and selecting the best response (according to some judgement) seems a bit of a haphazard way of getting much diversity of response, if that is really all it is doing.
There are multiple alternate approaches to sampling different responses from the model that come to mind, such as:
1) "Tree of thoughts" - generate a partial response (e.g. one token, or one reasoning step), then generate branching continuations of each of those, etc, etc. Compute would go up exponentially according to number of chained steps, unless heavy pruning is done similar to how it is done for MCTS.
2) Separate response planning/brainstorming from response generation by first using a "tree of thoughts" like process just to generate some shallow (e.g. depth < 3) alternate approaches, then use each of those approaches as additional context to generate one or more actual responses (to then evaluate and choose from). Hopefully this would result in some high level variety of response without the cost of of just generating a bunch of responses and hoping that they are usefully diverse.
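Purely as an illustration of (2), assuming the same kind of `call_llm` stub as earlier (everything here is a placeholder, not any vendor's API):

```python
# Sketch of "plan first, then answer": brainstorm a few distinct approaches,
# generate one full answer conditioned on each, then pick the strongest.
# call_llm is a stand-in for your model call of choice.
def call_llm(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("wire this up to your own model/provider")

def plan_then_answer(question: str, n_plans: int = 3) -> str:
    plans_text = call_llm(
        f"List {n_plans} genuinely different high-level approaches to:\n"
        f"{question}\nOne short numbered line per approach.")
    plans = [p for p in plans_text.splitlines() if p.strip()][:n_plans]

    # One full answer per plan, each steered toward a different approach.
    answers = [call_llm(f"{question}\n\nSolve it using this approach: {p}",
                        temperature=0.7) for p in plans]

    joined = "\n\n".join(f"[{i}] {a}" for i, a in enumerate(answers))
    return call_llm(
        f"Question:\n{question}\n\nCandidates:\n{joined}\n\n"
        "Pick the strongest candidate and return it, fixing obvious errors.",
        temperature=0.0)
```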
Mixture of Experts isn't using multiple models with different specialties, it's more like a sparsity technique, where you massively increase the number of parameters and use only a subset of the weights in each forward pass.
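To make the sparsity point concrete, here's a toy top-k router in PyTorch (dimensions and expert count are arbitrary; real implementations batch this very differently):

```python
# Toy MoE layer: many expert MLPs exist, but each token only runs through
# the top-k of them, so compute per token stays small while total parameter
# count grows with the number of experts.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.router(x)               # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):            # only k experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

print(ToyMoE()(torch.randn(5, 64)).shape)     # torch.Size([5, 64])
```

The "experts" here are just interchangeable MLP blocks picked by a learned router, which is why the name oversells what they are.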
I am surprised such a simple approach has taken so long to actually be used. My first image-description CLI attempt did basically that: use n to get several answers, then another pass to summarize them.
People have played with (multi-)agentic frameworks for LLMs from the very beginning, but it seems like only now, with powerful reasoning models, is it really making a difference.
It's very resource intensive so maybe they had to wait until processes got more efficient? I can also imagine they would want to try and solve it in a... better way before doing this.
I built a similar thing around a year ago with autogen. The difference now is that models can really be steered towards "part" of the overall goal, and they actually follow that.
Before this, even the best "math" models were RL'd to death to only solve problems. If you wanted it to explore "method_a" of solving a problem, you'd be SoL. The model would start with "ok, the user wants me to explore method_a, so here's the solution: blablabla", and then do whatever it wanted, unrelated to method_a.
Similar things for gathering multiple sources. Only recently can models actually pick the best thing out of many instances, and work effectively at large context lengths. The previous tries with 1M context lengths were at best gimmicks, IMO. Gemini 2.5 seems the first model that can actually do useful stuff after 100-200k tokens.
I agree, but I think it's hard to get a sufficient increase in performance to justify a 3-4x increase in cost.
It's an expensive approach, and depends on assessment being easy, which is often not the case.
Surprised no one has released an app yet that pits all the major models against each other for a final answer.
Is o3-pro the same as these?
No, it doesn't take 30 minutes
This isn't the exact same model that achieved gold in the IMO a few weeks ago but is a close relative: https://x.com/OfficialLoganK/status/1951262261512659430
It's not yet available via an API.
I find it interesting how OpenAI came out with a $200 plan, Anthropic did $100 and $200, then Gemini upped it to $250, and now Grok is at $300.
OpenAI is the only one that says "practically unlimited" and I have never hit any limit on my ChatGPT Pro plan. I hit limits on Claude Max (both plans) several times.
Why are these companies not upfront about what the limits are?
Because they want to have their cake and eat it too.
A fair pricing model would be token-based, so that users can see how much each query costs and only pay for what they actually used. But AI companies want a steady stream of income, and they want users to pay as much as possible while using as little as possible. Therefore they ask for a monthly or even yearly price with an unknown number of tokens included, such that you will always pay more than with token-based payments.
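For illustration only, with completely made-up per-million-token rates (not any vendor's actual price list), transparent per-query billing is trivial to compute:

```python
# Hypothetical per-token pricing: the rates below are invented for
# illustration, not taken from any real price list.
RATE_IN_PER_MTOK = 2.00     # USD per million input tokens (assumed)
RATE_OUT_PER_MTOK = 10.00   # USD per million output tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * RATE_IN_PER_MTOK
            + output_tokens * RATE_OUT_PER_MTOK) / 1_000_000

# e.g. a long "deep think" style query: 20k tokens in, 50k reasoning+output
print(f"${query_cost(20_000, 50_000):.2f}")   # $0.54
```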
Personally, I prefer having a fixed, predictable, price rather than paying for usage. There is something psychologically nicer about it to me, and I find myself rationing my usage more when I am using the API (which is effectively what you describe already, just minus the UI).
I don't think it's that. I think they just want people to onboard onto these things before understanding what the actual cost might be once they're no longer subsidized by megacorps, similar to loss-leading endeavors like Uber and Lyft in the 2010s. I suspect that showing the actual cost of inference would raise questions about the cost-effectiveness of these things for a lot of applications. Internally, Google's data query surfaces tell you cost in terms of SWE time (e.g. this query cost 1 SWE-hour) since the incentives are different.
You're right about their intentions for the future. But right now, they are literally losing money every single time someone uses their product...
In most cases, at least; Claude does for sure. So yeah, for now, they're losing money anyway.
> Why are these companies not upfront about what the limits are?
Most likely because they reserve the right to dynamically alter the limits in response to market demands or infrastructure changes.
See, for instance, the Ghibli craze that dominated ChatGPT a few months ago. At the time OpenAI had no choice but to severely limit image generation quotas, yet today there are fewer constraints.
Per-usage pricing discourages use which limits how critical a service can be to your life or workflow. These companies want you to rely on the service such that you’ll pay the price. One customer might use it once a day and find the price reasonable; another may use it 10 times a day and still find the price reasonable. This kind of broad pricing allows for this variation.
Because if you are transparent about the limits, more people will start to game the limits, which leads to lower limits for everyone – which is a worse outcome for almost everyone.
tldr: We can't have nice things, because we are assholes.
Been using Gemini for a few months, somehow it's gotten much, much worse in that time. Hallucinations are very common, and it will argue with you when you point it out. So, don't have much confidence.
In my experience with chat, Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.
Pro is frustrating because it too often won't search to find current information, and just gives stale results from before its training cutoff. Flash doesn't do this much anymore.
For coding I use Pro in Gemini CLI. It is amazing at coding, but I'm actually using it more to write design docs, decomp multi-week assignments down to daily and hourly tasks, and then feed those docs back to Gemini CLI to have it work through each task sequentially.
With a little structure like this, it can basically write its own context.
I like flash because when it's wrong it's wrong very quickly. You can either change the prompt or just solve the problem yourself. It works well for people who can spot the answer as being "wrong"
> Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.
Same. I think Pro also got worse...
Interesting. Out of all the "thinking models," I struggle with Gemini the most for coding. I just can't make it perform. I feel like they silently nerfed it over the last few months.
I feel the same, but cannot measure the effect in any context benchmark like fiction.livebench.
Are they aggressively quantizing, or are our expectations silently increasing?
Yeah, it's hard to measure. Not sure about our expectations, though I recall way better output when I first started using Gemini 2.5 vs now. It seems to be stupider and more headstrong somehow?
My recent experience with Flash, using it to prototype a C++ header I was developing:
- It was great to brainstorm with, but it routinely introduced edits and dramatic code changes, often unnecessary and many times causing regressions to existing, tested code.
- Numerous times recursion got introduced into revisions without being prompted, or without any justified or good reason.
- It hallucinated a few times regarding C++ type deduction semantics.
I eventually had to explicitly tell it not to introduce edits to any working code being iterated on without first discussing the changes and then being prompted by me to apply them.
All in all, I found base ChatGPT a lot more productive, accurate and ergonomic for iterating (on the same problem, just working it in parallel with Gemini):
- Code changes were not always arbitrarily introduced or dramatic.
- It attempted to always work with the given code rather than extrapolate and mind-read.
- It hallucinated on some things but quickly corrected and moved forward.
- It was a lot more interactive and better at documenting.
- It almost always prompted me first before introducing a change (after providing annotated snippets and documentation as the basis for a proposed change or fix).
However, both were great tools to work with when it came to cleaning up or debugging existing code, especially unit testing or anything related to TDD.
Same here. I stopped using Gemini Pro because, on top of its hard-to-follow verbosity, it was giving contradictory answers. Things that Claude Sonnet 4 could answer.
Speaking of Sonnet, I feel like it's closing the gap to Opus. After the new quotas I started to try it before Opus and now it gets complex things right more often than not. This wasn't my experience just a couple of months ago.
Is the problem mainly with tool use? And are you using it through AI Studio or through the API?
I've found that it hallucinates tool use for tools that aren't available and then gets very confident about the results.
Via the chat prompt mostly, and sometimes via Copilot. It was quoting me sources and links that didn't exist, and when I told it the links were wrong it doubled down forever, no matter how hard I tried to tell it otherwise. Even sent screenshots, etc.
Kinda just got stuck in a self-confident loop that time. Other times the output is just far worse than Claude for similar use cases, where a couple months back it was stronger, at least in my subjective experience.
I can’t even convince Gemini CLI, while planning things, not to go off and make a bunch of random changes on its own. Even after being very clear not to do so, and intercepting to tell it to stop doing that, it just continues on, fucking everything up.
Agents muddy the waters.
Claude Code gets the most out of Anthropic’s models, that’s why people love it.
Conversely, Gemini CLI makes Gemini 2.5 Pro less capable than the model itself actually is.
It’s such a stark difference I’ve given up using Gemini CLI even with it being free, but still use it for situations amenable to a prompt interface on a regular basis. It’s a very strong model.
That's my experience too, when I give Gemini CLI a big, general task and just let it run.
But if I give it structure so it can write its own context, it is truly astonishing.
I'll describe my big, general task and tell it to first read the codebase and then write a detailed requirements document, and not to change any code.
Then I'll tell it to read the codebase and the detailed requirements document it just wrote, and then write a detailed technical spec with API endpoints, params, pseudocode for tricky logic, etc.
Then I'll tell it to read the codebase, and the requirements document it just wrote, and the tech spec it just wrote, and decomp the whole development effort into weekly, daily and hourly tasks to assign to developers and save that in a dev plan document.
Only then is it ready to write code.
And I tell it to read the code base, requirements, tech spec and dev plan, all of which it authored, and implement Phase 1 of the dev plan.
It's not all mechanical and deterministic, or I could just script the whole process. Just like with a team of junior devs, I still need to review each document it writes, tweak things I don't like, or give it a better prompt to reflect my priorities that I forgot to tell it the first time, and have it redo a document from scratch.
But it produces 90% or more of its own context. It ingests all that context that it mostly authored, and then just chugs along for a long time, rarely going off the rails anymore.
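If it helps, here's roughly what that loop looks like when scripted. I'm assuming the gemini CLI's one-shot prompt flag (`-p`) here, and the task description and doc paths are placeholders; in practice you still stop and review between stages rather than letting it run straight through:

```python
# Sketch of the doc-driven loop described above, driving the gemini CLI
# non-interactively. The "-p" one-shot prompt flag is assumed; adjust to
# however your CLI takes a prompt. Each stage reads the docs the previous
# stage wrote, and you review/edit the docs by hand between stages.
import subprocess

def gemini(prompt: str) -> None:
    subprocess.run(["gemini", "-p", prompt], check=True)

TASK = "Add CSV export to the reporting module"   # placeholder task

gemini(f"Read the codebase. Write a detailed requirements doc for: {TASK}. "
       "Save it as docs/requirements.md. Do not change any code.")

gemini("Read the codebase and docs/requirements.md. Write a technical spec "
       "(endpoints, params, pseudocode for tricky logic) to docs/spec.md. "
       "Do not change any code.")

gemini("Read the codebase, docs/requirements.md and docs/spec.md. Break the "
       "work into weekly/daily/hourly tasks and save docs/dev_plan.md.")

# Review and tweak the three docs before this step.
gemini("Read the codebase, docs/requirements.md, docs/spec.md and "
       "docs/dev_plan.md. Implement Phase 1 of the dev plan.")
```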
> If you’re a Google AI Ultra subscriber, you can use Deep Think in the Gemini app today with a fixed set of prompts a day by toggling “Deep Think” in the prompt bar when selecting 2.5 Pro in the model drop down.
If "fixed set" means a fixed number, it would be nice to know how many.
Otherwise, I would like to know what "fixed set" means here.
You get 10 requests per day it seems.
Apparently the model will think for 30+ minutes on a given prompt. So it seems it's more for research or dense multi-faceted problems than for general coding or writing fan fic.
I think at this point it's fair to say that I've switched models more than Leonardo DiCaprio.
Asking AI to create 3D scenes like the example on the page seems like asking someone to hammer something with a screwdriver. We would need AI-compatible 3D software that either has easier-to-use voxels built in, so it can create something similar to pixel art, or easier math-defined curves that can be meshed. Either way, AI just does not currently have the right tools to generate 3D scenes.
One of the most alluring things about LLMs though is that it is like having a screwdriver that can just about work like a hammer, and draft an email to your landlord and so on
I'm wondering if 'slow AI' like this is a temporary bridge, or a whole new category we need to get used to. Is the future really about having these specialized 'deep thinkers' alongside our fast, everyday models? Or is this just a clunky V1 until the main models get this powerful on their own in seconds?
It's not unreasonable to think that with improvements on the software side - a Saturn-like model based on diffusion could be this powerful within a decade - with 1s responses.
I'd highly doubt in 10 years, people are waiting 30m for answers of this quality - either due to the software side, the hardware side, and/or scaling.
It's possible in 10 years, the cost you pay is still comparable, but I doubt the time will be 30m.
It's also possible that there's still top-tier models like this that use absurd amounts of resources (by today's standards) and take 30m - but they'd likely be at a much higher quality than today's.
The pressure in the other direction is tool use. The more a model wants to call out to a series of tools, the longer the delay will be, just because the serial process isn't part of the model.
We’re optimizing for quality over performance right now; at some point the pendulum will swing the other way. But there might be problems that require deep thinking, just like we have a need for supercomputers to run jobs for days today.
Wait...
So if someone is cool enough, they could actually give us a DeepThought model?
Please, let that happen.
Vendor-DeepThought-42B maybe?
> could actually give us a DeepThought model
Yes, but the response time is terrible. 7.5 million years
Qwen-DeepInThot-69b-Instruct is what I’m looking forward to.
This comes at a time when my experience with Gemini is lacking; it seems to be getting worse. It's not picking up on my intention, sometimes replies in the wrong language, etc. Either that or I am just transparent that it's a tool and its feelings are hurt. I've had to call it a moron several times, and it was funny when it started reprimanding me for my foul language once. But it was wrong. This behavior seems new. I could never trust it to not do random edits everywhere in a document, so nowadays I use it to check Claude, which can be trusted with a document.
I had a good experience with gemini-cli (which I think uses pro initially?). It's not very good, but it's very fast. So when it's wrong it's wrong very quickly and you can either solve it yourself or pivot your prompt. For a professional software engineer this actually works out OK.
I would be interested in reading about how people who are paying for access to Google's top AI plan are intending to use this. Do you have any examples of immediate use-cases that might benefit?
Is Google using this tool internally? One would expect them to give some examples of how it's helping internal teams accelerate or solve more challenging problems, if they were eating their own dogfood.
I'm guessing most Ultra users are there for Veo 3, where there are monetary benefits if your 3s videos go viral on TikTok/Reels/Shorts.
I've been using Gemini 2.5 Pro for a few months now and have found my experience to be very positive. I primarily use it for coding and through the API, and I feel it's consistently improving. I haven't yet tried Deep Think, though.
At the moment, Deep Think is only available with the ULTRA subscription ($250 per month).
Is it available in the EU? Can someone confirm?
It’s not available through the API?
Can someone with an ultra subscription ask it how many rs are in the word strawberry? And report back how long it takes to answer?
I used to be enthusiastic about Gemini-2.5-Pro but now I can't even get it to do decent file-diff summaries for a PR commit.
Upgraded and quickly hit my limit. And I find that they do have limits; I just wish that they were more transparent, even if it's just a vague statement about limited usage. I assumed it would be similar to regular Gemini 2.5 on the Pro plan, but it's not.
I wonder how long it will take before someone starts investigating the relation between SEO and Google's AI crawler.
Is this a joke? I am paying $250/month because I want to use Deep Think and I literally just burned through my daily quota in 5 PROMPTS. That is insane!
If anyone wants a free-to-use research agent, try Rex. I've been building it for a couple of months: https://projectrex.onrender.com/
They missed an opportunity to name it Deep Thought.
It would have made possible a whole class of jokes:
- I wonder how many iterations we need with it before succeeding.
- I ran it for 5 minutes and the Deep Thought model started to hallucinate; I hope not because of oxygen deprivation...
Available via API?
Not yet. https://x.com/OfficialLoganK/status/1951260803459338394 said "Should we put it in the Gemini API next?"
Going to twitter to find info is exactly what’s wrong with Google
Is it something like tree of thought reasoning?
These AI names are getting ridiculous
Interesting they compared Gemini 2.5 Deep Think on code to all the top models EXCEPT Claude, the best top model at code.
Grok 4 Heavy, o3-pro and Gemini Deep Think are all equivalent classes of product. I wonder how they compare?
> Can AI suck my balls?
Not quite yet. But if you live long enough an AI robot will grant you these and other similar wishes.
Your comments were being flagged by users, for what should be obvious reasons.
139.99€/month for something you can't even test first, lol
You can't go anywhere without having Gemini shoved in your face. I had an immediate visceral reaction to this.
I’m never the one to defend AI, but what do you mean? Is it the “AI overview” that pops up on Google? Other than that, I would say Gemini is definitely less in your face than ChatGPT for example
My company uses google workspace and every google doc, spreadsheet, calendar, online meeting and search puts nonstop callouts and messages about using Gemini. It's gotten so bad that I'm about to try building a browser extension to block that bullshit. It clutters the UI and nags. If I wanted that crap, I'd turn it on.
The latest Samsung updates (and Pixel too I imagine?) baked it into the OS and made it difficult/impossible to disable. It’s probably that. I also agree, aside from that I haven’t seen anything about Gemini at all, I think their marketing is quite poor for something so important.