Comment by foundry27
1 day ago
I started doing some experimentation with this new Deep Think agent, and after five prompts I reached my daily usage limit. For $250 USD/mo, that's what you'll be getting, folks.
It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy. Anecdotally (from my experience) this was the one feature that enthusiasts in the AI community were interested in to justify the exorbitant price of Google’s Ultra subscription. I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
Performance-wise, so far I couldn't even tell. I provided it with a challenging organizational problem that my business was facing, with the relevant context, and it proposed a lucid and well-thought-out solution that was consistent with our internal discussions on the matter. But o3 came to an equally effective conclusion for a fraction of the cost, even if its report was less "cohesive". I guess I'll have to wait until tomorrow to learn more.
The model might not have been ready/optimized for production, but they still wanted to release it before the Aug 2 EU AI Act deadline; this way they have 2 years for compliance. So the strategy of aggressively rate-limiting a few users makes sense.
wheee, great way to lock in incumbents even more or lock out the EU from startups
Welcome to the lovely world of regulation, enjoy your stay.
Several years ago I thought a good litmus test for mastery of coding is not being able to find a solution using internet search, nor getting a well-written question about an esoteric coding problem answered on StackOverflow. For a while, I would post a question and then answer my own question once I solved the problem, for posterity (or for the AI bots). I always loved getting the "I've been working on this for 3 days and you saved my life" comments.
I've been working on a challenging problem all this week and all the AI copilot models have been worthless at helping me. Mastery in coding is being alone, when neither other people nor AI copilots can help you and you have to dig deep into generalization, synthesis, and creativity.
(I thought to myself, at least it will be a little while longer before I'm replaced with AI coding agents.)
Your post misses the fact that 99% of programming is repetitive plumbing and that the overwhelming majority of developers, even Ivy League graduates, suck at coding and problem solving.
Thus, AI is a great productivity tool for the overwhelming majority of problems out there, if you know how to use it. And it's a boost even for those who aren't good at the craft.
This whole narrative of "okay, but it can't replace me in this or that situation" honestly sits somewhere between an obvious touché (why would you think AI would replace rather than empower those who know their craft?) and stale Luddism.
> 99% of programming is repetitive plumbing
Even IF that were true (and I'd argue that it is NOT, and it's people who believe that and act that way who produce the tangled messes of spiderweb code that are utterly opaque to public searches and AI analysis -- the supposed "1%"), if even as low as 1% of the code I interacted with was the kind of code that required really deep thought and analysis, it could easily balloon to take up as much time as the other "99%".
Oh, and Ned Ludd was right, by the way. Weavers WERE replaced by the powered loom. It is in the interest of capital to replace you if they are able to, not to complement you, and furthermore, the teeth of capital have gotten sharper over time, and its appetite more voracious.
I've started to come to the conclusion that only greenfield projects consist of repetitive plumbing. Legacy software is like plumbing if all the pipes were tied into a knot. The edge cases, ambiguous naming, hacky solutions, etc. all make for a miserable experience, both for humans and AIs.
Curious to know what those challenging programming problems are. Can you share some examples?
They're remarkably useless on stuff they've seen but not had up-weighted in the training set. Even the best ones (Opus 4 running hot, Qwen and K2 will surprise you fairly often) are a net liability in some obscure thing.
Probably the starkest example of this is build system stuff: it's really obvious which ones have seen a bunch of `nixpkgs`, and even the best ones seem to really struggle with Bazel and sometimes CMake!
The absolute prestige high-end ones, running flat out and burning 100+ dollars a day, are a lift over pre-SEO Google/SO, I think... but it's not a blowout vs. a working search index. Back when all the source, all the docs, and all the troubleshooting for any topic on the whole Internet were above the fold on Google? It was kinda like this: type a question in the magic box and working-ish code pops out. Same at a glory-days FAANG with the internal mega-grep.
I think there's a whole cohort or two who think that "type in the magic box and code comes out" is new. It's not new, we just didn't have it for 5-10 years.
Yeah, I mean the interpolation part is new, but boy do I miss pre-enshittification Google!
I have similar issues with support from companies that heavily push AI and self-serve models and make human support hard. I'm very accomplished and highly capable. If I feel the need to turn to support, the chances the solution is in a KB are very slim; same with AI. It'll be a very specific situation with a very specific need.
There are a lot of internal KBs companies keep to themselves in their ticketing systems - it would be interesting to estimate how much good data is in there that could in the future be used to train more advanced (or maybe more niche or specific) AI models.
This has been my thought for a long time - unless there is some breakthrough in AI algorithms, I feel like we are going to hit a "creativity wall" for coding (and some other tasks).
Any reason to think that the wall will be under the human level?
> It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy.
In my experience Grok 4 and 4 Heavy have been crap. Who cares how many requests you get with it when the response is terrible. Worst LLM money I’ve spent this year and I’ve spent a lot.
It's interesting how multi-dimensional LLM capabilities have proven to be.
OpenAI reasoning models (o1-pro, o3, o3-pro) have been the strongest, in my experience, at harder problems, like finding race conditions in intricate concurrency code, yet they still lag behind even the initial sonnet 3.5 release for writing basic usable code.
The OpenAI models are kind of like CS grads who can solve complex math problems but can't write a decent React component without yadda-yadda-ing half of it, while the Anthropic models will crank out many files of decent, reasonably usable code while frequently missing subtleties and forgetting the bigger picture.
Those may have been the exact people creating training material for OpenAI…
It's just wildly inconsistent to me. Sometimes it'll produce a work of genius. Other times, total garbage.
Unfortunately we are still in the prompt optimization stage: garbage in, garbage out.
> I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
I agree that’s not a good posture, but it is entirely unsurprising.
Google is probably not profiting from AI Ultra customers either, and grabbing all that sweet usage data from the free tier of AI Studio is what matters most to improve their models.
Giving free access to the best models allows Google to capture market share among the most demanding users, which are precisely the ones that will be charged more in the future. In a certain sense, it’s a great way for Google to use its huge idle server capacity nowadays.
I'm burning well over 10 million tokens a day on the free tier. 99% of the input is freely available data, the rest is useless. I never provided any feedback. Sure, there is some telemetry; they can have it.
I doubt I'm an isolated case. This Gemini gig will cost Google a lot; they pushed it on all Android phones around the globe. I can't wait to see what happens when they have to admit that not many people will pay over 20 bucks for "AI", and I would pay well over 20 bucks just to see the face of the C-suite next year when someone dares to explain in simple terms that there is absolutely no way to recoup the DC investment and that powering the whole thing will cost the company 10 times that.
It's not particularly interesting if Deep Think comes to the same (correct) conclusion on a single problem as o3 but costs more. You could ask GPT-3.5 and GPT-4 what 1+1 equals and would get the same response, with GPT-4 costing more, but this doesn't tell us much about model capability or value.
It would be more interesting to know if it can handle problems that o3 can't do, or if it is 'correct' more often than o3 pro on these sort of problems.
i.e. if o3 is correct 90% of the time, but Deep Think is correct 91% of the time on challenging organisational problems, it will be worth paying $250 for an extra 1% certainty (assuming the problem is high-value / high-risk enough).
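A quick back-of-the-envelope version of that trade-off, with made-up numbers (the dollar figures and decision counts below are purely illustrative, not from the comment):

```python
# Back-of-the-envelope: when is +1% accuracy worth $250/month?
# All numbers are illustrative assumptions.
cost_of_wrong_decision = 100_000   # dollars at stake per high-risk decision
decisions_per_month = 4
accuracy_gain = 0.01               # 91% correct vs 90% correct

expected_savings = cost_of_wrong_decision * decisions_per_month * accuracy_gain
print(expected_savings)            # 4000.0 -> easily clears a $250 subscription
```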
> It would be more interesting to know if it can handle problems that o3 can't do
Suppose it can't. How will you know? All the datapoints will be "not particularly interesting".
> Suppose it can't. How will you know?
By finding and testing problems that o3 can't do on Deep Think, and also testing the reverse? Or by large benchmarks comparing a whole suite of questions with known answers.
Problems that both get correct will be easy to find and don't say much about comparative performance. That's why some of the benchmarks listed in the article (e.g. Humanity's Last Exam / AIME 2025) are potentially more insightful than one person's report on testing one question (which they don't provide) where both models replied with the same answer.
I'm not in the AI sceptic camp (LLMs can be useful for some tasks, and I use them often), but this is the big issue at the moment.
In order for agentic AI to replace (for example) a software engineer, we need a big step up in capability, around an order of magnitude. These chain of thought models do get a bit closer to that, although in my opinion we're still a way away.
However, at the same time we need about an order of magnitude decrease in price. These models are expensive even at the current price tokens are sold at, which seems to be below the actual cost. And these massive CoT models are taking us in completely the wrong direction in terms of cost.
It could be that your problem was too simple to justify the use of Deep Think.
But yes, Google should have figured that out and used a less expensive mode of reasoning.
"I'm sorry but that wasn't a very interesting question you just asked. I'll spare you the credit and have a cheaper model answer that for you for free. Come back when you have something actually challenging."
Actually, why not? Recognizing problem complexity as a first step is really crucial for such expensive "experts". Humans do the same.
And a question for the knowledgeable: does a simple/stupid question cost more in resources than a complex problem, in terms of power consumption?
I know this is a joke, but I have been able to lower my costs by routing my prompts to a smaller model first, to determine whether I need to send them to a larger model or not.
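For anyone wanting to try the same thing, here's a minimal sketch of that two-stage routing, assuming an OpenAI-style chat API; the model names, triage prompt, and EASY/HARD labels are placeholders of mine, not the commenter's actual setup:

```python
# Minimal two-stage routing sketch: a cheap model triages the prompt, and only
# prompts it flags as HARD get escalated to the expensive model.
# Model names and the triage prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

TRIAGE_SYSTEM = (
    "You are a router. Reply with exactly one word: EASY if the request can be "
    "handled by a small general-purpose model, HARD if it needs deep reasoning, "
    "long context, or multi-step planning."
)

def answer(prompt: str) -> str:
    # Cheap call first: classify difficulty with a small model.
    triage = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for "a smaller model"
        messages=[
            {"role": "system", "content": TRIAGE_SYSTEM},
            {"role": "user", "content": prompt},
        ],
        max_tokens=4,
        temperature=0,
    )
    hard = triage.choices[0].message.content.strip().upper().startswith("HARD")

    # Escalate to the expensive model only when the triage call says so.
    final = client.chat.completions.create(
        model="o3" if hard else "gpt-4o-mini",  # placeholder names
        messages=[{"role": "user", "content": prompt}],
    )
    return final.choices[0].message.content
```

The catch raised elsewhere in this thread still applies: the triage model has to be smart enough to recognize what it can't handle.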
“This meeting could’ve been an email”
Model routing is deceptively hard though. It has halting problem characteristics: often only the smartest model is smart enough to accurately determine a task's difficulty. And if you need the smartest model to reliably classify the prompt, it's cheaper to just let it handle the prompt directly.
This is why model pickers persist despite no one liking them.
Yes but prompt evaluation is far faster than inference as it can be done (mostly) in parallel, so I don't think that's true.
But if the weaker model has a low false-positive rate, you can just route requests through the models in order of strength.
Interestingly Gemini CLI has a very generous free quota. Is Google's strategy just overpricing some stuff and subsidizing the underpriced stuff?
This is the fundamental pricing strategy of all modern software in fact.
Underpriced for consumers, overpriced for businesses.
It doesn't, it's not "1000 Gemini Pro" requests for free, Google misled everyone. It's 1000 Gemini requests, Flash included. You get like 5-7 Gemini Pro requests before you get limited.
I'm getting 100 Gemini Pro requests per day with an AI Studio API key that doesn't have billing enabled.
After that it's bumped down to Flash, which is surprisingly effective in Gemini CLI.
If I need Pro, I just swap in an API key from an account with billing enabled, but usually 100 requests is enough for a day of work.
I've found the free version swaps away from Pro incredibly fast. Our company has Gemini but can't even get that - we were being asked to do everything by API key.
I suspect that the main goal here was to grab the top spot in a bunch of benchmarks, and being counted as an "available" model.
They're using it as a major inducement to upgrade to AI Ultra. I mean, the image and video stuff is neat, but adds no value for the vast majority of AI subscribers, so right now this is the most notable benefit of paying 12x more.
FWIW, Google seems to be having some severe issues with oddball, perhaps malfunctioning quota systems. I regularly find that extraordinarily little use of gemini-cli hits the purported 1000-request limit, when in reality I've made fewer than 10 requests.
I faced the exact same problem with the API. It seems that it doesn't throttle early enough, then may accumulate the cool-off period, making it impossible to determine when to fire requests again.
Also, I noticed Gemini (even Flash) has Google Search support, but only via the web UI or the native mobile app. Via the API, that would require SERP access via an MCP of some sort, even with Gemini Pro.
Oh, and some models are regularly facing outages. 503s are not uncommon. No SLA page, no alerts, nothing.
The reasoning feature is buggy: even if disabled, it sometimes triggers anyway.
It occurred to me the other day that Google probably has the best engineers, given how well Gemini performs and where it's coming from, and the context window that is uniquely large compared to any other model. But it is likely operated by managers coming from AWS, where shipping half-baked, barely tested software was all it took to get a bonus.
I'd be interested in tests involving tasks with large amounts of context. Parallel thinking could conceivably be useful for a variety of specific problem types. Having more context than any single chain of thought can reasonably attend to might be one of them.
I have Ultra. Will not be renewing it. Useless; at least have global limits and let people decide how they want to use them. If I have tokens left, why can't I use them for code assist?
What was the experimentation? Can you share with us so we can see how "bizarrely uncompetitive" it is?
"Bizarrely uncompetitive" refers to the 5 uses per day, not the performance itself.
The part I cannot understand is why for many AI offerings, I cannot make out what each pricing tier does with a quick glance.
What happened to the simplicity of Steve Jobs' 2x2 (consumer vs. pro, laptop vs. desktop)?
Similar complaints are happening all over reddit with the Claude Code $200/mo plan and Cursor. The companies with deep VC funding have been subsidizing usage for a year now, but we're starting to see that bleed off.
I think the primary concern of this industry right now is how, relative to the current latest generation models, we simultaneously need intelligence to increase, cost to decrease, effective context windows to increase, and token bandwidths to increase. All four of these things are real bottlenecks to unlocking the "next level" of these tools for software engineering usage.
Google isn't going to make billions on solving advanced math exams.
Agreed, and big context windows are key to mass adoption in wider use cases beyond chatbots (random ex: in knowledge management apps, being able to parse the entire note library/section and hook it into global AI search), but those use cases are decidedly not areas where $200 per month subscriptions can work.
I'll hazard to say that cost and context windows are the two key metrics to bridge that chasm with acceptable results... As for software engineering, though, that cohort will be demanding on all fronts for the foreseeable future, especially because there's a bit of a competitive element. Nobody wants to be the vibecoder using sub-par tools compared to everyone else showing off their GitHub results and making sexy blog posts about it on HN.
Outside of code, the current RAG strategy is to throw shit tons of unstructured text, found using vector search, at the model. Some companies are doing better, but the default RAG pipelines are... kind of garbage.
For example, a chatbot doing recipe work should have a RAG DB that, by default, returns entire recipes. A vector DB is actually not the solution here; any number of traditional DBs (relational or even a document store) would work fine. Sure, do a vector search across the recipe texts, but then fetch the entire recipe from someplace else (roughly the pattern sketched at the end of this comment). Current RAG solutions can do this, but the majority of RAG deployments I have seen don't bother; they just abuse large context windows.
Which looks like it works, except what you actually have in your context window is 15 different recipes all stitched together. Or if you put an entire recipe book into the context (which is perfectly doable nowadays!), you'll end up with the chatbot mixing up ingredients and proportions between recipes, because you just voluntarily polluted its context with irrelevant info.
Large context windows allow for sloppy practices that end up making for worse results. Kind of like when we decided web servers needed 16 cores and gigs of RAM to run IBM Websphere back in the early 2000s, to serve up mostly static pages. The availability of massive servers taught bad habits (huge complicated XML deployment and configuration files, oodles of processes communicating with each other to serve a single page, etc).
Meanwhile, in the modern world I've run mission-critical, high-throughput services for giant companies on a k8s cluster consisting of 3 machines, each with 0.25 CPU and a couple hundred megs of RAM allocated.
Sometimes more is worse.
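To make the recipe example concrete, here's a rough sketch of that "rank with vectors, return whole documents" pattern; the embedding model, the in-memory stores, and the recipe data are assumptions of mine, not anything the comment specifies:

```python
# Sketch: the vector index is only used to RANK recipes; the context handed to
# the chatbot is one full recipe pulled from an ordinary store by id, rather
# than a pile of stitched-together chunks.
# Embedding model name and data are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Ordinary document store: recipe id -> complete recipe text.
RECIPES = {
    "pancakes": "Pancakes\nIngredients: 2 eggs, 250 g flour, 300 ml milk...\nSteps: ...",
    "omelette": "Omelette\nIngredients: 3 eggs, butter, salt...\nSteps: ...",
}

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Build the index once, over the full recipe texts.
IDS = list(RECIPES)
INDEX = embed([RECIPES[i] for i in IDS])

def retrieve(query: str) -> str:
    q = embed([query])[0]
    # Cosine similarity against every recipe, then return the single best
    # recipe in its entirety instead of fragments of many recipes.
    sims = (INDEX @ q) / (np.linalg.norm(INDEX, axis=1) * np.linalg.norm(q))
    return RECIPES[IDS[int(np.argmax(sims))]]

context = retrieve("how do I make fluffy pancakes?")  # one coherent recipe
```

At scale you'd swap the dict and the brute-force cosine scan for a real document store and vector index, but the shape stays the same: search on embeddings, then hand the model whole documents.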
Big, coherent context windows are key to almost all use cases. The whole house-of-cards RAG implementations most platforms are using right now are pretty bad. You start asking around about how to implement RAG and you realize: no one knows, the architecture and outcomes at every company are pretty bad, and the most common words you hear are "yeah, it pretty much works ok I guess".
> Similar complaints are happening all over reddit with the Claude Code $200/mo
I would imagine 95% of people never get anywhere near hitting their CC usage limits. The people who are getting rate-limited have ten windows open, are auto-accepting edits, and are YOLO'ing away any kind of coherent code quality in their codebase.
it turns out that AI at this level is very expensive to run (capex, energy). my bet is that AI itself won't figure out how to overcome these constraints and reach escape velocity.
Perhaps this will be the incentive to finally get fusion working. Big tech megacorps are flush with cash and could fund this research many times over at current rates. E.g. NIF is several billion dollars; Google alone has almost $100B in the bank.
Mainframes are the only viable way to build computers. Microprocessors will never figure out how to get small and fast enough for personal computers to reach escape velocity.
Why do you think the analogy holds?
> it turns out that AI at this level is very expensive to run (capex, energy)
If it's CapEx it's -- by definition -- not a cost to run. Energy costs will trend to zero.
Why will energy costs trend to zero?
Our minds are incredibly energy efficient, which leads me to believe it is possible to figure out, but it might be a human rather than an AI that gives us something more akin to a biological solution.
This could fix my main gripe with The Matrix. ”Humans are used as batteries” always felt off, but it totally would make sense if the human brains have uniquely energy efficient pattern matching abilities that an emerging AI organism would harvest. That would also strengthen the spiritual humanist subtext.
Uncompetitive how? On what task and eval?
Gemini is consistently the only model that can reason over long context in dynamic domains for me. Deep Think just did that reviewing an insane amount of Claude Code logs - for a meta analysis task of the underlying implementation. Laughable to think Grok could do that.
The rate limits are not because of compute performance or the lack of. It's to stop people from training their own models on the very cutting edge.