Comment by amazingamazing
15 hours ago
Gemma 4 in my view is good enough to do things similar to Gemini 2.5 Flash, meaning if I point it at code and ask for help and there is a problem with the code, it'll answer correctly in terms of suggestions, but it's not great at using all tools or one-shotting things that require a lot of context or "expert knowledge".
If a couple more iterations down the line, say Gemma 6, is as good as current Opus and runs completely locally on a Mac, I won't really bother with the cloud models.
That’s a problem.
For the others anyway.
I agree. At first I was really turned off by the Gemma 4 line of models because they didn’t function with coding agents as well as the qwen3.5 line of models. However, I found that for other use cases Gemma 4 was very good.
EDIT: I just saw this: "Ollama 0.20.6 is here with improved Gemma 4 tool calling!" I will rerun my tests after breakfast.
similar vibes as "640k ought to be enough for anybody"
I think the difference is that with LLMs, in a lot of cases you do see some diminishing returns.
I won't deny that the latest Claude models are fantastic at just one shotting loads of problems. But we have an internal proxy to a load of models running on Vertex AI and I accidentally started using Opus/Sonnet 4 instead of 4.6. I genuinely didn't know until I checked my configuration.
AI models will get to the point where, for 99% of problems, something like Gemma is gonna work great for people. Pair it up with an agentic harness on the device that lets it open apps and click buttons and we're done.
I still can't fathom that we're in 2026 in the AI boom and I still can't ask Gemini to turn shuffle mode on in Spotify. I don't think model intelligence is as much of an issue as people think it is.
100% agree here. The actual practical bottleneck is harness and agentic abilities for most tasks.
It's the biggest thing that stuck out to me using local AI with open source projects vs Claude's client. The model itself is good enough, I think - Gemma 4 would be fine if it could be used with a harness as capable as Claude's.
And that's gonna stay locked down, unfortunately, especially on mobile and in cars - it needs access to APIs to do that stuff, and not just regular APIs that were built for traditional invocation.
The same way that websites are getting llms.txt files, I think APIs will also evolve.
Agree on the diminishing returns; the Opus 4.6 anecdote is a good signal.
I think security is the issue: AI is good at circumventing it. For example, AI can read paywalled articles you cannot. Do you really want AI to have 'free range'?
I mean, to me even the difference between Opus and Sonnet is as clear as night and day, and even Opus and the best GPT model. Opus 4.6 just seems much more reliable in terms of me asking it to do something and that actually happening.
Well, you can do a lot with 640k…if you try. We have 16 GB in base machines and very few people know how to try anymore.
The world has moved on, that code-golf time is now spent on ad algorithms or whatever.
Escaping the constraint delivered a different future than anticipated.
> you can do a lot with 640k…if you try.
It is economically not viable to try anymore.
"XYZ Corp" won't allow their developers to write their desktop app in Rust because they want to consume only 16 MB of RAM, then another implementation for mobile with Swift and/or Kotlin, when they can release a good enough solution with React + Electron consuming 4 GB of RAM and reuse components with React Native.
Especially if the 640k are "in your hand" and the rest is "in the cloud"
The simple fact is that a 16 GB RAM stick costs much less than the development time to make the app run on less.
People get hung up on bad optimization. If you are working at a sufficiently large scale, yes, thinking about bytes might be a good use of your time.
But most likely, it's not. At a system level we don't want people to do that. It's a waste of resources. Making a virtue out of it is bad, unless you care more about bytes than humans.
Look at the whole history of computing. How many times has the pendulum swung from thin to fat clients and back?
I don't think it's even mildly controversial to say that there will be an inflection point where local models get Good Enough and this iteration of the pendulum shall swing to fat clients again.
Assuming improvements in LLMs follow a sigmoid curve, even if the cloud models are always slightly ahead in terms of raw performance it won't make much of a difference to most people, most of the time.
The local models have their own advantages (privacy, no as-a-service dependency) that, for many people and orgs, will offset a small performance advantage. And, of course, you can always fall back on the cloud models should you hit something particularly chewy.
(All IMO - we're all just guessing. For example, good marketing or an as-yet-undiscovered network effect of cloud LLMs might distort this landscape).
More than "a 3 year old laptop is fine"
My thinkpad is nearly 10 years old, I upgraded it to 32GB of ram and have replaced the battery a couple of times, but it's absolutely fine apart from that.
If AI which was leading edge in 2023 can run on a 2026 laptop, then presumably AI which is leading edge in 2026 will run on a 2029 laptop. Given that 2023 was world-changing, that capacity is now on today's laptop.
Either AI grows exponentially, in which case it doesn't matter because all work will be done by AI by 2035, or it plateaus in, say, 2032, in which case by 2035 those models will run on a typical laptop.
The economy is, more or less, a competition.
If someone gets a really great axe and is happy with it, that's great for them.
But then, other people will be on bulldozers.
They can say they are happy with the axe, but then they are not in the competition at that point.
I think the article was wondering how many billion-dollar bulldozers the world needs. My local hardware store sells a variety of axes. I myself am a happy axe user. I even replace them.
> it’s not great at using all tools
Glad it wasn't just me - I was impressed with the quality of Gemma 4 - it just couldn't write the changes to a file 9/10 times when using it with opencode.
https://huggingface.co/google/gemma-4-31B-it/commit/e51e7dcd...
There was an update to tool calling 3 days ago. I haven't tested it myself but hope it helps.
Wow, that is so much better! I didn't exactly test it extensively, but my issues are gone.
Hmm... is there an updated ONNX?
> it just couldn't write the changes to a file 9/10 times when using it with opencode
You might want to give this a try, it dramatically improves Edit tool accuracy without changing the model: https://blog.can.ac/2026/02/12/the-harness-problem/
Yep, and to be honest we don't really need local models for intensive tasks, at least not yet. You can use OpenRouter (and others) to consume a wide variety of open models which are capable of using tools in an agentic workflow, close to the SOTA models. These open models are essentially commodities - many providers, each serving the same model and competing with each other on uptime, throughput, and price. At some point we will be able to run them on commodity hardware, but for now the fact that we can have competition between providers is enough to ensure that rug pulls aren't possible.
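As a rough sketch of what that looks like in practice (OpenRouter exposes an OpenAI-compatible endpoint; the model ID and the tool here are just illustrative, not real recommendations):

```python
# Minimal sketch of "open models as commodities": the same OpenAI-compatible
# call goes to OpenRouter, which routes to whichever provider serves the model.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key="sk-or-...",                      # your OpenRouter key
)

# One made-up tool so the model can act agentically instead of just chatting.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the local workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen/qwen3.5-coder",  # placeholder model ID, swap providers freely
    messages=[{"role": "user", "content": "Open main.py and summarize it."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

Swapping providers (or models) is just a change to base_url/model; nothing else in the workflow has to move.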
Plus, having Gemma on my device for general chat ensures I will always have a privacy-respecting offline oracle which fulfils all of the non-programming tasks I could ever want. We are already at the point where the moat for these hyperscalers has basically dissolved for the general public's use case.
If I was OpenAI or Anthropic I would be shitting my pants right now and trying every unethical dark pattern in the book to lock in my customers. And they are trying hard. It won't work. And I won't shed a single tear for them.
Local models seem somewhere between 9 and 24 months behind. I'm not saying I won't be impressed with what online models will be able to do in two years, but I'm pretty satisfied with the prediction that I won't really need them in a couple of years.
We still aren't going to be putting 200 GB of RAM in a phone in a couple of years to run those local models.
HBF (High Bandwidth Flash) is coming fast, with the first examples expected to be sampling to users this year.
Flash memory can be optimized to be as fast as, and more energy-efficient than, DRAM at large linear reads; there was just little demand before, because doing so costs you ~half of your density and doesn't improve your writes at all. All the flash memory manufacturers realized that this is a huge opportunity for model weights and are now chasing it.
Or in other words, after the initial price peak stabilizes in a few years, it will be reasonable to put ~500GB of weights into a device for ~$100 in memory costs.
> We still aren't going to be putting 200 GB of RAM in a phone in a couple of years to run those local models.
You can already buy an iPhone with 2 TB of storage. The CPU, GPU and Neural Engine all share the same pool of RAM and the SSD is directly connected to all of this. You won’t need 200 GB of RAM to run local models when you essentially have 500 GB of virtual memory.
That amount of RAM won't be necessary. Gemma 4 and comparably sized Qwen 3.5 models are already better than the very best, biggest frontier models were just 12-18 months ago. Now in an 18-36 GB footprint, depending on quantization.
We don't need 200 GB of RAM on a phone to run big models. Just 200 GB of storage, thanks to Apple's "LLM in a flash" research.
See: https://x.com/danveloper/status/2034353876753592372
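The gist, as a minimal sketch (not Apple's actual implementation; the file name, dtype, and shape are made up): keep the weights on flash and let the OS page in only the slices a layer actually touches.

```python
# Demand-page weights from storage instead of loading them all into RAM.
import mmap
import numpy as np

with open("model_weights.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def load_layer(offset_bytes: int, rows: int, cols: int) -> np.ndarray:
    # No bulk copy: only the pages this view actually reads get faulted in.
    flat = np.frombuffer(mm, dtype=np.float16, count=rows * cols, offset=offset_bytes)
    return flat.reshape(rows, cols)

w0 = load_layer(0, 4096, 4096)  # ~32 MB of the file, paged in lazily
```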
A lot of people are making the mistake of noticing that local models have been 12-24 months behind SotA ones for a good portion of the last couple years, and then drawing a dotted line assuming that continues to hold.
It simply... doesn't. The SotA models are enormous now, and there's no free lunch on compression/quantization here.
Opus 4.6 capabilities are not coming to your (even 64-128 GB) laptop or phone in the popular architecture that current LLMs use.
Now, that doesn't mean that a much narrower-scoped model with very impressive results can't be delivered. But that narrower model won't have the same breadth of knowledge, and TBD if it's possible to get the quality/outcomes seen with these models without that broad "world" knowledge.
It also doesn't preclude a new architecture or other breakthrough. I'm simply stating it doesn't happen with the current way of building these.
edit: forgot to mention the notion of ASIC-style models on a chip. I haven't been following this closely, but last I saw the power requirements are too steep for a mobile device.
> if I point it at code and ask for help and there is a problem with the code it'll answer correctly in terms of suggestions
Could I ask how you do that? I installed openclaw and set it to use Gemma 4, but it didn't act in agent mode at all; it only responded in the chat window while doing nothing, and didn't read any files or do any of the things you describe (though I see you do mention that it's not great at using all tools). What are you using exactly?
I had the same issues. I had to tell it to use subagents explicitly, and instead of saying "set a cron," say "set an openclaw cron."
I generally do like the model; it's not a great agent, though.
It’s good for summarization tasks, small tool use, and has pretty good world knowledge, though it does hallucinate.
But that difference, at the moment, is the difference between it being OK on its own with a team of subagents given good enough feedback/review mechanisms, and having to babysit it prompt by prompt.
By the time Gemma 6 allows you to do the above, the proprietary models supposedly will already be on the next step change. It just depends on whether you need to ride the bleeding edge, but especially because it's "intelligence", there's an obvious advantage in using the best version, and it's easy to hype it up and generate FOMO.
> But that difference, at the moment, is the difference between it being OK on its own with a team of subagents given good enough feedback
Do people actually build meaningful things like that?
It's basically impossible to leave any AI agent unsupervised, even with an amazing harness (which is incredibly hard to build). The code slowly rots and drifts over time if not fully reviewed and refactored constantly.
Even if teams of agents working almost fully autonomously were reliable from a functional perspective (they would build a functional product), the end product would have ever increasing chaos structurally over time.
I'd be happy to be proven wrong.
When that happens, you'll have FOMO from not using Opus 5.x. The numbers that they showed for Mythos show that the frontier is still steadily moving (and maybe even at a faster pace than before).
I would be surprised by that behavior even for 10% of people doing real, AI-usable work. Very few people buy a new motherboard or CPU or graphics card every 3 months.
Even now just because the latest Anthropic is super great doesn't mean people are not using other models. Not everyone is subscribed to only the best.
There is a cognitive ceiling for what you can do with smaller models. Animals with simpler neural pathways often outperform what we think they are capable of, but there's no substitute for scale. I don't think you'll ever get a 4B or 8B model equivalent to Opus 4.6. Maybe just for coding tasks, but certainly not Opus' breadth.
The only thing that we are sure can't be highly compressed is knowledge, because you can only fit so much information in a given entropy budget without losing fidelity.
The minimal size limits of reasoning abilities are not clear at all. It could be that you don't need all that many parameters. In which case the door is open for small focused models to converge to parity with larger models in reasoning ability.
If that happens we may end up with people using small local models most of the time, and only calling out to large models when they actually need the extra knowledge.
> and only calling out to large models when they actually need the extra knowledge
When would you want a lossy encoding of lots of data bundled together with your reasoning? If it is true that reasoning can be done efficiently with fewer parameters, it seems like you would always want it operating ordinary search and retrieval tools to access knowledge rather than risking hallucination.
And re: this discussion of large data centers versus local models, do recall that we already know it's possible to make a pretty darn clever reasoning model that's small and portable and made out of meat.
I think you underestimate the amount of knowledge needed to deal with the complexities of language in general as opposed to specific applications. We had algorithms to do complex mathematical reasoning before we had LLMs, the drawback being that they require input in restricted formal languages. Removing that restriction is what LLMs brought to the table.
Once the difficult problem of figuring out what the input is supposed to mean was somewhat solved, bolting on reasoning was easy in comparison. It basically fell out with just a bit of prompting, "let's think step by step."
If you want to remove that knowledge to shrink the model, we're back to contorting our input into a restricted language to get the output we want, i.e. programming.
Except you don't want knowledge in the model, and most of that "size" comes from "encoded knowledge", i.e. overfitting. The goal should be to have only language handling in the model, and the knowledge in a database you can actually update, analyze, etc. It's just really hard to do.
"World models" (for cars) maybe make sense for self-driving, but they are also just a crude workaround to have a physics simulation push understanding of physics. Though unlike most topics, basic physics tends not to change randomly and is based on observation of reality, so it probably can work.
Law, health advice, programming stuff, etc., on the other hand, change all the time and are all based on what humans wrote about them. Which in some areas (e.g. law or health) is very commonly outdated, wrong, or at least incomplete in a dangerous way. And programming changes all the time.
Having this separation of language processing and knowledge sources is... hard; language is messy and often interleaved with information.
But this is most likely achievable with smaller models. Actually, it might even be easier with a small model. (Though whether the necessary knowledge bases can fit and run on a Mac is another topic...)
And this should be the goal of AI companies, as it's the only long-term sustainable approach as far as I can tell.
I say should because it may not be: if they solve it that way and someone manages to clone their success, then they lose all their moat for specialized areas, as people can create knowledge bases for those areas with know-how OpenAI simply doesn't have access to. (Which would be a preferable outcome, as it means actual competition and a potentially fair, working market.)
As a concrete outdated case:
The TLS key exchange group X25519MLKEM768 is recommended to be enabled on servers which support it.
Last time I checked, AI didn't even list it when asked which key exchange groups a TLS 1.3 server should enable (though it has been widely supported since even before it was fully standardized).
This isn't surprising, as most input sources AI can use for training are outdated and also don't list it.
Maybe someone at OpenAI will spot this and feed it explicitly into the next training cycle, or people will cover it more and through that it gets fed in implicitly.
But what about all the niche but important information with just a handful of outdated Stack Overflow posts or similar? (Which are unlikely to get updated now that everyone uses AI instead...)
The current "let's just train bigger models with more encoded data" approach just doesn't work. It can get you quite far, but then it hits a ceiling. And trying to fix it by also giving the model additional knowledge "it can ask for if it doesn't know" has so far not worked, because it reliably fails to realize it doesn't know when it has enough outdated/incomplete/wrong information encoded in the model. Only by ensuring it doesn't have any specialized domain knowledge can you make sure that approach works, IMHO.
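Roughly what I mean, as a sketch (the table name, schema, and the local-model call are all made up for illustration; the point is that the knowledge lives in an updatable store, not in the weights):

```python
# "Language in the model, knowledge in a database you can update."
import sqlite3

db = sqlite3.connect("knowledge.db")
db.execute("CREATE TABLE IF NOT EXISTS facts (topic TEXT, fact TEXT)")
# Fixing the X25519MLKEM768 gap is an INSERT/UPDATE, not a retraining run.
db.execute("INSERT INTO facts VALUES (?, ?)",
           ("TLS 1.3 key exchange", "Enable X25519MLKEM768 on servers that support it"))

def lookup(topic: str) -> list[str]:
    rows = db.execute("SELECT fact FROM facts WHERE topic LIKE ?",
                      (f"%{topic}%",)).fetchall()
    return [r[0] for r in rows]

facts = lookup("TLS 1.3")
prompt = ("Answer using ONLY the facts below; say 'unknown' if they don't cover it.\n"
          + "\n".join(facts)
          + "\nQuestion: Which key exchange groups should I enable?")
# reply = local_model(prompt)  # however you run Gemma locally (placeholder)
```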
I think you are underestimating the strength a small model can get from tool use. There may be no substitute for scale, but that scale can live outside of the model and be queried using tools.
In the worst case a smaller model could use a tool that involves a bigger model to do something.
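Something like this, as a rough sketch (OpenAI-style tool schema; the model name is a placeholder, not a real ID):

```python
# Expose the big cloud model to the small local model as just another tool,
# called only when the local model decides it's out of its depth.
from openai import OpenAI

cloud = OpenAI()  # whichever frontier provider; key taken from the environment

escalate_tool = {
    "type": "function",
    "function": {
        "name": "ask_big_model",
        "description": "Escalate a question the local model can't answer on its own",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
}

def ask_big_model(question: str) -> str:
    resp = cloud.chat.completions.create(
        model="frontier-model-placeholder",  # not a real model ID
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```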
Small models are bad at tool use. I have a Liquid AI model doing it in the browser, but it's super fragile.