Comment by daxfohl

4 days ago

I just wonder how long it'll take local models to be good enough for 99% of use cases. It seems like it has to happen sooner or later.

My hunch is that in five years we'll look back and see current OpenAI as something like a 1970s VAX system. Once PCs could do most of what a VAX could, nobody wanted one anymore. I have a hard time imagining that all the big players today will survive that shift. (And if that particular shift doesn't materialize, it's so early in the game that some other equally disruptive thing will.)

In my experience with Gemini, most of its capabilities stem from web searching instead of something it has already "learned." Even if you could obtain the model weights and run them locally, the quality of the output would likely drop significantly without that live data.

To really have local LLMs become "good enough for 99% of use cases," we are essentially dependent on Google's blessing to provide APIs for our local models. I don't think they have any interest in doing so.

  • I agree 100%. Often when I use increasingly powerful local models (qwen3.5:32b I love you) I mix in web search using search APIs from Brave, Perplexity, and DuckDuckGo summaries. Of course this requires that I use local models via small Python or Lisp scripts I write. I pay for the Lumo+ private chat service and it has excellent integrated search, like Gemini or ChatGPT.

    EDIT: I have also experimented with creating a local search index for the common tech web sites I get information from - this is a pain in the ass to maintain, but offers very low latency to add search context for local model use. This is most useful with very small and fast local models so the whole experience is low latency.
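
A local search index like the one described above can be tiny. Here is a toy inverted-index sketch in Python (illustrative only; a real setup might use SQLite FTS5 or a vector store, and all names here are made up):

```python
# Toy inverted index for a local cache of scraped tech docs.
import re
from collections import defaultdict

class LocalIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}                     # doc id -> raw text

    def add(self, doc_id, text):
        """Index a document by its lowercase alphanumeric terms."""
        self.docs[doc_id] = text
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return docs containing every query term (AND semantics)."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.postings[t] for t in terms))
        return [self.docs[d] for d in sorted(hits)]
```

Hits from `search()` can then be pasted straight into a small model's context, which is where the low-latency payoff comes from.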

    • Interesting idea on the local search index! It occurs to me that running something that passively saves down content that I browse and things that AI turns up while it does its own searches, plus a little agent to curate/expand/enrich/update the index could be super handy. I imagine once it had docs on the stuff I use most frequently that even a small model would feel quite smart.

      2 replies →

  • That's totally not my experience. The AI component (as opposed to the knowledge component) is really what makes these models useful, and you could add search as a tool. Of course for that you'll be dependent on a search provider, that's true.

    • You don't get the AI component without the knowledge component. The AI needs approximate knowledge of lots of things to conceptualize what you're talking about and use search tools effectively.

      The set of things it needs approximate knowledge over grows slowly but noticeably over time.

      2 replies →

  • This is actually so ironic. Corporations spent fortunes designing cool websites, but what people really want is structured, easy-to-read information in the context they choose.

    So the flow is: you type a search query into Gemini, Gemini runs a Google search, scans a few results, visits selected websites, checks whether anything is relevant, and then composes it into something structured, readable, and easy to ingest.

    It's almost like going back to browsing forums in the 90s, but this time Gemini generates the equivalent of the forum posts on the fly.

    • A long time ago (in AI time) Karpathy used the analogy that LLMs are like compression algorithms. I can see that now: when I ask an LLM a question, it's basically giving me back the whole internet compressed to the scope of my question.

  • Unless you can provide a (community) curated list of sources to search through (e.g. using MCP). Then I think local models may become really competitive.

Taking the opposite side of that bet, here is why:

* Even if an open-weight model exceeding SOTA appeared on Hugging Face today, given my extensive experience with a wide variety of model sizes, I would find it highly surprising if "99% of use cases" could fit in a <100B-parameter model.

* Meanwhile: I asked Claude to look into consumer GPU VRAM growth rates. Median consumer VRAM went from 1-2 GB in 2015 to ~8 GB in 2026, roughly doubling every 5 years; the top end isn't much better, just ahead by about two cycles.

* Putting aside current RAM sourcing issues, it seems very unlikely that even high-end prosumers will routinely have >100 GB of VRAM (i.e., the ability to run a quantized SOTA ~100B model) before ~2035-2040.
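
For what it's worth, the doubling arithmetic above can be sanity-checked directly (the starting points and the five-year doubling period are this comment's own assumptions):

```python
import math

def year_reaching(start_gb, start_year, target_gb, doubling_years=5):
    """Year when capacity first reaches target_gb, doubling every doubling_years."""
    doublings = math.ceil(math.log2(target_gb / start_gb))
    return start_year + doublings * doubling_years

median_year = year_reaching(8, 2026, 100)    # median consumer: ~8 GB in 2026
top_end_year = year_reaching(32, 2026, 100)  # top end, ~2 cycles ahead: ~32 GB
print(median_year, top_end_year)  # 2046 2036
```

The ~2036 figure for the top end lands inside the ~2035-2040 window claimed above; the median consumer doesn't get there until the mid-2040s on this trend.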

  • Even with inflated RAM prices, you can buy a Strix Halo Mini PC with 128GB unified memory right now for less than 2k. It will run gpt-oss-120b (59 GB) at an acceptable 45+ tokens per second: https://github.com/lhl/strix-halo-testing?tab=readme-ov-file...

    I also believe that it should eventually be possible to train a model with somewhat persistent mixture of experts, so you only have to load different experts every few tokens. This will enable streaming experts from NVMe SSDs, so you can run state of the art models at interactive speeds with very little VRAM as long as they fit on your disk.
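
The "somewhat persistent experts" idea above boils down to an LRU cache keyed by expert ID: repeated routing decisions stay resident in fast memory, and only genuinely new experts get streamed from disk. A toy sketch (purely illustrative; no real MoE runtime works this literally):

```python
# Keep a small LRU cache of "expert" weight blobs in memory and fault
# the rest in from disk only when the router selects them.
import collections
import os
import tempfile

class ExpertStore:
    def __init__(self, expert_dir, cache_size=2):
        self.expert_dir = expert_dir
        self.cache_size = cache_size
        self.cache = collections.OrderedDict()  # expert_id -> weight bytes
        self.disk_loads = 0                     # count of simulated NVMe fetches

    def get(self, expert_id):
        if expert_id in self.cache:             # hot path: already resident
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        self.disk_loads += 1                    # cold path: stream from disk
        path = os.path.join(self.expert_dir, f"{expert_id}.bin")
        with open(path, "rb") as f:
            weights = f.read()
        self.cache[expert_id] = weights
        if len(self.cache) > self.cache_size:   # evict least-recently-used
            self.cache.popitem(last=False)
        return weights
```

The win only materializes if routing is "sticky" enough that the same few experts recur for many consecutive tokens; with uniformly random routing this degenerates into constant disk traffic.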

    • I agree the parent is a bit too pessimistic, especially because we care about logical skills and context size more than remembering random factoids.

      But on a tangent, why do you believe in mixture of experts?

      Everything I know about them makes me believe they're an architectural dead end.

      2 replies →

  • There will be companies producing ICs for cheap models, like Taalas or Axelera.ai today. These models will not be as good as SOTA models individually, but because they are so fast, a multi-agent setup with internet/database connectivity can match SOTA quality, at least for the general public.

  • The GPU makers have been purposely stunting VRAM growth for years to not undercut their enterprise offerings.

  • Yeah, but effective GPU RAM has ramped up thanks to unified memory on Apple machines. The five-year doubling no longer holds.

  • I agree, but I'm holding out hope that ASICs, unified RAM, and/or enterprise to consumer trickle-down will outpace consumer GPU VRAM growth rates.

  • Increasing model size doesn't make your model smarter, it just makes it know more facts.

    There are easier ways to do that.

The trend with email, websites and so on has been to use some large cloud service rather than self host as it's easier. My bet is AI will be similar.

  • You can turn a local model on and off as needed, and it will still function as expected. If you turn off your self-hosted server, you don't get email.

    With self-hosted email, you need persistent infrastructure and domain knowledge to leverage it. With a local model, you just click a button and tell it what to do.

    With email, there is a real burden worth outsourcing. A local model is just there, the way Chrome/Edge/Safari is just there; there is no burden.

  • But AI is not about connectivity. Local models are just about as useful without an internet connection. Also, the hardware can fit in a small enclosure.

5 years is a bit optimistic. I have no desire to use anything dumber than Claude - but I doubt I'll need something much smarter either - or with so much niche knowledge baked in. The harness will take care of much. Faster would be nicer though.

That still requires a pretty large chip, and those will be selling at an insane premium for at least a few more years before a real consumer product can try their hand at it.

  • Coding, via something like Claude or Codex, will likely always be something best done by hosted cloud models simply because the bar there can always be higher. But it's already entirely possible to run local models for chat and research and basic document creation that can compete perfectly fine with the cloud models from 6 months to a year ago. The limitation at this point is just the cost of RAM.

    This week's release of the new smaller Qwen 3.5 models was interesting. I ran a 4-bit quant of the 122B model on my NVIDIA Spark, and it's... pretty damn smart. The smaller models can be run at 8 bits on machines at very reasonable speeds. And they're not stupid. They're smarter than "ChatGPT" was a year or so ago.

    AMD Strix Halo machines with 128GB of RAM can already be bought off the shelf for not-insane prices that can run these just fine. Same with M-series Macs.

    Once the supply shocks make their way through the system I could see a scenario where it's possible that every consumer Mac or Windows install just comes with a 30B param or even higher model onboard that is smart enough for basic conversation and assistance, and is equipped with good tool use skills.

    I just don't see a moat for OpenAI or Anthropic beyond specialized applications (like software development, CAD, etc). For long-tail consumer things? I don't see it.

    • Even for coding. I mean, there's what, maybe a few thousand common useful technologies, algorithms, and design patterns? A million uncommon ones? I think all that could fit in a local model at some point.

      Especially if, for example, Amazon ever develops an AWS-specific model that only needs to know AWS tech and maybe even picks a single language to support, or maybe a different model for each language, etc. Maybe that could end up being tiny and super fast.

      I mean, most of what we do is simple CRUD wrappers. Sometimes I think humans in the loop cause more problems than we solve, overindexing on clever abstractions that end up mismatching the next feature, painting ourselves into fragile designs they can't fix due to backward compatibility, using dozens of unnecessary AWS features just for the buzz, etc. Sometimes a single monolith with a few long functions with a million branches is really all you need.

      Or, if there's ever a model architecture that allows some kind of plugin functionality (like LoRA but more composable; like Skills but better), that'd immediately take over. You get a generic coding skeleton LLM and add the plugins for whatever tech you have in your stack. I'm still holding out for that as the end game.

  • Yeah, post-Moore's Law anyway. But there could also be real breakthroughs in model architecture. Maybe something replaces the transformer with better-than-quadratic scaling, or MoE lets smaller models and agent farms compete, or, who knows....

I hope you're right, but is there any guarantee that there will continue to be institutions willing to spend the money to produce open models?

I almost wonder if we need some sort of co-op for training and another for hosted inference.

  • There doesn't seem to be any sign that Chinese companies will stop producing open models to destroy the American moat.

    Given that a lot of the R&D in China is state sponsored that also seems to be a good pawn in US-China relations.

    • Eventually there'll be some kind of standard for licensing that's required of LLM runtimes, like software and digital media. Of course people will figure out workarounds, but just like pirated software, half of it will be infested with malware so most people will just pay for the license.

I think a large portion of people won't take "good enough" if better is available for cheaper.

Datacenters simply scale better than home servers on cost and performance.

So it only really works for people who value local highly, which isn't most people.

  • Why would we assume the remote providers are going to be cheaper? They are burning cash, and Claude is already jacking up prices.

    "Local" is the means to an end, not the value prop itself. The value prop is "fast, private, and free", which I think is going to be very compelling.

> I just wonder how long it'll take local models to be good enough for 99% of use cases.

Qwen 2.5 was already there. "99% of use cases" isn't a very high bar right now.

Yesterday I asked mistral to list five mammals that don't have "e" in their name. Number three was "otter" and number five was "camel".

phi4-mini-reasoning took the same prompt and bailed out because (at least according to its trace) it interpreted it as meaning "can't have a, e, i, o, or u in the name".

Local is the only inference paradigm I'm interested in, but these things have a way to go.
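
For what it's worth, the check these models flubbed is one line of deterministic code, which is why it reads as a tool-use gap rather than a knowledge gap:

```python
# A few mammal names; "otter" and "camel" are the wrong answers from above.
candidates = ["otter", "camel", "lynx", "fox", "skunk", "bison", "capybara"]
no_e = [name for name in candidates if "e" not in name]
print(no_e)  # ['lynx', 'fox', 'skunk', 'bison', 'capybara']
```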

  • I don't really see the problem here. Yeah, we know that these models are not good for actual logic. These models are lossy data compression and most-likely-responses-from-internet-forums-and-articles machines.

    These kinds of parlor tricks aren't interesting, and whether a model can list animals with or without some letter in their names doesn't mean much, especially since the model doesn't "think" in English; it just gives you the answer after translating it into English.

    These are funny, like how you can do weird stuff in JavaScript by combining special characters, but that doesn't really mean anything in the grand scheme of things. Like JavaScript, these models, despite their specific flaws, still continue to deliver value to the people using them.

    • You don't see the problem with a multi-billion-dollar project being unable to give a correct answer to a trivial question? This tech is supposed to revolutionize business, increase productivity to unfathomable levels, and automate all our dull, boring tasks so we can focus on interesting things! Where have you been for the past 4 years?

      2 replies →

    • Is this parlour trick so different from useful tasks like “implement this feature while following the naming conventions of my project”?

      3 replies →

  • Models will always struggle with this specific task without tool use because of the way they tokenize text. I think a bit of prompt engineering solves it: ask the model to spell out each word, or give it the ability to run a "contains e" Python function on the animal names it generates or searches for.

    Lots of local AI use cases are, I think, solvable similarly once local models get good at tool use and have the proper harness.

    • The problem with tool use is that I usually find I only need it for one component of a pipeline. So in this case mentally I would be tooling it as

      cat /usr/share/dict/words | print_if_mammal | grep -v 'e'

      but I don't know of a good way to incorporate an LLM into a pipeline like that (I know there's a Python API). What I'm actually interested in is "is this the name of a mammal?" but I don't know of the equivalent of a quiet "batch mode" at least for ollama (and of course performance).

      I guess ultimately I would want to say "write a shell utility that accepts a line from standard input and prints it to standard output if that is the name of a mammal", and then use that utility in that pipeline. Or really to have an llmfilter utility that lets you do something like

      cat /usr/share/dict/words | llmfilter "is this a mammal?" | grep -v "e"

      and now that I've said that I think I'll try to make one.

      1 reply →
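
An `llmfilter` like the one sketched above fits in a short script, assuming an Ollama-style local HTTP API (the endpoint, model name, and prompt wrapper below are assumptions to adapt to whatever runtime you use):

```python
"""Sketch of an llmfilter utility: keep stdin lines a local LLM says yes to."""
import json
import urllib.request

def ask_ollama(question, model="qwen2.5:7b"):
    """Ask a local Ollama server a question; return its raw text answer."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model,
                         "prompt": question + " Answer only yes or no.",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def llmfilter(lines, question, ask=ask_ollama):
    """Yield each line for which the model's answer starts with 'yes'."""
    for line in lines:
        item = line.strip()
        if item and ask(f"{question} {item!r}").strip().lower().startswith("yes"):
            yield line

# CLI wiring would be roughly:
#   for kept in llmfilter(sys.stdin, sys.argv[1]):
#       sys.stdout.write(kept)
# so that: cat /usr/share/dict/words | llmfilter "is this a mammal?" | grep -v e
```

Passing `ask` as a parameter keeps the model call swappable, which also makes the filter trivial to dry-run without a server.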

Convenience trumps everything, including privacy and security.

Telling the average person they have to install their own model is a deal breaker at the outset.

As for 99% of capabilities being on device, battery life makes it a non-starter.

My conspiracy theory is that OpenAI saw the writing on the wall, and the massive GPU commitment was in part to starve the market and delay this inevitability.