Comment by Gigachad

12 hours ago

We still aren't going to be putting 200 GB of RAM in a phone in a couple of years to run those local models.

HBF (High Bandwidth Flash) is coming fast, with the first parts expected to sample to customers this year.

Flash memory can be optimized to match DRAM's speed at large linear reads while using less energy; there was just little demand for that before, because doing so costs you ~half of your density and doesn't improve your writes at all. All the flash memory manufacturers have realized this is a huge opportunity for model weights and are now chasing it.

In other words, once the initial price premium settles in a few years, it will be reasonable to put ~500 GB of weights into a device for ~$100 in memory costs.

That amount of RAM won't be necessary. Gemma 4 and comparably sized Qwen 3.5 models are already better than the very best, biggest frontier models were just 12-18 months ago, now in an 18-36 GB footprint depending on quantization.
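The footprint-vs-quantization relationship is just arithmetic: bytes scale linearly with bits per weight. A rough sketch, using an illustrative 27B parameter count (not a published figure for either model family):

```python
# Rough memory footprint of model weights at different quantization levels.
# The 27B parameter count is illustrative, not an official spec.
params = 27e9

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gb:.1f} GB")
```

So the same weights span roughly a 4x range in size, which is why quoted footprints come as a range rather than a single number.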

We don’t need 200 GB of RAM on a phone to run big models. Just 200 GB of storage, thanks to Apple’s “LLM in a flash” research.

See: https://x.com/danveloper/status/2034353876753592372
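The core idea behind keeping weights in flash is that a forward pass only needs a slice of the weights in RAM at any moment. A minimal sketch of that access pattern using a memory-mapped file; the flat weight-file layout and `load_layer` helper are hypothetical illustrations, not the paper's actual implementation:

```python
import mmap

import numpy as np

# Build a tiny stand-in "weight file" (two fp16 layers) so the sketch runs.
layers = {"layer0": np.arange(8, dtype=np.float16),
          "layer1": np.ones(8, dtype=np.float16)}
offsets = {}  # name -> (byte offset, byte length)
with open("weights.bin", "wb") as f:
    for name, w in layers.items():
        offsets[name] = (f.tell(), w.nbytes)
        f.write(w.tobytes())

# mmap the file: pages are faulted in from storage only when touched,
# so RAM holds just the weights the current pass actually reads.
f = open("weights.bin", "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def load_layer(name):
    off, size = offsets[name]
    return np.frombuffer(mm, dtype=np.float16, count=size // 2, offset=off)

print(load_layer("layer1"))
```

The real system adds sparsity prediction and bundled reads on top of this, but demand-paged weights are the starting point.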

• Yes, I agree that this is the right solution: for a locally hosted model I value the quality of the output more than the speed with which it is produced, so I prefer models as they were originally trained, without further quantization.

  While that paper praises Apple's advantage in SSD speed, which allows decent inference performance with huge models, SSD speeds equal to or greater than that can nowadays be achieved in any desktop PC with two PCIe 5.0 SSDs, or even one PCIe 5.0 and one PCIe 4.0 SSD.

  Because I had independently reached the same conclusion, as I presume many others have, a week ago I started modifying llama.cpp to make optimal use of weights stored on SSDs, while batching many tasks so that they share each pass over the weights. I expect more projects in this direction in the coming months, making local hosting of very large models easier and more widespread, and allowing people to avoid the risks of external providers, like the recent enshittification of Claude Code.
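  The economics of that batching idea are easy to estimate. A back-of-the-envelope sketch, assuming illustrative figures (~14 GB/s sequential reads per PCIe 5.0 x4 drive, two drives striped, and one full streaming pass over the weights per generated token):

```python
# Back-of-the-envelope: tokens/s when weights stream from SSD each pass.
# All figures are assumptions, not benchmarks.
weights_gb = 500      # checkpoint resident on SSD
read_gbps = 2 * 14    # two PCIe 5.0 x4 drives striped, ~14 GB/s each

pass_seconds = weights_gb / read_gbps   # one full streaming pass
for batch in (1, 8, 64):
    tok_s = batch / pass_seconds        # batched requests share each pass
    print(f"batch={batch}: ~{tok_s:.2f} tokens/s aggregate")
```

  A single stream is painfully slow under these assumptions, but because every request in a batch rides the same streaming pass, aggregate throughput scales almost linearly with batch size until compute becomes the bottleneck.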

A lot of people are making the mistake of noticing that local models have been 12-24 months behind SotA ones for a good portion of the last couple of years, and then drawing a dotted line assuming that continues to hold.

It simply.. doesn't. The SotA models are enormous now, and there's no free lunch on compression/quantization here.

Opus 4.6 capabilities are not coming to your laptop or phone (even one with 64-128 GB of RAM) under the popular architecture that current LLMs use.

Now, that doesn't mean that a much narrower-scoped model with very impressive results can't be delivered. But that narrower model won't have the same breadth of knowledge, and TBD if it's possible to get the quality/outcomes seen with these models without that broad "world" knowledge.

It also doesn't preclude a new architecture or other breakthrough. I'm simply stating it doesn't happen with the current way of building these.

edit: forgot to mention the notion of ASIC-style models on a chip. I haven't been following this closely, but last I saw the power requirements were too steep for a mobile device.

  • Don’t underestimate the march of technology. Just look at your phone, it has more FLOPS than there were in the entire world 40 years ago.

• And I think it's very likely that, with improved methods, you could get Opus 4.6-level performance on a wristwatch in a few years.

  You needed a supercomputer to win at chess, until you didn't.

  Currently, local models' performance at natural language is far better than anything that ran on a supercomputer cluster just a few years ago.

• Yeah, but that's the current state of the art after decades of aggressive optimization; there's no foreseeable future where we'll be able to cram several orders of magnitude more RAM into a phone.

• Would the model even need that breadth of knowledge? Humans just look things up in books or on Wikipedia, which you can store on a plain old HDD, not in VRAM. All books ever written fit into about 60 TB if you OCR them, and the useful information in them into probably a lot less; that's well within the range of consumer technology.

• The gap between SOTA models and open/local models continues to shrink as SOTA sees diminishing returns on scaling (which seems to be the main way they are "improving"), while local models are making real jumps. I'm actually more optimistic that local models will catch up completely than that SOTA will take any great leaps forward.

• Pretty sure there are at least a couple of orders of magnitude to be gained in purely algorithmic areas of LLM inference; maybe training too, though I'm less confident there. Rationale: meat computers run on 20 W, though pretraining took a billion years or so.

• There's been plenty of free lunch in shrinking models so far with regard to capability vs. parameter count.

    Contradicting that trend takes more than "It simply.. doesn't."

  There's plenty of room for RAM sizes to double, along with bus speed; growth idled for a long time because there was limited need for more.