A 4090 has 24GB of VRAM, allowing you to run a 22B model entirely in memory at FP8, or a 24B model at Q6_K (~19GB).
A 5090 has 32GB of VRAM, allowing you to run a 32B model in memory at Q6_K.
You can run larger models by splitting the layers between VRAM and system RAM. That is slower, but still viable.
This means you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with 3B active parameters, so you really only need enough VRAM to hold the 3B active parameters, so you could run it on a 3090 (with the rest of the weights offloaded to RAM).
The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig.
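For a rough sanity check of those numbers: weight memory is roughly parameter count times bits per weight. A minimal sketch (the bits-per-weight values are approximations, and KV cache / context overhead is ignored and can easily add several more GB at long contexts):

```python
# Back-of-envelope weight sizing only; KV cache and runtime overhead are ignored.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a given parameter count and quantization."""
    return params_billions * bits_per_weight / 8  # (1e9 params * bits / 8) bytes -> GB

for name, params, bpw in [
    ("22B @ FP8",  22, 8.0),   # ~22 GB -> just fits in a 4090's 24 GB
    ("24B @ Q6_K", 24, 6.6),   # ~20 GB, close to the ~19 GB quoted above
    ("32B @ Q6_K", 32, 6.6),   # ~26 GB -> fits in a 5090's 32 GB
]:
    print(f"{name}: ~{weight_gb(params, bpw):.1f} GB")
```

For the split case, llama.cpp exposes this directly: `-ngl` / `--n-gpu-layers` (or `n_gpu_layers` in llama-cpp-python) controls how many layers sit in VRAM while the rest stay in system RAM.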
That's out of touch for 90% of developers worldwide
Today. But what about in 5 years? Would you bet we will be paying hundreds of billions to OpenAI yearly or buying consumer GPUs? I know what I will be doing.
Honestly though, how many people reading this do you think have that setup, vs. the ~85% of us on a MBx?
> The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Reminds me of running Doom when I had to hack config.sys to free up 640KB of memory.
Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
> Reminds me of running Doom when I had to hack config.sys to free up 640KB of memory.
The good old days of having to do crazy nutty things to get Elite II: Frontier, Magic Carpet, Worms, Xcom: UFO Enemy Unknown, Syndicate et cetera to actually run on my PC :-)
>I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
As long as it's within the terms and conditions of whatever agreement you made for that $20. I can run queries on my own inference setup from remote locations too.
Yes, but they are much less performant than Claude Code or Codex. I really cried with the 20-25GB models (30B Qwen, Devstral, etc.). They really don't hold a candle; I didn't think the gap was this large, or maybe Claude Code and GPT perform much better than I imagined.
You need to leave much more room for context if you want to do useful work beyond entertainment. Luckily there are _several_ PCIe slots on a motherboard. New Nvidia cards at retail (or above) are not the only choice for building a cluster; I threw a pile of Intel Battlemage cards at it and got away with ~30% of the Nvidia cost for the same capacity (setup was _not_ easy in early 2025, though).
You can gain a lot of performance by using the quantization technique that best fits your setup (imatrix/IQ, AWQ, etc.); different llama.cpp builds perform differently from each other, and very differently compared to something like vLLM.
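To make the vLLM comparison concrete, here is a minimal sketch (assumptions: the AWQ checkpoint name is a placeholder, and two cards are available for tensor parallelism), not a tuned setup:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,   # split the weights across 2 cards
    max_model_len=32768,      # cap context so the KV cache still fits in VRAM
)

out = llm.generate(
    ["Write a Python function that parses an ISO-8601 date."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(out[0].outputs[0].text)
```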
How much context do you get with 2GB of leftover VRAM on Nvidia GPU?
You need a couple of RTX 6000 Pros to come close to matching cloud capability.
Most people I know can't afford to leak insider business information to third-party SaaS providers, so it's unfortunately not really an option.
But… they do, all the time. Almost everybody uses some mix of Office, Slack, Notion, random email providers, random “security” solutions, etc. The exception is the other way around. The only thing preventing info from leaking is the ToS, and there are options for that even with LLMs. Nothing has changed in that regard.
In my personal experience, it's very common for big companies to host email, messengers, and conferencing software on their own servers.
All of those things are hosted on-prem in the bigger orgs I have worked in.
This is a poor take imo. It depends on the industry, but the world's businesses run on the shoulders of companies like Microsoft and heavily use OneDrive/SharePoint. Most entities, even those with sensitive information, are legally comfortable with that arrangement. Using an LLM does not change much so long as the MSA is similar.
> It depends on the industry, but the world's businesses run on the shoulders of companies like Microsoft and heavily use OneDrive/SharePoint
I am sure MS employees need to tell themselves that to sleep well. The statement itself doesn't seem to hold much epistemological value above that though.
The more recent LLMs work fine on an M1 Mac. Can't speak for Windows/Linux.
There was even a recent release of Granite4 that runs on a Raspberry Pi.
https://github.com/Jewelzufo/granitepi-4-nano
For my local work I use Ollama (M4 Max, 128GB):
- gpt-oss, 20B or 120B, depending on the complexity of the use case.
- granite4 for speed and lower-complexity tasks (around the same as gpt-oss 20B).
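Roughly what that switching looks like with the ollama Python client (a sketch; the model tags are assumptions about what has been pulled locally, e.g. with `ollama pull gpt-oss:20b`):

```python
import ollama

def local_chat(prompt: str, complexity: str = "low") -> str:
    # Pick a local model by task complexity, mirroring the workflow above.
    model = {
        "low": "granite4",        # fast, lower-complexity tasks
        "medium": "gpt-oss:20b",  # general coding questions
        "high": "gpt-oss:120b",   # heavier reasoning (needs lots of unified memory)
    }[complexity]
    resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

print(local_chat("Summarize what a Mixture of Experts model is.", "medium"))
```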
Agreed, GPU is the expensive route, especially when I was looking at external GPU solutions.
Using Qwen3:32b on a 32GB M1 Pro may not be "close to cloud capabilities" but it is more than powerful enough for me, and most importantly, local and private.
As a bonus, running Asahi Linux feels like I own my Personal Computer once again.
I agree with you (I have a 32GB M2 Pro), and I like to mix local models running with Ollama and LM Studio with gemini-cli (I used to also occasionally use Codex, but I just cancelled my $20/month OpenAI subscription - I like their products but I don't like their business model, so I lose out on that option now).
Running smaller models on Apple Silicon is kinder on the environment/energy use and has privacy benefits for corporate use.
Using a hybrid approach makes sense for many use cases. Everyone gets to make their own decisions; for me, I like to factor in externalities like social benefit, the environment, and wanting the economy to do as well as it can in our new post-monopolar world.
Isn't the point that you don't need SOTA capabilities all the time?