Comment by segmondy

1 day ago

This is a good time to promote running your own models. I have been running my own models locally, and I would wager a local model will meet 85-95% of your needs if you really learn to use it. These models have gotten great. For anyone wanting to get into this, the smartest consumer-friendly models to run were just released: check out Qwen3.5, the 27B and 35B variants. They are small, and I recommend running full Q8 quants. The easiest way to run these without dealing with a complex GPU setup is to get a Mac. For the example I gave, a 64GB Mac will handle it well. If you are really cash strapped, you can manage with 32GB but will have to run lower-precision quants. If you are not cash strapped, get at least 128GB and, if possible, 256GB. The models are so good you will regret not getting a better system. You can join the r/LocalLlama community on Reddit to learn more. But this is pretty easy. Grab llama.cpp, grab a GGUF quant from huggingface.co - the unsloth quants are great - https://huggingface.co/unsloth/models
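
To sanity-check the RAM advice above, here is a rough sketch of how much memory the weights of a quantized model need. The bits-per-weight values and the 8 GB overhead (OS, KV cache, runtime) are assumptions for illustration, not exact GGUF file sizes, and the helper names are hypothetical.

```python
# Rough sketch: estimate memory needed for a quantized model's weights.
# ASSUMPTION: bits-per-weight values are approximations, not exact GGUF sizes.
APPROX_BITS_PER_WEIGHT = {"Q8": 8.5, "Q6": 6.6, "Q4": 4.8}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9


def fits(params_billion: float, quant: str, ram_gb: float,
         overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed overhead fit in the given RAM."""
    return weight_gb(params_billion, quant) + overhead_gb <= ram_gb


for size in (27, 35):
    for quant in ("Q8", "Q4"):
        print(f"{size}B {quant}: ~{weight_gb(size, quant):.1f} GB, "
              f"fits 32GB: {fits(size, quant, 32)}, "
              f"fits 64GB: {fits(size, quant, 64)}")
```

Under these assumptions both models fit at Q8 in 64GB, while a 32GB machine needs something closer to Q4, which matches the recommendation above.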

68 comments


0xbadcafebee 1 day ago

For non-Mac users:

A laptop with an iGPU and loads of system RAM has the advantage of being able to use system RAM in addition to VRAM to load models (assuming your GPU driver supports it, which most do afaik), so load up as much system RAM as you can. The downside is that system RAM is slower than dedicated GDDR5. Examples of such iGPUs are the Radeon 890M and Intel Arc (previous generations are still decently good, if that's more affordable for you).

A laptop with a discrete GPU will not be able to load models as large directly to GPU, but with layer offloading and a quantized MoE model, you can still get quite fast performance with modern low-to-medium-sized models.

Do not get less than 32GB RAM for any machine, and max out the iGPU machine's RAM. Also try to get a bigass NVMe drive as you will likely be downloading a lot of big models, and should be using a VM with Docker containers, so all that adds up to steal away quite a bit of drive space.

Final thought: before you spend thousands on a machine, consider that there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now. Do the math before you purchase a machine; unless you are doing 24/7/365 inference, the cloud is vastly more cost effective.
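
"Do the math" can be as simple as a break-even calculation. The sketch below is a minimal version; the function name and all dollar figures are hypothetical placeholders, so plug in your own hardware price, power bill and cloud spend.

```python
def breakeven_months(hardware_usd: float, power_usd_per_month: float,
                     cloud_usd_per_month: float):
    """Months until buying hardware beats paying for cloud inference.

    Returns None when local running costs are not lower than the cloud bill,
    in which case the hardware never pays for itself.
    """
    monthly_saving = cloud_usd_per_month - power_usd_per_month
    if monthly_saving <= 0:
        return None
    return hardware_usd / monthly_saving


# ASSUMPTION: illustrative numbers only, not quotes from any provider.
print(breakeven_months(3000, 10, 20))    # light cloud usage
print(breakeven_months(3000, 10, 200))   # heavy cloud usage
print(breakeven_months(3000, 30, 20))    # power alone costs more than cloud
```

With light usage the payback period runs to decades, which is the point being made above; only heavy, sustained usage shortens it to something reasonable.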

bjackman 1 day ago
> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now.
Oh yeah, seems obvious now you said it, but this is a great point.
I'm constantly thinking "I need to get into local models but I dread spending all that time and money without having any idea if the end result would be useful".
But obviously the answer is to start playing with open models in the cloud!
- JKCalhoun 1 day ago
  
  I agree but I still have that itch to have my own local model—so it's not always about cost. A hobby?
  (Besides, a hopped-up Mac would never go to waste in my home if it turns out the local LLM thing was not worth the cost.)
- spwa4 1 day ago
  
  Well they are doing that because of the nature of matrix multiplication. Specifically, LLM costs scale with the square of the length of a single input, let's call it N, but only linearly with the number of batched inputs, let's call it M:
  O(M * N^2 * d)
  d is a constant related to the network you're running. Batching, btw, is the reason many tools like Ollama require you to set the context length before serving requests.
  Having many more inputs is way cheaper than having longer inputs. In fact, that this is the case is the reason we went for LLMs in the first place: as this allows training to proceed quickly, batching/"serving many customers" is exactly what you do during training. GPUs came in because taking 10k triangles, and then doing almost the exact same calculation batched 1920*1080 times on them is exactly what happens behind the eyes of Lara Croft.
  And this is simplified because a vector input (ie. M=1) is the worst case for the hardware, so they just don't do it (and certainly not in published benchmark results). Often even older chips are hardwired to work with M set to 8 (and these days 24 or 32) for every calculation. So until you hit 20 customers/requests at the same time, it's almost entirely free in practice.
  Hence: the optimization of subagents. Let's say you need an LLM to process 1 million words (let's say 1 word = 1 token for simplicity)
  O(1 million words in one go) ~ 1e12 or 1 trillion operations
  O(1000 times 1000 words) ~ 1e9 or 1 billion operations
  O(10000 times 100 words) ~ 1e8 or 100 million operations
  O(100000 times 10 words) ~ 1e7 or 10 million operations
  O(one word at a time) ~ 1e6 or 1 million operations
  Of course, to an extent this last way of doing things is the long-known case of a recurrent neural network. Very difficult to train, but if you get it working, it speeds away like Professor Snape confronted with a bar of soap (to steal a Harry Potter joke).
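
The table above can be reproduced with a few lines. This is only a sketch of the simplified cost model from the comment (M chunks of N tokens cost M * N^2, with the constant d dropped); the function name is made up for illustration and it says nothing about real-world throughput.

```python
def relative_cost(total_tokens: int, chunk_tokens: int) -> int:
    """Relative cost of processing total_tokens in chunks of chunk_tokens.

    Uses the simplified M * N^2 model from the comment above, where
    N is the chunk length and M the number of chunks (constant d dropped).
    """
    m = total_tokens // chunk_tokens   # number of chunks (M)
    return m * chunk_tokens ** 2       # M * N^2


total = 1_000_000
for n in (1_000_000, 1_000, 100, 10, 1):
    print(f"{total // n:>9} chunks x {n:>9} tokens -> "
          f"{relative_cost(total, n):.0e} operations")
```

Each tenfold reduction in chunk length cuts the modelled cost tenfold, which is the argument for splitting work across subagents.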
asymmetric 1 day ago
> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud
Do you have some links?
Also I assume the privacy implications are vastly different compared to running locally?
- 0xbadcafebee 1 day ago
  
  Throw a rock and you'll hit one... Groq (not Grok, Elon stole the name), Mistral, SiliconFlow, Clarifai, Hyperbolic, Databricks, Together AI, Fireworks AI, CompactifAI, Nebius Base, Featherless AI, Hugging Face (they do inference too), Cohere, Baseten, DeepInfra, DeepSeek, Novita AI, OpenRouter, xAI, Perplexity Labs, AI21, OctoAI, Reka, Cerebras, Fal AI, Nscale, OVHcloud AI, Public AI, Replicate, SambaNova, Scaleway, WaveSpeedAI, Z.ai, GMI Cloud, Nebius, Tensorwave, Lamini, Predibase, FriendliAI, Shadeform, Qualcomm Cloud, Alibaba Cloud AI, Poe, Bento LLM, BytePlus ModelArk, InferenceAI, IBM watsonx.ai, AWS Bedrock, Microsoft, Google
- spiffytech 1 day ago
  
  I use Ollama Cloud. $20/mo and I never come close to hitting quota (YMMV obviously).
  They don't log anything, and they use US datacenters.
- gunalx 1 day ago
  
  For privacy-preserving direct inference: Fireworks AI, Nebius.
  Otherwise OpenRouter for routing to lots of different providers.
- evolighting 1 day ago
  
  OpenRouter, for example, has both open and closed models.

winternewt 1 day ago

And if you don't want to buy a Mac? An 80 GB NVIDIA GPU costs $10,000 (equivalent to 30 years of ChatGPT Plus subscription) and will probably be obsolete in 5-7 years anyway. What are my options if I want a decent coding agent at a reasonable price?

timschmidt 1 day ago
I'm able to run the Unsloth quants on an ancient dual-socket Xeon 1U server I keep around for homelab stuff. It has 8 DDR3 channels, which gives me about as much memory bandwidth as two channels of DDR5 :-/ But with 16 DIMM slots and cheaper prices, it has 256GB in it right now. I have to run the minimum size Unsloth quant for the largest open weight models. They definitely feel a bit dazed. This machine can support up to 1.5TB of DDR3, which would allow me to run many of the largest models unquantized, but at 1/4 of the already abysmal speeds I see of ~1 token/s, which is only really usable with multiple agents running a kanban-style async development process. Nothing interactive. That said, I picked up the hardware at the local surplus for $25 and it's vintage ~2010. Pretty impressive what this enterprise gear can do.
Power consumption? Don't ask. A subscription is cheaper.
- paganel 1 day ago
  
  > Power consumption
  That's the thing: in the end, power consumption will matter more for the end user who doesn't have money to burn, because I suspect that the cost of power will, in the majority of cases, exceed the price of the HW itself after just a few months of intense use, let's say a year.
  
zepearl 1 day ago
I downloaded Ollama ( https://github.com/ollama/ollama/releases ) and experimented with a few Qwen models ( https://huggingface.co/Qwen/collections ).
My performance when using an RTX 5070 12GiB VRAM, Ryzen 7 9700X 8 cores CPU, 32GiB DDR5 6000MT (2 sticks):
- "qwen2.5:7b": ~128 tokens/second (this model fits 100% in the VRAM).
- "qwen2.5:32b": ~4.6 tokens/second.
- "qwen3:30b-a3b": ~42 tokens/second (this is a MoE model with multiple specialized "brains"; it uses all 12GiB VRAM + 9GiB system RAM, but the GPU usage during tests is only ~25%).
- "qwen3.5:35b-a3b": ~17 tokens/second, but it's highly unstable and crashes -> currently not usable for me.
So currently my sweet spot is "qwen3:30b-a3b" - even if the model doesn't completely fit on the GPU it's still fast enough. "qwen3.5" was disappointing so far, but maybe things will change in the future (maybe Ollama needs some special optimizations for the 3.5-series?).
I would therefore deduce that the most important thing is the amount of VRAM and that performance would be similar even when using an older GPU (e.g. an RTX 3060, also with 12GiB VRAM)?
Performance without a GPU, tested by using a Ryzen 9 5950X 16 cores CPU, 128GiB DDR4 3200 MT:
- "qwen2.5:7b": ~9 tokens/second
- "qwen3:32b": ~2 tokens/second
- "qwen3:30b-a3b": ~16 tokens/second
siquick 1 day ago
Rent an H100 on Modal, which scales down to zero when not in use - you can set the timeout period.
Cold boot times are around 5m but if your usage periods are predictable it can work out ok. Works out at $2 an hour.
Still far more expensive than a ChatGPT sub.
- flyingjoe 1 day ago
  
  Do you have some reference on what setup you're talking about? I'd like to integrate it into my IDE (cursor/vscode) - are there docs on such a setup?
  
segmondy 1 day ago
GPUs are not going obsolete anytime soon. The NVIDIA P40/P100 launched in 2016, 10 years ago, and are still popular in the local space. My first set of GPUs was a bunch of P40s bought 3 years ago for $150 apiece. At one point they went all the way up to $450, but the price is now down to the $200 range. I think I have gotten my value from those, and I suspect I'll still have them crunching out tokens for at least 3 more years. They still beat 90% of CPU/memory inference combos.
- krenerd 1 day ago
  
  Indeed, the point is that it's going for $150
  
Keyframe 1 day ago
> What are my options if I want a decent coding agent at a reasonable price?
I'd even come from another angle. What are my options if I want a decent coding agent on the level of what Claude does, at any given price? Let's say a few tens of thousands of dollars? I've had a limited look at what's available to run locally and nothing is on par.
- renewiltord 1 day ago
  
  Does not exist AFAIK. Even other labs struggle to reach Claude-level performance in real-world tasks. My experience is that no open model is close. You can get an RTX 6000 Pro Blackwell (Max-Q is better because power draw is half). I have heard good things about Qwen3 Coder Next, but I could not get tool calling to perform well, though that's likely pebkac.
  If you want to spend big bucks, get an H200 141 GB, but honestly the RTX 6000 Pro is good enough till you know what you want. Workstation edition is good. It takes care of cooling etc.
  Tbh even better is to just get model through cloud. If you want you can rent GPU. Then see if it’s what you want.
  
atwrk 1 day ago
A Strix Halo with 128GB unified memory is less than $2k and the more suitable alternative to a Mac. I'm pretty happy with my device (Bosgame M5).
- segmondy 1 day ago
  
  the macs outperform it and I figure it's a better general purpose computer than strix halo. if budget is a problem, then a strix halo is a decent alternative.
  
- Keyframe 1 day ago
  
  > A Strix Halo with 128GB unified memory is less than $2k
  Where did you get that price? Wherever I looked it's around 3k euros which is around $3.5k
  
- rookonaut 1 day ago
  
  Can you elaborate more on your use cases, models, setup,...?
  
khalic 1 day ago

You can rent GPUs, this comes with a security, maintenance and performance overhead, but also has a few advantages.
But right now, a Mac is the easiest way because of their memory architecture.
am17an 1 day ago

Honestly you can run this on a 16GB VRAM GPU with llama.cpp. Just try it!

AussieWog93 1 day ago

An even easier way to get into this is simply by downloading a program called LM Studio. You can mount a model and chat to it within 10-15 mins with no experience whatsoever, and no configuration at all.

That said, last time I tried local LLMs (around when gpt-oss came out) it still seemed super gimmicky (or at least niche, I could imagine privacy concerns would be a big deal for some). Very few use cases where you want an LLM but can't benefit immensely from using SOTA models like Claude Opus.

asmor 1 day ago

The financial barrier is kind of the opposite of "easy to run" to me.

As much as I love owning my stack, you'd have to use so much of this to break even vs an inference provider/aggregator with open frontier-ish models. (and personally, I want to use as little as possible)

computerex 1 day ago

As someone who desperately wants to use local models, I lament that there is no way to use them on consumer hardware for serious coding work. I have an RTX 4070 Ti Super and I cannot run any large model with enough context and tps to compare with a remote offering.

giancarlostoro 1 day ago

I have a 24GB Macbook Pro. I will note, do get the 'Pro' models, the Mac Mini and the Macbook Air do not have internal fans. The Macbook Pro has an internal fan, and the Mac Studio (bigger Mac Mini) has a fan. If you get a Mini, you might want to get one of those docks that cools the Mini. Your hardware will get very hot very quickly.

Also, because Apple, in their infinite wisdom, turns the fan on very lazily despite giving you one (I swear it has to hit 100°C before it comes on) and gives you zero control over fan settings, you may want to snag something like TG Pro for the Mac. I wound up buying a license for it; it lets you define the temperature at which your fans kick in and even gives you manual control.

On my 24GB RAM Macbook Pro I have about 16GB available for inference. I use Zed with LM Studio as the back-end. I primarily just use Claude Code, but as you note, I'm sure if I used a beefier Mac with more RAM I could probably handle way more.

There's a few models that are interesting on the Mac with LM Studio that let you call tooling, so it can read your local files and write and such:

mistralai/mistralai-3-3b this one's 4.49GB - So I can increase my context window for it, not sure if it auto-compacts or not, have only just started testing it

zai-org/glm-4.6v-flash - This one is 7.09GB, same thing, only just started testing it.

mistralai/mistral-3-14b-reasoning - This one is 15.2GB just shy of the max, so not a TON of wiggle room, but usable.

If you're Apple or a company that builds things for Macs or other devices, please build something to help with airflow / cooling for the MBP / Mac Mini. It feels ridiculous that it becomes a 100°C device, and I'm not so sure it's great for device health if you want to run inference for longer than the norm.

I will probably buy a new Mac whenever the inference speeds increase at a dramatic enough rate. I sure hope Apple is considering serious options for increasing inference speed.

duskwuff 1 day ago
The Mac Mini does have a fan. It's very quiet, but it's there.
- giancarlostoro 1 day ago
  
  So is it just like the Pro? Do I need to buy the fan software for my wife's mini too? Ridiculous...
hypercube33 1 day ago
How is the Ryzen 395 with 128GB for running models these days?
- mikae1 1 day ago
  
  Also interested.
  
gambiting 1 day ago
>> I will note, do get the 'Pro' models, the Mac Mini and the Macbook Air do not have internal fans
I have a base model M4 Mac Mini and it absolutely does have a fan inside it.
- giancarlostoro 1 day ago
  
  I must have assumed it did not, since my wife's Mini never sounded off the fan, it was hot beyond the norm to the touch, I stopped using it for inference. If the standard model Minis do have fans, I might reconsider instead of a Studio.
  

elorant 1 day ago

Or you can get a Strix Halo from AMD. They run about $2k from various Chinese brands, or a bit more from Framework. 128GB of unified RAM is plenty for most models, although memory bandwidth is lower than on a Mac.

ddxv 1 day ago

I really hope at some point in the near future AI models shrink enough or laptops get strong enough to run AI models locally. I haven't tried in the past year, but when I did it was very slow token output + laptop was on fire to make that happen.

I've wanted to try some of the more recent 8B models for local tab completion or agentic, any experience with those kinds of smaller models?

lioeters 1 day ago

I've been running local language models on an existing laptop with 8GB GPU, currently using ministral-3:8b. It's faster than other models of similar size I used previously, fast enough that I never wait for it, rather have to scroll back to read the full output.
So far I'm using it conversationally, and scripting with tools. I wrote a simple chat interface / REPL in the terminal. But it's not integrated with code editor, nor agentic/claw-like loops. Last time I tried an open-source Codex-like thing, a popular one but I forget its name, it was slow and not that useful for my coding style.
It took some practice but I've been able to get good use out of it, for learning languages (human and programming), translation, producing code examples and snippets, and sometimes bouncing ideas like a rubber-duck method.
segmondy 1 day ago

qwen3-8b is good, and if you are doing tab completion then it's more than adequate. You can get basic agentic use with it, but if you really want to use a serious agent and do some serious work, then at the very least use qwen3.5-27B if you have a 5090 32GB VRAM GPU, or qwen3.5-35b-a3b if you have less than 24GB. If you want to use a laptop, get a laptop with a built-in GPU or iGPU.
ZYZ64738 1 day ago

> NTransformer High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
untested:
https://github.com/xaskasdf/ntransformer
setopt 1 day ago

I had some luck with Ollama + Mistral Nemo models on consumer hardware, it seemed to punch above its "weight class". But it’s still far enough behind ChatGPT et al. that I couldn’t stop using that for real work.

mcv 1 day ago

I've noticed that running models locally is not necessarily easy. I'm currently trying to use Stable Diffusion with Flux2 klein 4b fp4 (because I have a normal GPU and not a specialised setup), and I can't get it to produce anything other than uneven blue.

I haven't tried pure text models, but 27B sounds painful for my system.

drivebyhooting 1 day ago

I have a lenovo workstation with 256GB ram but a weak sauce 12GB VRAM GPU. Is there any DMA trick to improve offload performance?

Macuyiko 1 day ago

Things such as AirLLM, or good old llama.cpp.
segmondy 1 day ago

Use llama.cpp, you will be surprised how fast a model like qwen3.5-35b-a3b will run. That a3b means only 3B active parameters, so during inference the entire active 3B will be on your GPU and you will get amazing performance. For your system, you should use the -cmoe option.

2001zhaozhao 1 day ago

Isn't Q4-Q6 the usual recommendation for quants? Can you explain the Q8 recommendation? I was under the impression that if you can run a model at Q8, you should probably run a bigger model in Q4 instead.

magicalhippo 1 day ago

There are no hard rules regarding quants, except that less quantization is better.
However, models respond very differently, and there are tricks you can do, like limiting quantization of certain layers. Some models generally behave fine down into sub-Q4 territory, while others don't do well below Q8 at all. And then you have the way it was quantized on top of that.
So either find some actual benchmarks, which can be rare, or you just have to try.
As an example, Unsloth recently released some benchmarks[1] which showed Qwen3.5 35B tolerating quantization very well, except for a few layers which were very sensitive.
edit: Unsloth has a page detailing their updated quantization method here[2], which was just submitted[3].
[1]: https://news.ycombinator.com/item?id=47192505
segmondy 1 day ago

If you can run Q8, go for it; always go for the best. It matters a lot with vision models. Never quantize your KV cache; always keep it at f16.
You can always try evals and see if you have a q6 or q4 that can perform better than your q8. For smaller models I go q8. For bigger ones, when I run out of memory, I go q6/q4 and sometimes q3. I run deepseek/kimi-q4, for example.
I suggest for beginners to start with q8 so they can get the best quality and not be disappointed. it's simple to use q8 if you have the memory, choice fatigue and confusion comes in once you start trying to pick other quants...

unmole 1 day ago

The big AI labs are almost certainly selling inference below cost and burning mountains of money. With the insane increase in hardware prices, running models locally just doesn’t make any financial sense.

bjackman 1 day ago
Nobody is saying it makes "financial sense", it's about control.
I have always taken plenty of care to try and avoid becoming dependent on big tech for my lifestyle. Succeeded in some areas, failed in others.
But now AI is a part of so many things I do and I'm concerned about it. I'm dependent on Android but I know with a bit of focus I have a clear route to escape it. Ditto with GMail. But I don't actually know what I'd do tomorrow if Gemini stopped serving my needs.
I think for those of us that _can_ afford the hardware it is probably a good investment to start learning and exploring.
One particular thing I'm concerned about is that right now I use AI exclusively through the clients Google picked for me, coz it makes financial sense. (You don't seem to get free bubble money if you buy tokens via API billing, only consumer accounts). This makes me a bit of a sheep and it feels bad. There's so much innovation happening and basically I only benefit from it in the ways Google chooses.
(Admittedly I don't need local models to fix that particular issue, maybe I should just start paying the actual cost for tokens).
- asmor 1 day ago
  
  Just use an open weight model like GLM-5 behind an aggregator (OpenRouter, NanoGPT) then. That is a commodity market, right now.
- juleiie 1 day ago
  
  It’s a luxury for the wealthy to be honest. At least for now. These prices are ridiculous
AussieWog93 1 day ago

Apparently inference itself is profitable, at least according to an interview I watched with Dario. They even cover the cost of training itself, if you look at it on a model-by-model basis.
The cash burn comes from models ballooning in size - they spend (as an example, not actual numbers) 100M on training + inference for the lifetime of Sonnet 3.5, make 200M from subscriptions/api keys while it's SOTA, but then have to somehow come up with 1B to train Opus 4.0.
To run some other back of the envelope calcs: GLM 4.7 Air (previous "good" local LLM) can generate ~70 tok/s on a Mac Mini. This equates to 2,200 million tokens per year.
OpenRouter charges $0.40 per million tokens, so theoretically if you were using that Mac mini at 100% utilisation you'd be generating $880 per annum "worth" of API usage.
Assuming a power draw of something like 50W, you're only looking at 440kWh per annum. At 20c per kWh that's $90 on power, plus $499 to get the hardware itself. Depreciate that $499 hardware cost over 3 years and you're looking at ~$260 to generate ~$880 in inference income.
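
For anyone who wants to rerun the envelope math with their own numbers, here is a small sketch. The inputs are the figures quoted above (70 tok/s, $0.40 per million tokens, 50W, 20c per kWh, $499 over 3 years); the function names are made up for illustration, and 100% utilisation is the same optimistic assumption as in the comment.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24


def yearly_tokens_millions(tok_per_s: float) -> float:
    """Tokens generated per year, in millions, at 100% utilisation."""
    return tok_per_s * SECONDS_PER_YEAR / 1e6


def yearly_api_value(tok_per_s: float, usd_per_million: float) -> float:
    """What the same tokens would cost at a given API price."""
    return yearly_tokens_millions(tok_per_s) * usd_per_million


def yearly_cost(watts: float, usd_per_kwh: float,
                hardware_usd: float, years: float) -> float:
    """Yearly power bill plus straight-line hardware depreciation."""
    energy_kwh = watts * HOURS_PER_YEAR / 1000
    return energy_kwh * usd_per_kwh + hardware_usd / years


print(round(yearly_tokens_millions(70)))        # millions of tokens per year
print(round(yearly_api_value(70, 0.40)))        # USD of API-equivalent output
print(round(yearly_cost(50, 0.20, 499, 3)))     # USD of power + depreciation
```

The unrounded results land within a few dollars of the rounded figures above, so the conclusion does not depend on the rounding.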
segmondy 1 day ago

We are not in this thread because of finances but because of safety from oppressive governments and bad big corps. It's for you to decide the price of your own safety.
ZenoArrow 1 day ago

RAM and storage price increases due to the AI bubble have certainly made the cost of entry more expensive, but once you have the hardware, running models locally does make financial sense, especially if you have access to home solar power that is sufficient to run the hardware. You can't get much lower running cost than free.