Comment by nzeid
3 days ago
I appreciate the author's modesty but the flip-flopping was a little confusing. If I'm not mistaken, the conclusion is that by "self-hosting" you save money in all cases, but you cripple performance in scenarios where you need to squeeze out the kind of quality that requires hardware that's impractical to cobble together at home or within a laptop.
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
If you ever do it, please make a guide! I've been toying with the same notion myself
If you want to do it cheap, get a desktop motherboard with two PCIe slots and two GPUs.
Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context. 1000 lines of code is ~20k tokens, and 32k tokens is ~10G of VRAM.
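For anyone wanting to sanity-check that "32k tokens is ~10G" figure, here's a rough KV-cache estimate. The layer/head counts below are assumptions modeled on a Qwen2.5-32B-style architecture (64 layers, GQA with 8 KV heads, head dim 128), and real usage will be a bit higher once you add inference buffers:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for the separate K and V tensors cached per layer;
    # bytes_per_elem=2 assumes an fp16 KV cache.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed 32B-class architecture: 64 layers, 8 KV heads, head_dim 128.
gib = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_len=32768) / 2**30
print(f"{gib:.1f} GiB")  # 8.0 GiB at fp16 -- in the ballpark of ~10G with overhead
```

Quantizing the KV cache to q8 roughly halves that, which is one way to stretch a 12G card further.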
Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.
For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
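Once llama-server is up, you talk to it through its OpenAI-compatible endpoint, and llama-swap routes on the "model" field to whichever GGUF you configured. A minimal stdlib sketch (the port and model name here are assumptions, substitute your own):

```python
import json
import urllib.request

def build_request(prompt, model="qwen2.5-32b-q4", base="http://localhost:8080"):
    # llama-server exposes POST /v1/chat/completions like the OpenAI API.
    payload = {
        "model": model,  # llama-swap uses this field to pick/swap the GGUF
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Usage, once the server is running:
# with urllib.request.urlopen(build_request("Review this diff: ...")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because it's the same API shape as OpenAI's, most editor plugins and agent tools can point at it by just changing the base URL.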
If you need more than that, you're into enterprise hardware with 4+ PCIe slots, which costs as much as a car and has the power consumption of a small country. You're better off just paying for Claude Code.
I was going to post snark such as “you could use the same hardware to also lose money mining crypto”, then realized there are a lot of crypto miners out there that could probably make more money serving tokens than they do on crypto. Does such a marketplace exist?
SimonW used to post more articles/guides on local LLM setup, at least until he got the big toys to play with, but his site is still well worth looking through. Although if you are in parts of Europe, the site is blocked at weekends, something to do with the great firewall of streamed sports.
https://simonwillison.net/
Indeed, his self-hosting inspired me to get Qwen3:32B running locally with Ollama. Fits nicely on my M1 Pro 32GB (running Asahi). Output is a nice read-along speed and I haven't felt the need for anything more powerful.
I'd be more tempted by a maxed-out M2 Ultra as an upgrade, versus a tower with dedicated GPU cards. The unified memory just feels right for this task. Although I noticed the second-hand value of those machines has jumped massively in the last few months.
I know that people turn their noses up at local LLMs, but it more than does the job for me. Plus I decided on a New Year's resolution of no more subscriptions / Big-AdTech freebies.
Jeff Geerling has (not quite but sort of) guides: https://news.ycombinator.com/item?id=46338016
Also worth a look is the stuff from Donato Capitella: https://github.com/kyuz0 https://www.youtube.com/@donatocapitella https://llm-chronicles.com/ etc.