Comment by a_victorp

3 days ago

If you ever do it, please make a guide! I've been toying with the same notion myself.

If you want to do it cheap, get a desktop motherboard with two PCIe slots and two GPUs.

Cheap tier is dual 3060 12GB. Runs 24B Q6 and 32B Q4 at ~16 tok/sec. The limitation is VRAM for large context: 1000 lines of code is ~20k tokens, and 32k tokens is ~10GB of VRAM.
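If anyone wants to sanity-check that context figure, here's a back-of-the-envelope KV-cache estimate in Python. The architecture numbers are assumptions (roughly Qwen2.5-32B-like: 64 layers, 8 KV heads via GQA, head dim 128, fp16 cache); exact sizes vary per model:

```python
# Rough KV-cache sizing sketch (assumed Qwen2.5-32B-like shapes, fp16 cache).
n_layers, n_kv_heads, head_dim = 64, 8, 128  # assumption: typical 32B GQA config
bytes_per_elem = 2                           # fp16
ctx = 32_768

# K and V per layer, per token: n_kv_heads * head_dim elements each
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
total = per_token * ctx
print(f"{per_token / 1024:.0f} KiB/token -> {total / 2**30:.1f} GiB at {ctx} ctx")
# ~256 KiB/token -> ~8 GiB; add compute buffers and you land near the ~10G figure
```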

Expensive tier is dual 3090s, 4090s, or 5090s. You'd be able to run a 32B Q8 with large context, or a 70B Q6.
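For rough VRAM budgeting on those tiers, weight size is roughly params × bits-per-weight / 8. A quick sketch, where the bpw figures are approximate averages for common GGUF quants, not exact values:

```python
# Approximate VRAM footprint of quantized weights: params * bpw / 8.
# bpw values are rough GGUF averages (assumption), before context overhead.
quants = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

for params_b in (32, 70):
    for name, bpw in quants.items():
        gb = params_b * 1e9 * bpw / 8 / 1e9
        print(f"{params_b}B {name}: ~{gb:.0f} GB weights")
# 32B Q8_0 at ~34 GB fits a dual 3090/4090 (48 GB) with room for context;
# 70B Q6_K at ~58 GB really wants the dual-5090 (64 GB) end of that tier.
```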

For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
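If you'd rather drive it from code than from llama-server, a minimal sketch with the llama-cpp-python bindings; the model filename and the 50/50 tensor_split are illustrative placeholders, not a recommendation:

```python
from llama_cpp import Llama

# Minimal sketch: load a GGUF and split it across two GPUs.
llm = Llama(
    model_path="models/Qwen2.5-32B-Instruct-Q4_K_M.gguf",  # assumed path/model
    n_gpu_layers=-1,          # offload every layer to GPU
    n_ctx=32_768,             # context length; this is where the VRAM goes
    tensor_split=[0.5, 0.5],  # proportion of the model on each GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this function..."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

llama-swap then sits in front of llama-server and hot-swaps models on demand; the snippet above is just the single-model case.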

If you need more than that, you're into enterprise hardware with 4+ PCIe slots, which costs as much as a car and has the power consumption of a small country. You're better off just paying for Claude Code.

  • I was going to post snark such as “you could use the same hardware to also lose money mining crypto”, then realized there are a lot of crypto miners out there who could probably make more money running tokens than they do on crypto. Does such a marketplace exist?

SimonW used to have more articles/guides on local LLM setup, at least until he got the big toys to play with, but his site is well worth looking through. Although if you are in parts of Europe, the site is blocked at weekends, something to do with the great firewall of streamed sports.

https://simonwillison.net/

Indeed, his self-hosting inspired me to get Qwen3:32B working locally in ollama. It fits nicely on my M1 Pro 32GB (running Asahi). Output is a nice read-along speed and I haven't felt the need for anything more powerful.
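For anyone who wants the same setup scripted, a minimal sketch with the ollama Python client, streaming so you get that read-along pacing (assumes you've already run `ollama pull qwen3:32b`):

```python
import ollama  # pip install ollama; assumes the model tag is already pulled

# Stream tokens as they arrive, matching the "read-along speed" experience.
stream = ollama.chat(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "Review this diff for bugs: ..."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()
```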

I'd be more tempted by a maxed-out M2 Ultra as an upgrade, versus a tower with dedicated GPU cards. The unified memory just feels right for this task. Although I noticed the second-hand value of those machines jumped massively in the last few months.

I know that people turn their noses up at local LLMs, but it more than does the job for me. Plus I made a New Year's resolution of no more subscriptions / Big-AdTech freebies.