
Comment by winternewt

2 days ago

And if you don't want to buy a Mac? An 80 GB NVIDIA GPU costs $10,000 (equivalent to roughly 40 years of ChatGPT Plus at $20/month) and will probably be obsolete in 5-7 years anyway. What are my options if I want a decent coding agent at a reasonable price?

I'm able to run the Unsloth quants on an ancient dual-socket Xeon 1U server I keep around for homelab stuff. It has 8 DDR3 channels, which gives me about as much memory bandwidth as two channels of DDR5 :-/ But it has 16 DIMM slots and the sticks are much cheaper, so there's 256 GB in it right now. I have to run the minimum-size Unsloth quant for the largest open-weight models, and they definitely feel a bit dazed. The machine can take up to 1.5 TB of DDR3, which would let me run many of the largest models unquantized, but at 1/4 of the already abysmal ~1 token/s I see now. That's only really usable with multiple agents running a kanban-style async development process; nothing interactive. That said, I picked up the hardware at the local surplus for $25 and it's vintage ~2010. Pretty impressive what this enterprise gear can do.
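The bandwidth claim above can be sanity-checked with simple arithmetic: peak bandwidth is channels times transfer rate times the 8-byte bus width. The specific speed grades (DDR3-1333 for a ~2010 Xeon, DDR5-6000 for a modern desktop) are my assumption, not stated in the comment:

```python
# Rough peak memory bandwidth: channels * transfer rate (MT/s) * 8 bytes per transfer.
# Speed grades below are illustrative assumptions.

def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Peak theoretical bandwidth in GB/s for a given channel count and speed."""
    return channels * mts * bus_bytes / 1000

ddr3_octa = peak_bandwidth_gbs(channels=8, mts=1333)   # 2010-era dual-socket Xeon
ddr5_dual = peak_bandwidth_gbs(channels=2, mts=6000)   # modern dual-channel desktop

print(f"8x DDR3-1333: {ddr3_octa:.0f} GB/s")   # ~85 GB/s
print(f"2x DDR5-6000: {ddr5_dual:.0f} GB/s")   # ~96 GB/s
```

So eight channels of old DDR3 really do land in the same ballpark as two channels of DDR5, which is why the box is usable at all for inference.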

Power consumption? Don't ask. A subscription is cheaper.

  • > Power consumption

    That's the thing: at the end of it all, power consumption will matter more for the end user who doesn't have money to burn, because I suspect that in the majority of cases power consumption will exceed the price of the hardware itself within a few months to a year of intense use.

    • Assuming models of a fixed size continue to improve in capability, continued advancement in semiconductors and optimization will reduce power consumption and/or improve performance over time. And used equipment always approaches the scrap price eventually. For me today, on scrap equipment, I get about 4 tokens per watt-hour; electricity is nominally ~$0.17/kWh here but can run $0.40 after all the taxes, fees and surcharges. That works out to $0.10 per thousand tokens. Ouch.

      If I were to purpose-build a rig for this, I would get an engineering-sample Epyc/motherboard/RAM combo from AliExpress with 12 channels of DDR5 and as few cores as would still let me use all the memory bandwidth, and I'd run it at the lowest possible power and voltage settings with aggressive RAM timings. A system like that can draw 1/3 of what my scrap rig draws at full load, and it has memory bandwidth similar to a high-end Mac or GPU, letting it crank out 5-10 tokens/s on the largest models. That works out to 1/3 to 2/3 of a penny per thousand tokens. But either way, Epyc or Mac is going to set you back $10k or more. Hopefully in a few years, when they're scrap though...
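The electricity cost figures in the two comments above follow from simple unit arithmetic. A minimal sketch, using the numbers stated there (4 tokens/Wh on the scrap rig, $0.40/kWh after fees, and a hypothetical Epyc rig at 1/3 the power and 5-10x the speed, i.e. 15-30x the tokens per watt-hour):

```python
# Electricity cost per thousand generated tokens, from throughput per watt-hour
# and the local electricity rate. Numbers are from the thread above.

def cost_per_1k_tokens(tokens_per_wh: float, usd_per_kwh: float) -> float:
    """USD of electricity per 1000 tokens: ($/kWh over 1000 Wh) / (tokens per Wh)."""
    return usd_per_kwh / tokens_per_wh

scrap = cost_per_1k_tokens(tokens_per_wh=4, usd_per_kwh=0.40)        # $0.10 / 1k tokens
epyc_slow = cost_per_1k_tokens(tokens_per_wh=4 * 15, usd_per_kwh=0.40)  # ~2/3 cent
epyc_fast = cost_per_1k_tokens(tokens_per_wh=4 * 30, usd_per_kwh=0.40)  # ~1/3 cent

print(f"scrap rig: ${scrap:.2f} per 1k tokens")
print(f"epyc rig:  ${epyc_fast:.4f} - ${epyc_slow:.4f} per 1k tokens")
```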

I downloaded Ollama ( https://github.com/ollama/ollama/releases ) and experimented with a few Qwen models ( https://huggingface.co/Qwen/collections ).

My performance with an RTX 5070 (12 GiB VRAM), a Ryzen 7 9700X (8 cores) and 32 GiB of DDR5-6000 (2 sticks):

  - "qwen2.5:7b": ~128 tokens/second (this model fits 100% in the VRAM).
  - "qwen2.5:32b": ~4.6 tokens/second.
  - "qwen3:30b-a3b": ~42 tokens/second (a MoE model with multiple specialized "expert" sub-networks; it uses all 12 GiB of VRAM plus ~9 GiB of system RAM, but GPU usage during the tests is only ~25%).
  - "qwen3.5:35b-a3b": ~17 tokens/second, but it's highly unstable and crashes -> currently not usable for me.

So currently my sweet spot is "qwen3:30b-a3b" - even though the model doesn't completely fit on the GPU, it's still fast enough. "qwen3.5" has been disappointing so far, but maybe things will change in the future (maybe Ollama needs some special optimizations for the 3.5 series?).

I would therefore deduce that the most important factor is the amount of VRAM, and that performance would be similar even with an older GPU (e.g. an RTX 3060, which also has 12 GiB of VRAM)?
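The "does it fit in VRAM" question above can be estimated from parameter count and quantization width. A rough sketch, assuming the ~4-bit quants Ollama ships by default and an arbitrary 1.2x overhead factor for KV cache and runtime (my assumption, not a measured value):

```python
# Rough weight-memory estimate: parameters x bits per weight, plus an assumed
# overhead factor for KV cache and runtime buffers.

def model_memory_gib(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Approximate GiB needed to hold a quantized model's weights at runtime."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30

print(f"7B  @ 4-bit: {model_memory_gib(7, 4):.1f} GiB")   # fits in 12 GiB VRAM
print(f"32B @ 4-bit: {model_memory_gib(32, 4):.1f} GiB")  # spills to system RAM
```

This matches the numbers above: the 7B model fits entirely in a 12 GiB card (~128 tok/s), while the 32B model spills into system RAM and collapses to ~4.6 tok/s.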

Performance without a GPU, tested on a Ryzen 9 5950X (16 cores) with 128 GiB of DDR4-3200:

  - "qwen2.5:7b": ~9 tokens/second
  - "qwen3:32b": ~2 tokens/second
  - "qwen3:30b-a3b": ~16 tokens/second
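The CPU numbers above are consistent with decode being memory-bandwidth bound: each token has to stream the active weights through RAM once, so tokens/s is capped at bandwidth divided by active-weight bytes. This is why the MoE model (only ~3B active parameters per token) is so much faster than the dense 32B. A sketch, assuming dual-channel DDR4-3200 (~51 GB/s) and 4-bit quants (my assumptions about this box):

```python
# Bandwidth ceiling on decode speed: tok/s <= RAM bandwidth / active-weight bytes.

def max_tokens_per_sec(bandwidth_gbs: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s for a memory-bandwidth-bound model."""
    active_gb = active_params_b * bits_per_weight / 8   # GB streamed per token
    return bandwidth_gbs / active_gb

bw = 2 * 3200 * 8 / 1000   # ~51 GB/s, dual-channel DDR4-3200

print(f"32B dense @ 4-bit:   {max_tokens_per_sec(bw, 32, 4):.1f} tok/s ceiling")
print(f"30B-A3B MoE @ 4-bit: {max_tokens_per_sec(bw, 3, 4):.1f} tok/s ceiling")
```

The dense 32B ceiling (~3 tok/s) lines up with the observed ~2 tok/s; the MoE ceiling (~34 tok/s) leaves headroom above the observed ~16 tok/s, since real runs also pay for routing and expert churn.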

Rent an H100 on Modal, which scales down to zero when not in use - you can set the timeout period.

Cold boot times are around 5 minutes, but if your usage periods are predictable it can work out OK. It comes to about $2 an hour.

Still far more expensive than a ChatGPT sub.
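The break-even point between renting and subscribing is a one-liner. Using the $2/hour figure from the comment above and a $20/month ChatGPT Plus subscription (the actual Modal rate may differ):

```python
# How many GPU-hours per month before a $20/mo subscription beats a $2/hr rental?

sub_per_month = 20.0   # ChatGPT Plus
gpu_per_hour = 2.0     # rented H100, scale-to-zero

breakeven_hours = sub_per_month / gpu_per_hour
print(f"Renting is cheaper only under {breakeven_hours:.0f} GPU-hours per month")
```

Ten hours of actual GPU time a month is not much for agentic coding, which is why the rental route rarely pencils out for interactive use.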

GPUs are not going obsolete anytime soon. The NVIDIA P40/P100 launched in 2016, ten years ago, and is still popular in the local-inference space. My first set of GPUs was a bunch of P40s bought three years ago for $150 apiece. At one point they went all the way up to $450, but the price is now back down to the $200 range. I think I have gotten my value out of them, and I suspect they'll still be crunching out tokens for at least three more years. They still beat 90% of CPU/memory inference combos.

  • Indeed, the point is that it's going for $150

    • My point is that no one should be buying expensive GPUs when you can pick up a few used ones to get started. But for the sake of discussion, let's say you do get a Blackwell RTX Pro 6000 that's now going for $10,000. I can assure you it will not be $150 ten years from now; with the falling dollar, demand for AI inference and hardware shortages, it might cost exactly the same ten years from now...

> What are my options if I want a decent coding agent at a reasonable price?

I'd even come at it from another angle: what are my options if I want a decent coding agent, on the level of what Claude does, at any given price? Let's say a few tens of thousands of dollars? I've had a limited look at what's available to run locally, and nothing is on par.

  • Does not exist AFAIK. Even other labs struggle to match Claude-level performance on real-world tasks. My experience is that no open model comes close. You can get an RTX 6000 Pro Blackwell (the Max-Q edition is better: its power draw is half). I have heard good things about Qwen3 Coder Next, but I could not get tool calling to perform well - though that's likely PEBKAC.

    If you want to spend big bucks, get an H200 (141 GB), but honestly the RTX 6000 Pro is good enough until you know what you want. The workstation edition is good; it takes care of cooling etc.

    Tbh, even better is to just use a model through the cloud. If you want, you can rent a GPU. Then see if it's what you want.

    • The gist of it is that no matter how much money you spend on hardware, you will not get the same quality you get from Claude. The main question is then: what can you run that's good enough? I haven't tested everything that's available, but nothing I did see comes even close.

A Strix Halo with 128GB unified memory is less than $2k and the more suitable alternative to a mac. I'm pretty happy with my device (Bosgame M5).

  • The Macs outperform it, and I figure a Mac is a better general-purpose computer than a Strix Halo. If budget is a problem, then a Strix Halo is a decent alternative.

    • Well a mac isn't really an alternative to a mac, or is it? ;)

      Personally I'm not interested in having a mac as I work with linux. And yes, they outperform them, but only if you ignore the price. When comparing what you get for ~$2k, a Strix Halo is miles ahead.

    • Mac doesn't run Linux so in my books is a worse general purpose computer than a Strix Halo box.

  • > A Strix Halo with 128GB unified memory is less than $2k

    Where did you get that price? Everywhere I looked it's around €3k, which is about $3.5k.

  • Can you elaborate more on your use cases, models, setup,...?

    • I'm not really using them for coding (only played a little bit with minimax2.1), which is probably the most common use case here.

      I mainly use them for deep work with texts and for deep research. My main criterion is privacy, both for legal reasons (I'm in the EU and can't and don't want to expose customers' data to non-GDPR-compliant services) and personally (I wouldn't use US services either; e.g. I would never explore health-related topics with ChatGPT or Gemini, for obvious reasons).

      Technically, I've set it up in my office with llama.cpp and exposed it (both the chat interface and the OpenAI-compatible API) through a simple WireGuard tunnel, behind nginx with HTTP auth. Now I can use it from anywhere. It's a small, quiet and pretty fast machine (compiling llama.cpp takes around 20 seconds?); I quite like it.
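The setup described above (llama.cpp behind nginx with basic auth, reachable over WireGuard) can be sketched as a vhost like the following. The hostname, paths and upstream port are placeholders, not the commenter's actual config; llama.cpp's `llama-server` serves both its chat UI and the OpenAI-compatible API on one port (8080 by default):

```nginx
# Sketch only: basic-auth reverse proxy in front of llama-server.
server {
    listen 443 ssl;
    server_name llm.example.internal;        # resolvable only inside the WireGuard tunnel

    auth_basic           "LLM";
    auth_basic_user_file /etc/nginx/htpasswd;  # created with `htpasswd -c`

    location / {
        proxy_pass http://127.0.0.1:8080;    # llama-server: chat UI + /v1/... API
        proxy_read_timeout 600s;             # allow long generations
    }
}
```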

You can rent GPUs; this comes with security, maintenance and performance overheads, but it also has a few advantages.

But right now, a Mac is the easiest way because of their memory architecture.