Comment by mechagodzilla
12 days ago
You can keep scaling down! I spent $2k on an old dual-socket xeon workstation with 768GB of RAM - I can run Deepseek-R1 at ~1-2 tokens/sec.
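For context, ~1-2 tokens/sec is roughly what a memory-bandwidth-bound estimate predicts for a big MoE model on older CPUs. A back-of-envelope sketch in Python (the bandwidth and quantization figures are assumptions, not measurements from this machine):

    # Rough ceiling on CPU decode speed: each generated token must stream
    # the active expert weights through memory at least once.
    active_params = 37e9    # DeepSeek-R1 activates ~37B of its 671B params/token
    bytes_per_param = 1.0   # assumed ~8-bit quantization
    eff_bandwidth = 100e9   # assumed effective dual-socket DDR4 bandwidth, B/s
    tokens_per_sec = eff_bandwidth / (active_params * bytes_per_param)
    print(f"bandwidth-bound ceiling: ~{tokens_per_sec:.1f} tokens/sec")  # ~2.7

Real throughput lands below that ceiling once NUMA traffic and attention overhead are counted, which is consistent with the 1-2 tokens/sec reported here.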
Just keep going! 2TB of swap disk for 0.0000001 t/sec
Hang on, starting benchmarks on my Raspberry Pi.
By the year 2035, toasters will run LLMs.
On a lark, a friend set up Ollama on an 8GB Raspberry Pi with one of the smaller models. It worked, but it was very slow; IIRC it did about 1 token/second.
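If you want to reproduce that kind of measurement, Ollama's local REST API streams one JSON object per generated token and reports timing stats in the final one. A minimal sketch, assuming `ollama serve` is running on its default port and a small model has already been pulled (the model name below is a placeholder):

    # Sketch: measure decode speed from a local Ollama server on the Pi.
    import json, urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "tinyllama", "prompt": "Hello"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:             # one JSON object per streamed token
            chunk = json.loads(line)
            if chunk.get("done"):     # final object carries eval statistics
                tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
                print(f"{tps:.2f} tokens/sec")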
I did the same, then put in 14 3090's. It's a little bit power hungry but fairly impressive performance-wise. The hardest parts are power distribution and riser cards, but I found good solutions for both.
I think 14 3090's are more than a little power hungry!
To the point that I had to pull an extra circuit... but it's three-phase, so I'm good to go even if I'd like to go bigger.
I've limited power consumption to what I consider the optimum: each card will draw ~275 watts (you can configure this very nicely on a per-card basis). The server itself also uses some power for the motherboard. The whole rig is powered from four 1600W supplies; the GPUs are divided 5/5/4 and the motherboard is connected to its own supply. It's a bit close to the edge for the supplies that have five 3090's on them, but so far it has held up quite well, even with higher ambient temps.
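(One way to set such a per-card cap is nvidia-smi's power-limit flag; a minimal sketch, assuming root access and 14 cards indexed 0-13:)

    # Sketch: cap each 3090's board power draw with nvidia-smi's -pl flag.
    # Valid limits depend on the card's supported range; requires root.
    import subprocess

    POWER_LIMIT_W = 275   # the per-card draw chosen above
    NUM_GPUS = 14

    for idx in range(NUM_GPUS):
        subprocess.run(
            ["nvidia-smi", "-i", str(idx), "-pl", str(POWER_LIMIT_W)],
            check=True,
        )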
Interesting tidbit: at 4 lanes/card throughput is barely impacted, 1 or 2 is definitely too low. 8 would be great but the CPUs don't have that many lanes.
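(To check what link width each card actually negotiated, nvidia-smi can report current vs. maximum lanes; a small sketch:)

    # Sketch: report each GPU's negotiated PCIe link width via nvidia-smi.
    import subprocess

    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,pcie.link.width.current,pcie.link.width.max",
         "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout)  # e.g. "0, 4, 16" for a card running at x4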
I also have a Threadripper that should be able to handle that much RAM, but at current RAM prices that's not interesting. (The Xeon server I could populate with RAM I still had that fit that board, plus some more I bought from a refurbisher.)
You get occasional accounts of 3090 home-superscalers where they put up eight, ten, fourteen cards; I normally attribute this to obsessive-compulsive behaviour. What kind of motherboard did you end up using, and what bi-directional bandwidth are you seeing? Something tells me you're not using EPYC 9005's with up to 256 PCIe 5.0 lanes per socket or something... Also: I find it hard to believe the performance claims when your rig is pulling 3 kW from the wall (assuming undervolting at ~200W per card). The electricity costs alone would surely make this intractable, i.e. the same as running six washing machines all at once.
I love your skepticism about what I consider to be a fairly normal project; this is not to brag, simply to document.
And I'm way above 3 kW, more likely 5000 to 5500 watts with the GPUs running as high as I'll let them, or thereabouts, but I only have one power meter and it maxes out at 2500 watts or so. This is using two Xeons on a very high-end but slightly older motherboard. When it runs, the space it is in becomes hot enough that even in winter I have to use forced air from outside, otherwise it will die.
As for electricity costs, I have 50 solar panels, and on a good day they more than offset the electricity use; at 2 pm (solar noon here) I'd still be pushing 8 kW extra back into the grid. This obviously does not work out so favorably in the winter.
Building a system like this isn't very hard; it is just a lot of money for a private individual, but I can afford it. I think this build is a bit under $10K, a fraction of what you'd pay for a commercial solution, though obviously far less polished and still less performant. But it is a lot of bang for the buck, and I'd much rather have this rig at $10K than the first commercial solution available at a multiple of this.
I wrote a bit about power efficiency in the run-up to this build when I only had two GPUs to play with:
https://jacquesmattheij.com/llama-energy-efficiency/
My main issue with the system is that it is physically fragile; I can't transport it at all. You basically have to take it apart, move the parts, and re-assemble it on the other side. It's just too heavy, and the power distribution is messy, so you end up with a lot of loose wires and power supplies. I could make a complete enclosure for everything, but this machine is not running permanently; when I need the space for other things I just take it apart and store the GPUs in their original boxes until the next home AI project. Putting it all together is about 2 hours of work. We call it Frankie, on account of how it looks.
edit: one more note, the noise it makes is absolutely incredible and I would not recommend running something like this in your house unless you (1) are crazy or (2) have a separate garage where you can install it.
And if you get bored of that, you can flip the RAM for more than you spent on the whole system!
And heat the whole house in parallel
Nice! What do you use it for?
1-2 tokens/sec is perfectly fine for 'asynchronous' queries, and the open-weight models are pretty close to frontier quality (maybe a few months behind?). I frequently use it for a variety of research topics, feasibility studies for wacky ideas, and some prototype-stage coding tasks. I usually give it a prompt and come back half an hour later to see the results (although the thinking traces are sufficiently entertaining that sometimes it's fun to just read them as they come out). Being able to see the full thinking traces (and pause and alter/correct them if needed) is one of my favorite aspects of running these models locally. The thinking traces are frequently just as useful as the final outputs, or more so.
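For the curious, "pausing and altering" a trace works with any raw-completion endpoint: stop generation, edit the partial trace, and resume from it. A minimal sketch against llama.cpp's local server (the port, chat-template tokens, and edited text below are assumptions; use your model's actual template):

    # Sketch: resume generation from a hand-edited thinking trace.
    import json, urllib.request

    edited_trace = "Wait, the earlier estimate double-counted the KV cache. "
    prompt = (
        "<|User|>How much RAM does a 671B MoE model need?<|Assistant|>"
        "<think>" + edited_trace   # model continues from the corrected trace
    )
    req = urllib.request.Request(
        "http://localhost:8080/completion",
        data=json.dumps({"prompt": prompt, "n_predict": 512}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"])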