Comment by timschmidt

1 day ago

I'm able to run the Unsloth quants on an ancient dual-socket Xeon 1U server I keep around for homelab stuff. It has 8 DDR3 channels, which gives me about as much memory bandwidth as two channels of DDR5 :-/ But it has 16 DIMM sockets and DDR3 is cheaper, so it has 256 GB in it right now. I have to run the minimum-size Unsloth quant for the largest open-weight models. They definitely feel a bit dazed. This machine can support up to 1.5 TB of DDR3, which would allow me to run many of the largest models unquantized, but at 1/4 of the already abysmal ~1 token/s I see now, which is only really usable with multiple agents running a kanban-style async development process. Nothing interactive. That said, I picked up the hardware at the local surplus for $25 and it's vintage ~2010. Pretty impressive what this enterprise gear can do.
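A rough sanity check on the bandwidth comparison, as a sketch only. The DDR3-1333 and DDR5-5600 speed grades are assumptions (the comment does not give them), and these are theoretical peaks for 64-bit channels, not measured throughput. On a dual-socket board each CPU also only reaches half of the channels locally.

```python
def peak_bandwidth_gbs(channels: int, megatransfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * megatransfers_per_s * bus_bytes / 1000


# Assumed speed grades, not taken from the comment.
ddr3_8ch = peak_bandwidth_gbs(channels=8, megatransfers_per_s=1333)
ddr5_2ch = peak_bandwidth_gbs(channels=2, megatransfers_per_s=5600)

print(f"8 x DDR3-1333: {ddr3_8ch:.1f} GB/s")
print(f"2 x DDR5-5600: {ddr5_2ch:.1f} GB/s")
```

Under those assumptions the two configurations land within about 5% of each other, which matches the "about as much" estimate.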

Power consumption? Don't ask. A subscription is cheaper.

2 comments

paganel 1 day ago

> Power consumption

That’s the thing: in the end, power consumption will matter more for the end user who doesn’t have money to burn, because I suspect that in the majority of cases the electricity cost will exceed the price of the hardware itself after just a few months of intense use, or let’s say a year.

timschmidt 1 day ago

Assuming models of a fixed size continue to improve in capability, continued advancement in semiconductors and optimization will reduce power consumption and/or improve performance over time. And used equipment will always approach the scrap price eventually. For me today, on scrap equipment, I get about 4 tokens / watt-hour. Electricity is nominally ~$0.17 US per kWh but could run $0.40 after all the taxes and fees and surcharges. That works out to about $0.10 per 1,000 tokens. Ouch.
If I were to purpose-build a rig for it, I would get an engineering-sample Epyc/motherboard/RAM combo from AliExpress with 12 channels of DDR5 and as few cores as would still let me use all the memory bandwidth, and I'd run it at the lowest possible power and voltage settings with aggressive RAM timings. A system like that can draw 1/3 of what my scrap rig draws, at full load. It also has similar memory bandwidth to a high-end Mac or GPU, allowing it to crank out 5 - 10 tokens / s on the largest models, which works out to 1/3 of a penny to 2/3 of a penny per 1,000 tokens. But either way, Epyc or Mac is going to set you back $10k or more. Hopefully in a few years when they are scrap though...
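The electricity arithmetic can be sketched as follows, using only the figures quoted in this thread: ~1 token/s and ~4 tokens per watt-hour on the scrap rig, $0.40 per kWh all-in, and a purpose-built rig drawing 1/3 of the power at 5 - 10 tokens/s. The ~900 W draw is not stated anywhere; it is only what the first two figures imply together. The function names are illustrative.

```python
def implied_power_watts(tokens_per_s: float, tokens_per_wh: float) -> float:
    """Power draw implied by a throughput and an energy efficiency figure."""
    return tokens_per_s * 3600 / tokens_per_wh


def usd_per_1k_tokens(tokens_per_s: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of generating 1,000 tokens at a steady load."""
    tokens_per_kwh = tokens_per_s * 3600 / (watts / 1000)
    return 1000 * usd_per_kwh / tokens_per_kwh


PRICE = 0.40  # USD per kWh after taxes and fees, from the comment

# Scrap rig: ~1 token/s at ~4 tokens/Wh implies roughly 900 W (derived, not measured).
scrap_watts = implied_power_watts(tokens_per_s=1.0, tokens_per_wh=4.0)
scrap_cost = usd_per_1k_tokens(1.0, scrap_watts, PRICE)

# Purpose-built rig: 1/3 of the scrap rig's draw, 5 to 10 tokens/s.
epyc_watts = scrap_watts / 3
epyc_cost_slow = usd_per_1k_tokens(5.0, epyc_watts, PRICE)
epyc_cost_fast = usd_per_1k_tokens(10.0, epyc_watts, PRICE)

print(f"scrap rig: ~{scrap_watts:.0f} W, ${scrap_cost:.2f} per 1k tokens")
print(f"Epyc rig:  ~{epyc_watts:.0f} W, ${epyc_cost_fast:.4f} to ${epyc_cost_slow:.4f} per 1k tokens")
```

With these inputs the scrap rig comes to about $0.10 per 1,000 tokens and the purpose-built rig to roughly 1/3 to 2/3 of a cent per 1,000 tokens. Hardware cost is left out entirely.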