Comment by TeMPOraL
1 year ago
To add another datapoint, I've been running the 131GB (140GB on disk) 1.58-bit dynamic quant from Unsloth with 4k context on my 32GB Ryzen 7 2700X (8 cores, 3.70 GHz), and achieved exactly the same speed - around 0.15 tps on average, sometimes dropping to 0.11 tps, occasionally going up to 0.16 tps. Roughly half your specs, a quant roughly half the size, same tps.
I had to disable the overload safeties in LM Studio and tweak some loader parameters to get the model to run mostly from disk (NVMe SSD), but once it did, it also used very little CPU!
I tried offloading to GPU, but my RTX 4070 Ti (12GB VRAM) can take at most 4 layers, and it turned out to make no difference in tps.
My RAM is DDR4; maybe switching to DDR5 would improve things? Testing that would require replacing everything but the GPU, though, as my motherboard is too old :/.
More channels > faster RAM.
Some math:
DDR5-6000: 3000 MHz x 2 (double data rate) x 64 bits / 8 bits per byte = 48,000 MB/s = 48 GB/s per channel.
DDR3-1866: 933 MHz x 2 x 64 bits / 8 = ~14,930 MB/s ≈ 14.93 GB/s per channel. With 4 channels, that's 4 x 14.93 ≈ 59.72 GB/s.
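A quick way to sanity-check these figures, in Python (the DDR4-2933 line is just an illustrative dual-channel desktop example, not a measurement from either machine):

    # Rough theoretical peak bandwidth for DDR memory.
    # transfer_rate_mts is the "DDR speed" number (MT/s), i.e. clock MHz x 2.
    def peak_bandwidth_gbs(transfer_rate_mts, channels=1, bus_bits=64):
        bytes_per_transfer = bus_bits / 8  # 64-bit channel -> 8 bytes per transfer
        return transfer_rate_mts * bytes_per_transfer * channels / 1000  # MB/s -> GB/s

    print(peak_bandwidth_gbs(6000))               # DDR5-6000, one channel       -> 48.0
    print(peak_bandwidth_gbs(1866, channels=4))   # DDR3-1866, four channels     -> ~59.7
    print(peak_bandwidth_gbs(2933, channels=2))   # e.g. DDR4-2933, dual channel -> ~46.9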
For a 131GB model, the biggest difference would be to fit it all in RAM, e.g. by getting 192GB of RAM. Sorry if this is too obvious, but it's pointless to run an LLM that doesn't fit in RAM, even if it's an MoE model. And, also obviously, it may take a server motherboard and CPU to fit that much RAM.
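To put a number on why fitting it in RAM matters: token generation is roughly memory-bandwidth-bound, so a crude upper bound on tps is sustained read bandwidth divided by the bytes that must be read per token. A rough sketch in Python (the per-token read sizes and bandwidth figures below are illustrative assumptions, not measurements):

    # Crude upper bound: tokens/s <= sustained_read_bandwidth / bytes_read_per_token.
    # For a dense model that's roughly the whole quantized model per token;
    # for an MoE, only the experts active for that token get touched.
    def tps_upper_bound(bytes_read_per_token_gb, bandwidth_gbs):
        return bandwidth_gbs / bytes_read_per_token_gb

    # Illustrative numbers only:
    print(tps_upper_bound(131, 96))   # whole 131GB model, dual-channel DDR5-6000 (2 x 48 GB/s) -> ~0.73
    print(tps_upper_bound(131, 3))    # whole model streamed from a ~3 GB/s NVMe SSD            -> ~0.02
    print(tps_upper_bound(10, 96))    # MoE touching only ~10GB of experts per token            -> ~9.6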
I wonder if one could just replicate the "Mac mini LLM cluster" setup over some form of Ethernet, with 128GB of DDR4 RAM per node. Used DDR4 RAM with likely dead bits is dirt cheap, but I imagine there would be challenges linking the systems together.
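For what it's worth, the biggest challenge is probably raw interconnect bandwidth: even fast Ethernet is one to two orders of magnitude slower than local RAM, so anything that shuffles weights between nodes per token is a non-starter; you'd want a pipeline-style split where only activations cross the wire. A quick comparison in Python (nominal link speeds and a DDR4-3200 dual-channel example, all assumptions):

    # Nominal link speeds (gigabits/s) vs. a dual-channel DDR4-3200 setup.
    links_gbps = {"1 GbE": 1, "10 GbE": 10, "25 GbE": 25}
    ddr4_dual_gbs = 3200 * 8 * 2 / 1000  # 3200 MT/s x 8 bytes x 2 channels = 51.2 GB/s

    for name, gbps in links_gbps.items():
        link_gbs = gbps / 8  # bits -> bytes
        print(f"{name}: {link_gbs:.3f} GB/s, ~{ddr4_dual_gbs / link_gbs:.0f}x slower than local RAM")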