Comment by coder543
2 days ago
Yes, you can offload random experts to the GPU, but it will still be activating experts that are on the CPU, completely tanking performance. It won't suddenly make things fast. One of these GPUs is not enough for this model.
You're better off prioritizing offloading the KV cache and attention layers to the GPU rather than trying to pin a specific expert or two, but the performance loss I was talking about earlier still means a single 96GB GPU can't hold enough of the model to get acceptable speed. You need multiple, or you need a Mac Studio.
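For a rough sense of the scale involved, here's a back-of-the-envelope sketch (not a benchmark). The parameter counts, quantization size, and bandwidth figures are illustrative assumptions I'm plugging in, not published specs for GLM-4.7 or any particular GPU:

```python
# Back-of-the-envelope sketch of why one 96 GiB GPU struggles with a huge MoE.
# Every number below is an illustrative assumption, not a published spec.

GIB = 1024**3

total_params     = 355e9   # assumed total parameters of the MoE
active_params    = 32e9    # assumed parameters activated per token
bytes_per_weight = 0.55    # ~4.4 bits/weight, typical of a 4-bit quant
vram_gib         = 96      # the GPU in question
gpu_bw_gib_s     = 1600    # assumed GPU memory bandwidth
cpu_bw_gib_s     = 100     # assumed system RAM bandwidth

weights_gib = total_params * bytes_per_weight / GIB
frac_on_gpu = min(1.0, vram_gib / weights_gib)
print(f"quantized weights ~{weights_gib:.0f} GiB; "
      f"~{frac_on_gpu:.0%} of them fit in {vram_gib} GiB of VRAM")

# Decode is roughly memory-bandwidth bound: each token reads the active
# parameters once.  Assuming experts are hit uniformly, reads that land on
# CPU-resident experts go at system RAM speed instead of VRAM speed.
active_gib = active_params * bytes_per_weight / GIB
sec_per_token = active_gib * (frac_on_gpu / gpu_bw_gib_s +
                              (1 - frac_on_gpu) / cpu_bw_gib_s)
print(f"~{1 / sec_per_token:.1f} tokens/s with ~{frac_on_gpu:.0%} of weights on GPU")
print(f"~{gpu_bw_gib_s / active_gib:.0f} tokens/s if everything fit in VRAM")
```

With those assumed numbers you get roughly 180 GiB of quantized weights, so only about half fits on the card, and decode ends up several times slower than it would be if the whole model lived in VRAM.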
If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.
> If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.
Absolutely, and the same goes if they buy a $10K Mac: immense disappointment ahead.
Best is of course to start looking at models that fit within 96GB, but that'd make too much sense.
$10k is more than 4 years of a $200/mo sub to models that are currently far better, get upgraded frequently, and have improved tremendously in the last year alone.
This feels more like a retro-computing hobby than anything aimed at genuine productivity.
I don't think the calculation is that simple. With your own hardware, there are literally no limits on runtime, on which models you use, on what tooling you use, or on availability; all of that is up to you.
Maybe I'm old school, but I prefer those benefits over a cost/benefit analysis stretched across 4 years, when by the time we're 20% through it, everything will have changed.
But I also use this hardware for training my own models, not just inference and not just LLMs. I'd agree with you if we were talking about LLM inference alone.
They are better in some ways, but they're also neutered.