Comment by jsnell
2 months ago
I'm not an ML researcher or engineer, so take this with a grain of salt, but I'm a bit confused by this post.
DeepSeek V3/R1 are expensive to run locally because they are so big compared to the models people usually run locally. The number of active parameters is obviously lower than the full model size, but that basically just helps with the compute requirements, not the memory requirements. Unless you have multiple H100s lying around, V3/R1 can only be run locally as an impractical stunt, with some or all of the model stored in low-bandwidth memory.
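To put rough numbers on that (a back-of-envelope sketch of my own, assuming roughly 671B total / 37B active parameters and 8-bit weights):

```python
# Back-of-envelope sketch (my numbers, approximate): why the *total* parameter
# count, not the active parameter count, sets the memory requirement.

TOTAL_PARAMS = 671e9   # DeepSeek V3/R1 total parameters (approximate)
ACTIVE_PARAMS = 37e9   # parameters active per token (approximate)
BYTES_PER_PARAM = 1    # assuming 8-bit weights
GPU_MEMORY_GB = 80     # one H100

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB "
      f"(~{weights_gb / GPU_MEMORY_GB:.0f} H100s just to hold them)")

# Compute per token scales with the *active* parameters, which is why MoE
# helps with FLOPs but not with the memory you need to keep resident.
active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights touched per token: ~{active_gb:.0f} GB")
```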
We can't compare the size of DeepSeek V3 to that of any proprietary frontier model, because we don't know the size of those models at all (or even their architecture). The models it's being compared to, the ones that are supposedly "expensive at scale", can't be run locally at all, but surely we have no reason to believe they'd somehow be cheap to run locally?
But wouldn't you typically expect exactly the opposite of the effect claimed here? MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.
> Bigger batches raise latency because user tokens might be waiting up to 200ms before the batch is full enough to run, but they boost throughput by allowing larger (and thus more efficient) GEMMs in the feed-forward step
Is it really that the matrices being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute. The matrices are already sharded to a much smaller size than the entire model or even a single layer. So you'll basically load some slice of the weights from HBM into SRAM, do the multiplication for that slice, and then aggregate the results once all tiles have been processed. Batching lets you do multiple separate computations with the same weights, meaning you get more effective FLOPS per unit of memory bandwidth.
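To make the "more FLOPS per unit of memory bandwidth" point concrete, here's a rough sketch (illustrative dimensions, nothing specific to any real model): a weight tile gets loaded once per batch, so the arithmetic you do per byte of weights loaded scales with the batch size.

```python
# Rough arithmetic-intensity sketch (illustrative dimensions, fp16 weights):
# a GEMM of shape (batch x d_in) @ (d_in x d_out) does 2*batch*d_in*d_out FLOPs
# but only has to load the d_in*d_out weights once, however big the batch is.

def flops_per_weight_byte(d_in, d_out, batch, bytes_per_weight=2):
    flops = 2 * d_in * d_out * batch              # multiply-accumulates in the GEMM
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

for batch in (1, 8, 64, 256):
    print(f"batch={batch:3d}  ~{flops_per_weight_byte(4096, 4096, batch):.0f} FLOPs per weight byte")

# Intensity grows linearly with batch size; once it exceeds the GPU's
# FLOPs-per-byte-of-bandwidth ratio, the kernel is compute bound rather than
# memory-bandwidth bound.
```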
> The fact that OpenAI and Anthropic’s models are quick to respond suggests that either:
Is that actually a fact? The post has no numbers on the time to first token for any of the three providers.
Hi, I wrote the post! Also not an ML researcher, just an interested engineer, so I'm sure I got some things wrong.
> MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.
What I meant was that a single user is going to get dramatically worse throughput-per-GPU, because they're not able to reap the benefits of multi-user batching (unless they're somehow issuing massively parallel inference requests, I suppose).
> Is it really that the matrices being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute.
As I understand it, you want larger input matrices precisely in order to move the bottleneck from memory to compute: if you do no batching at all, each multiplication is smaller (the weights are the same, of course, but the next-token activations you're multiplying with the weights are 1 x dim instead of batch_size x dim), so your GPUs are under-utilized and inference spends more time doing memory operations and less time multiplying.
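A toy illustration of that (NumPy on CPU as a rough stand-in for the GPU case, so treat it as a sketch rather than a benchmark):

```python
# Toy demonstration (NumPy on CPU as a stand-in for a GPU GEMM): same total
# work, but the batched version reuses the weights instead of re-reading them.
import time
import numpy as np

dim, batch = 4096, 64
W = np.random.randn(dim, dim).astype(np.float32)    # shared weights
X = np.random.randn(batch, dim).astype(np.float32)  # a batch of activations

start = time.perf_counter()
for i in range(batch):        # unbatched: 64 separate (1 x dim) @ (dim x dim)
    _ = X[i:i + 1] @ W
unbatched = time.perf_counter() - start

start = time.perf_counter()
_ = X @ W                     # batched: one (batch x dim) @ (dim x dim)
batched = time.perf_counter() - start

print(f"unbatched: {unbatched * 1e3:.1f} ms   batched: {batched * 1e3:.1f} ms")
# Same FLOPs either way, but the unbatched loop streams the ~64 MB weight
# matrix through memory `batch` times, so it stays bandwidth bound and slower.
```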
> The post has no numbers on the time to first token for any of the three providers.
I probably should have hunted down specific numbers, but I think people who've played with DeepSeek and other models will notice that DeepSeek is noticeably more sluggish.