Comment by gfysfm

2 months ago

Hi, I wrote the post! I'm also not an ML researcher, just an interested engineer, so I'm sure I got some things wrong.

> MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.

What I meant was that in the single-user scenario you're going to get dramatically worse throughput per GPU, because you can't reap the benefits of multi-user batching (unless you're somehow issuing massively parallel inference requests, I suppose).
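For a rough sense of scale, here's a toy back-of-envelope sketch. All the hardware and model numbers are assumptions I picked for illustration (roughly H100-class bandwidth, a ~37B-active-parameter MoE at FP8), not figures from the post, and it ignores KV-cache traffic and expert-routing imbalance:

```python
# Toy estimate: in the memory-bound decode regime, every step streams the
# active weights from HBM once. A batch of B concurrent users shares that
# same weight read, so tokens/s per GPU scales roughly with B until you
# eventually hit the compute roofline.

hbm_bandwidth_bytes_s = 3.35e12   # ~3.35 TB/s HBM bandwidth (assumed)
active_weight_bytes = 37e9 * 1    # ~37B active params at 1 byte each, FP8 (assumed)

step_time_s = active_weight_bytes / hbm_bandwidth_bytes_s  # one full weight pass

for batch in (1, 8, 64):
    # Assumes we stay memory-bound; real deployments also pay for KV-cache reads.
    print(f"batch={batch:3d}: ~{batch / step_time_s:7.0f} tok/s per GPU")
```

Under those assumptions a single user gets on the order of ~90 tok/s out of the GPU, while a batch of 64 gets the same hardware into the thousands of tok/s, which is the throughput-per-GPU gap I was pointing at.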

> Is it really that the matrixes being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute.

As I understand it, those are the same thing: you want larger input matrices precisely in order to move the bottleneck from memory to compute. If you do no batching at all, your multiplications are smaller (the weights are the same size, of course, but the activations you're multiplying against them are 1 x dim instead of batch-size x dim), so each decode step still streams the full weights from memory while doing far fewer FLOPs with them. The GPU ends up under-utilized, spending more time on memory operations and less time multiplying.
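A quick sketch of that arithmetic-intensity argument, with assumed shapes and hardware numbers (hidden size, BF16 weights, and a hypothetical ~1 PFLOP/s, ~3.35 TB/s GPU are all my own illustrative choices):

```python
# Multiplying a (batch x dim) activation by a (dim x dim) weight matrix does
# 2*batch*dim^2 FLOPs but still only reads ~dim^2 weight elements from memory,
# so FLOPs-per-byte grows with batch size until the kernel becomes compute-bound.

dim = 8192                    # hidden size (assumed)
bytes_per_weight = 2          # BF16 weights (assumed)

peak_flops = 1.0e15           # ~1 PFLOP/s dense BF16 (assumed)
bandwidth = 3.35e12           # ~3.35 TB/s HBM (assumed)
roofline_intensity = peak_flops / bandwidth   # FLOPs/byte needed to be compute-bound

for batch in (1, 8, 64, 512):
    flops = 2 * batch * dim * dim               # one (batch x dim) @ (dim x dim) matmul
    bytes_moved = dim * dim * bytes_per_weight  # weight read dominates at small batch
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity >= roofline_intensity else "memory-bound"
    print(f"batch={batch:4d}: {intensity:6.1f} FLOPs/byte -> {bound}")
```

With these numbers, batch 1 is hopelessly memory-bound and you need batch sizes in the hundreds before the matmul approaches the compute roofline, which is why providers lean so hard on batching many users together.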

> The post has no numbers on the time to first token for any of the three providers.

I probably should have hunted down specific numbers, but I think anyone who's played with DeepSeek alongside other providers will have noticed that it's noticeably more sluggish to start responding.