Comment by freehorse

2 months ago

> mixture of experts requires higher batch sizes

Or Apple silicon at low batch size (ideally 1). Unified memory lets you run larger models at the expense of speed, because the bandwidth/FLOPS are lower than a normal GPU's. But an MoE only computes a small fraction of its parameters per token, so the computational needs are low. I have seen people report decent speeds for DeepSeek single-batch inference on Macs. It is still expensive by many people's standards, though, because it requires a lot of $$$ to get enough memory.
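To see why, here is a rough back-of-envelope sketch: at batch size 1, decoding is memory-bandwidth bound, so the ceiling on tokens/sec is roughly bandwidth divided by the bytes of *active* weights read per token. The specific numbers below (DeepSeek-V3-style ~37B active of ~671B total, ~800 GB/s for an M2 Ultra, 4-bit weights) are my assumptions for illustration, not benchmarks.

```python
def decode_tps_ceiling(active_params: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Theoretical tokens/sec ceiling for bandwidth-bound single-batch decode."""
    bytes_per_token = active_params * bytes_per_param  # weights read per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed, illustrative numbers:
moe_active = 37e9       # ~37B active params per token (DeepSeek-V3-style MoE)
dense_total = 671e9     # a dense model of the same total size reads everything
q4 = 0.5                # ~4-bit quantization -> 0.5 bytes per weight
mac_bw = 800            # GB/s, roughly an M2 Ultra's unified memory bandwidth

print(f"MoE ceiling:   {decode_tps_ceiling(moe_active, q4, mac_bw):.1f} tok/s")
print(f"Dense ceiling: {decode_tps_ceiling(dense_total, q4, mac_bw):.1f} tok/s")
```

Real throughput lands below these ceilings (KV cache reads, compute overhead), but the gap between the two cases is the point: the MoE touches ~18x fewer weights per token, so the same machine that crawls on a 671B dense model can be usable on a 671B MoE.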

In some ways, MoE models are a perfect fit for Macs (or any similar machines that may come out). In contrast, ordering a Mac with upgraded RAM and running dense models that just fit in that memory can be very painful.