Comment by angry_octet

2 months ago

This is a great explainer from an LLM perspective, and it would be interesting to see an in-depth computational-scheduling explanation as well. I presume the hyperscale LLM companies examine their computation traces extensively to identify bottlenecks and idle bubbles, and build load balancers, pipeline architectures and schedulers to optimise the workload.
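
For a feel for how big those idle bubbles can get, here's a minimal sketch using the standard GPipe-style bubble formula; the stage and microbatch counts below are made up for illustration, not taken from any real trace:

    # Toy estimate of the pipeline "bubble" (idle) fraction, using the
    # standard GPipe-style formula: with p pipeline stages and m
    # microbatches, each stage idles for (p - 1) / (m + p - 1) of a step.
    # All numbers here are illustrative.

    def bubble_fraction(stages: int, microbatches: int) -> float:
        return (stages - 1) / (microbatches + stages - 1)

    for p in (4, 8, 16):
        for m in (8, 32, 128):
            print(f"stages={p:2d}  microbatches={m:3d}  "
                  f"idle={bubble_fraction(p, m):.1%}")

The takeaway is that deep pipelines only pay off with many microbatches in flight, which is exactly the kind of thing a scheduler has to balance against latency.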

The batching requirement for efficiency makes high-security applications quite difficult, because the normal technique of isolating unrelated queries becomes very expensive. NVIDIA's vGPU virtualisation time-shares GPU memory, and every switch between tenants pays an unload/reload context switch (and I doubt it does any deduplication). Multi-Instance GPU (MIG) instead splits GPU memory between users, but it is a fixed partitioning scheme (you have to reboot the GPU to change it), and nobody wants to split their 96GB GPU into 4x24GB GPUs.
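
To make the batching pressure concrete, a back-of-envelope sketch, assuming autoregressive decode is bound by streaming the weights from HBM (so the weight stream is paid once per step regardless of batch size); the model size and bandwidth numbers are ballpark assumptions, not measurements:

    # Why batching is so hard to give up: the per-step weight stream from
    # HBM is amortised across the whole batch. This ignores KV-cache
    # traffic and the point where you become compute-bound.

    weight_bytes = 70e9 * 2    # e.g. a 70B-parameter model in fp16
    hbm_bw = 3.35e12           # ~3.35 TB/s, roughly H100-class HBM3

    step = weight_bytes / hbm_bw   # one decode step, any (small) batch size

    for batch in (1, 8, 64):
        print(f"batch={batch:3d}: step {step*1e3:5.1f} ms, "
              f"aggregate {batch/step:8,.0f} tok/s")

Batch 1 and batch 64 take roughly the same wall-clock per step, so isolating each query onto its own pass costs you most of your throughput.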

Makes me wonder what the tradeoff is for putting second-level memory (i.e. ordinary DRAM) on the GPU board, so that different matrix data can be loaded in faster than over PCIe; the HBM effectively becomes a cache.
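
Some rough arithmetic on that tradeoff; the bandwidth figures are assumptions in the right ballpark rather than measurements, and the on-board DDR5 tier is the hypothetical second-level memory:

    # Back-of-envelope: time to fault 10 GB of matrix data into HBM from
    # each tier. Bandwidths are ballpark for current hardware.

    tiers = {
        "PCIe 5.0 x16 from host DRAM": 64e9,     # ~64 GB/s
        "hypothetical on-board DDR5":  300e9,    # a few hundred GB/s
        "already resident in HBM3":    3.35e12,  # ~3.35 TB/s
    }
    chunk = 10e9  # 10 GB of weights to swap in

    for name, bw in tiers.items():
        print(f"{name:30s} {chunk / bw * 1e3:8.1f} ms")

On those numbers the on-board tier cuts a ~150 ms PCIe reload down to ~30 ms, which is the difference between an unusable and a merely painful context switch.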

(I'm also really liking the honesty in the author's book on Software Engineering: not software engineering in the dry IEEE sense, but as a survival guide for a large enterprise. https://www.seangoedecke.com/book/ )