Comment by atty

7 days ago

The idea that a GPU is dedicated to a single inference task is just generally incorrect. Inputs are batched, and it's not a single GPU handling a single request; it's a handful of GPUs in various parallelism schemes processing a batch of requests at once. There's a latency vs. throughput trade-off that operators make: the larger the batch size, the greater the per-request latency, but it improves overall cluster throughput.
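
To make the trade-off concrete, here is a minimal toy sketch (not any real serving framework, and the timing numbers are made-up assumptions): each batched forward pass is modeled as a fixed setup cost plus a small per-request increment, so larger batches amortize the fixed cost (better throughput) while every request waits for the whole pass (worse latency).

```python
# Toy model of batched inference timing. All constants are illustrative
# assumptions, not measurements from any particular GPU or model.

def simulate(batch_size: int, num_requests: int = 1024,
             setup_ms: float = 20.0, per_request_ms: float = 1.0):
    """Return (avg per-request latency in ms, throughput in requests/sec)."""
    num_batches = (num_requests + batch_size - 1) // batch_size
    # Each forward pass pays a fixed setup cost plus a per-request cost.
    batch_time_ms = setup_ms + per_request_ms * batch_size
    total_time_ms = num_batches * batch_time_ms
    # Every request in a batch waits for the full batch to finish
    # (ignoring queueing delay while the batch fills, which adds more).
    avg_latency_ms = batch_time_ms
    throughput_rps = num_requests / (total_time_ms / 1000.0)
    return avg_latency_ms, throughput_rps

if __name__ == "__main__":
    for bs in (1, 8, 32, 128):
        latency, throughput = simulate(bs)
        print(f"batch={bs:4d}  latency~{latency:6.1f} ms  "
              f"throughput~{throughput:8.1f} req/s")
```

Running this shows throughput climbing steeply with batch size while per-request latency also grows, which is exactly the knob operators tune.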