Comment by echelon
9 hours ago
8xH100 is pretty wild for a single inference node.
Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?
At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700-ish requests per hour. About $0.01 per request.
Is my math wrong?
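For what it's worth, a quick sketch of that estimate in Python, using the ~$8/hr and 5 s figures from the comment (both assumptions, not measured numbers) and treating requests as strictly sequential:

    # Back-of-the-envelope cost per request, using the figures above:
    #   - the node rents for ~$8/hr
    #   - one request takes ~5 seconds, served one at a time
    node_cost_per_hour = 8.00        # USD/hr
    seconds_per_request = 5          # wall-clock time per request

    requests_per_hour = 3600 / seconds_per_request              # 720
    cost_per_request = node_cost_per_hour / requests_per_hour   # ~0.011

    print(f"{requests_per_hour:.0f} requests/hour")
    print(f"${cost_per_request:.3f} per request")

That works out to 720 requests per hour and about $0.011 per request, so the arithmetic holds if the node handles one request at a time.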
Comment by vessenes
This is the spec for a training node. Inference requires 80 GB of VRAM, so significantly less compute.
As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.
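To make the parallelism point concrete, here is the same estimate with requests batched; the batch size of 32 is purely illustrative, not a benchmarked figure:

    # Same estimate, but assuming the node serves several requests
    # concurrently via batching. The batch size below is a hypothetical
    # value chosen only to illustrate the effect.
    node_cost_per_hour = 8.00        # USD/hr
    seconds_per_request = 5          # per-request latency, as before
    concurrent_requests = 32         # hypothetical batch size

    requests_per_hour = (3600 / seconds_per_request) * concurrent_requests
    cost_per_request = node_cost_per_hour / requests_per_hour

    print(f"{requests_per_hour:.0f} requests/hour")    # 23040
    print(f"${cost_per_request:.5f} per request")      # ~$0.00035

Under that assumption the per-request cost drops well below a cent, so the sequential estimate above is a worst case.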