Comment by echelon
9 hours ago
8xH100 is pretty wild for a single inference node.
Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?
At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700-ish requests per hour. About $0.01 per request.
Is my math wrong?
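For what it's worth, a quick sketch of that estimate in Python, using the ~$8/hr and 5 s figures from the comment (both assumptions, not measured numbers) and treating requests as strictly sequential:

    # Back-of-the-envelope cost per request, using the figures above:
    #   - the node rents for ~$8/hr
    #   - one request takes ~5 seconds, served one at a time
    node_cost_per_hour = 8.00        # USD/hr
    seconds_per_request = 5          # wall-clock time per request

    requests_per_hour = 3600 / seconds_per_request              # 720
    cost_per_request = node_cost_per_hour / requests_per_hour   # ~0.011

    print(f"{requests_per_hour:.0f} requests/hour")
    print(f"${cost_per_request:.3f} per request")

That works out to 720 requests per hour and about $0.011 per request, so the arithmetic holds if the node handles one request at a time.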
Comment by vessenes
This is the spec for a training node. Inference requires 80 GB of VRAM, so significantly less compute.
As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.
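To make the parallelism point concrete, here is the same estimate with requests batched; the batch size of 32 is purely illustrative, not a benchmarked figure:

    # Same estimate, but assuming the node serves several requests
    # concurrently via batching. The batch size below is a hypothetical
    # value chosen only to illustrate the effect.
    node_cost_per_hour = 8.00        # USD/hr
    seconds_per_request = 5          # per-request latency, as before
    concurrent_requests = 32         # hypothetical batch size

    requests_per_hour = (3600 / seconds_per_request) * concurrent_requests
    cost_per_request = node_cost_per_hour / requests_per_hour

    print(f"{requests_per_hour:.0f} requests/hour")    # 23040
    print(f"${cost_per_request:.5f} per request")      # ~$0.00035

Under that assumption the per-request cost drops well below a cent, so the sequential estimate above is a worst case.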