Comment by zozbot234

1 day ago

> the difference is pretty categorical between what can run on a single card and what can run on a DGX GB200 NVL72 cabinet.

A better way of putting it is that you can run plenty of things on a single ordinary system, but you may be disappointed by the performance. In general, you can't expect local inference on SOTA-like models to be as quick as in the cloud. You have to run smaller models when you need quick replies, and reserve large models with a lot of real-world knowledge for less time-critical inference, possibly batching many requests together to improve throughput.
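
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch in Python. It assumes decoding is purely memory-bandwidth-bound, i.e. every decoding step streams all the weights from memory once, and it ignores KV-cache traffic and compute limits. The function names and the example figures (a 35 GB model, 100 GB/s of memory bandwidth) are made up for illustration, not measurements of any real system.

```python
def per_request_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Rough decode speed seen by a single request.

    Assumption: each decoding step reads all model weights from memory once,
    so one step takes about model_bytes / bandwidth seconds.
    """
    if model_bytes <= 0 or bandwidth_bytes_per_sec <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_bytes_per_sec / model_bytes


def aggregate_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float,
                             batch_size: int) -> float:
    """Rough total throughput when batch_size requests share each decoding step.

    Assumption: the weights are read once per step regardless of batch size,
    so every step yields batch_size tokens for the same memory traffic.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size * per_request_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec)


if __name__ == "__main__":
    # Hypothetical figures: 35 GB of weights, 100 GB/s memory bandwidth.
    model_bytes = 35e9
    bandwidth = 100e9
    for batch in (1, 8, 32):
        total = aggregate_tokens_per_sec(model_bytes, bandwidth, batch)
        single = per_request_tokens_per_sec(model_bytes, bandwidth)
        print(f"batch={batch:>2}: ~{single:.1f} tok/s per request, ~{total:.1f} tok/s total")
```

Under this simplified model each request is no faster when batched, but total throughput grows with batch size, which is why batching suits the less time-critical workloads. In practice the gain tapers off once the hardware becomes compute-bound or runs out of memory for the KV cache.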