
Comment by pama

1 year ago

What is the fastest documented way so far to serve the full R1 or V3 models (Q8, not Q4) if the main purpose is inference with many parallel queries and maximizing the total tokens per sec? Did anyone document and benchmark efficient distributed service setups?

The top comment in this thread mentions a 6k setup, which could likely run vLLM with some more tinkering. AFAIK vLLM's batched inference is great.
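
Not something from the thread, but a rough sketch of what multi-GPU batched serving with vLLM could look like; the model id, tensor-parallel size, and context cap below are assumptions rather than a benchmarked config (and the full Q8 weights won't fit on a single 8x80GB node anyway, so in practice you'd add pipeline parallelism across nodes):

```python
# Hypothetical sketch of vLLM offline batched serving across one node's GPUs.
# Model id, parallelism degree, and context length are placeholders, not a tested setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed HF model id
    tensor_parallel_size=8,           # shard weights across 8 GPUs on this node
    max_model_len=8192,               # cap context to bound KV-cache memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Summarize item {i}" for i in range(64)]  # many parallel queries
outputs = llm.generate(prompts, params)  # continuous batching packs them together
for out in outputs:
    print(out.outputs[0].text[:80])
```

The many-parallel-queries case is exactly what vLLM's continuous batching is built for: concurrent requests share the same forward passes instead of running one at a time, which is what pushes total tokens/sec up.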

You need enough VRAM to hold the whole model plus context, so probably a bunch of H100s or MI300s.
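
Back-of-envelope on that, assuming ~671B total parameters for V3/R1 and 1 byte per weight at Q8 (the overhead figure is a guess):

```python
# Rough VRAM estimate; parameter count and overhead are loose assumptions.
params_total = 671e9          # ~671B total parameters (DeepSeek-V3/R1)
bytes_per_param = 1           # Q8 / FP8 -> ~1 byte per weight
weights_gb = params_total * bytes_per_param / 1e9     # ~671 GB of weights alone

gpu_vram_gb = 80              # H100 80GB; an MI300X has 192 GB
kv_and_overhead_gb = 200      # guess: KV cache for many long concurrent requests + activations

total_gb = weights_gb + kv_and_overhead_gb
print(f"~{total_gb:.0f} GB needed -> at least {total_gb / gpu_vram_gb:.0f} x 80GB GPUs")
# -> on the order of 11+ H100s, i.e. two 8-GPU nodes in practice,
#    or a single 8x MI300X box (8 x 192 GB = 1536 GB).
```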