
Comment by pama

1 year ago

What is the fastest documented way so far to serve the full R1 or V3 models (Q8, not Q4) if the main purpose is inference with many parallel queries and maximizing the total tokens per sec? Did anyone document and benchmark efficient distributed service setups?

The top comment in this thread mentions a 6k setup, which could likely run vLLM with some more tinkering. AFAIK vLLM's batched inference is great.
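
Not something from the thread, but a rough sketch of what multi-GPU batched serving with vLLM could look like; the model id, tensor-parallel size, and context cap below are assumptions rather than a benchmarked config (and the full Q8 weights won't fit on a single 8x80GB node anyway, so in practice you'd add pipeline parallelism across nodes):

```python
# Hypothetical sketch of vLLM offline batched serving across one node's GPUs.
# Model id, parallelism degree, and context length are placeholders, not a tested setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed HF model id
    tensor_parallel_size=8,           # shard weights across 8 GPUs on this node
    max_model_len=8192,               # cap context to bound KV-cache memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Summarize item {i}" for i in range(64)]  # many parallel queries
outputs = llm.generate(prompts, params)  # continuous batching packs them together
for out in outputs:
    print(out.outputs[0].text[:80])
```

The many-parallel-queries case is exactly what vLLM's continuous batching is built for: concurrent requests share the same forward passes instead of running one at a time, which is what pushes total tokens/sec up.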

You need enough VRAM to hold the whole model plus context, so probably a bunch of H100s or MI300s.
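
Back-of-envelope on that, assuming ~671B total parameters for V3/R1 and 1 byte per weight at Q8 (the overhead figure is a guess):

```python
# Rough VRAM estimate; parameter count and overhead are loose assumptions.
params_total = 671e9          # ~671B total parameters (DeepSeek-V3/R1)
bytes_per_param = 1           # Q8 / FP8 -> ~1 byte per weight
weights_gb = params_total * bytes_per_param / 1e9     # ~671 GB of weights alone

gpu_vram_gb = 80              # H100 80GB; an MI300X has 192 GB
kv_and_overhead_gb = 200      # guess: KV cache for many long concurrent requests + activations

total_gb = weights_gb + kv_and_overhead_gb
print(f"~{total_gb:.0f} GB needed -> at least {total_gb / gpu_vram_gb:.0f} x 80GB GPUs")
# -> on the order of 11+ H100s, i.e. two 8-GPU nodes in practice,
#    or a single 8x MI300X box (8 x 192 GB = 1536 GB).
```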