Comment by WildGreenLeave

16 hours ago

Correct me if I'm wrong, but if you run multiple inferences at the same time on the same GPU, you will need to load multiple models into VRAM and the models will fight for resources, right? So running 10 parallel inferences will slow everything down 5 times, right? Or am I missing something?

Inference for a single example is memory-bound. You only need one copy of the model in VRAM: a single batched forward pass serves all the requests at once. By batching, you amortize the cost of loading the weights from memory across many examples, without losing much speed (up until you cross the compute-bound threshold).
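A minimal sketch of what that looks like with the Hugging Face transformers API; the model name (gpt2), prompts, and generation settings here are just placeholders:

```python
# Batched generation: one set of weights in VRAM, many prompts per forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Classify as product or service: 20 bottles of ferric chloride ->",
    "Classify as product or service: salesforce ->",
    # ... more prompts; the whole batch shares one pass through the weights
]

# The weights are read from VRAM once per layer and reused for every prompt
# in the batch, which is why throughput scales well until the GPU is compute bound.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model.generate(
        **inputs, max_new_tokens=5, pad_token_id=tokenizer.eos_token_id
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```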

No, the key is to use the full context window, so you structure the prompt as something like: "For each line below, repeat the line, add a comma, then output whether it most closely represents a product or a service:"

20 bottles of ferric chloride

salesforce

...
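A rough sketch of packing many rows into one prompt this way; the item list and the exact instruction wording are illustrative:

```python
# Build a single prompt that classifies every row in one call.
items = [
    "20 bottles of ferric chloride",
    "salesforce",
    "annual HVAC maintenance contract",
    # ... as many rows as fit comfortably in the context window
]

prompt = (
    "For each line below, repeat the line, add a comma, then output whether "
    "it most closely represents a product or a service:\n\n"
    + "\n".join(items)
)

# One request now covers every row, so the per-item overhead
# (prompt prefix, request latency) is paid once instead of N times.
print(prompt)
```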