← Back to context

Comment by zozbot234

15 hours ago

You can definitely run many requests in parallel as a single user, you just have to be OK with a significant slowdown for any single request. Cloud inference can't reach that ratio of total throughput per hardware cost since they are heavily incented to get the most expensive hardware available and to then minimize latency (and RAM occupation over time) even at the cost of throughput. Running slower inference with cheaper hardware is just not workable in a cloud setting.