Comment by zozbot234
8 hours ago
You can batch requests when running locally too, provided you're using a model whose KV-cache requirements are low enough; essentially you're targeting the same resource efficiencies that the big providers rely on. This is useful because it gives you extra compute throughput "for free" during decode, even on very limited hardware.
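To make the "for free" claim concrete: decode is typically memory-bandwidth bound, since every step has to stream the model weights regardless of batch size, while the marginal compute per extra sequence is small. Here's a toy back-of-the-envelope model of that effect; the millisecond figures are illustrative assumptions, not measurements of any particular model or hardware.

```python
# Toy model of why batched decode gives near-free throughput:
# each decode step pays a fixed cost to stream the weights from memory,
# plus a small marginal compute cost per sequence in the batch.
# All numbers below are made-up illustrative values.

def decode_step_ms(batch_size,
                   weight_read_ms=10.0,     # fixed: stream weights once per step
                   per_seq_compute_ms=0.2): # marginal: per sequence in the batch
    return weight_read_ms + batch_size * per_seq_compute_ms

def tokens_per_second(batch_size):
    # Each step emits one token per sequence in the batch.
    return batch_size * 1000.0 / decode_step_ms(batch_size)

if __name__ == "__main__":
    for b in (1, 4, 8):
        print(f"batch={b}: {tokens_per_second(b):.1f} tok/s")
```

Under these assumptions, going from batch 1 to batch 8 multiplies aggregate throughput several times over while each step barely slows down, which is the effect batching servers (local or hosted) exploit.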
That’s still orders of magnitude less efficient, and it's also not how most people use AI, or are likely to use it.