Comment by kj4ips

7 days ago

TL;DR: It's massively easier to run a few models really fast than it is to run many different models acceptably.

They probably are using some interesting hardware, but there's a strange economy of scale in serving lots of requests for a small number of models. Regardless of whether you are running a single GPU, clustered GPUs, FPGAs, or ASICs, there is a cost to initializing the model that dwarfs the cost of running inference on it by many orders of magnitude.
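To put very rough, made-up numbers on that (nothing below is measured from any real system, it's just the shape of the problem):

    # Hypothetical: weights must be streamed into accelerator memory
    # before the first token can come out.
    weights_gb  = 140.0   # e.g. ~70B params at fp16 (assumption)
    load_gb_s   = 2.0     # sustained storage -> accelerator bandwidth (assumption)
    load_s      = weights_gb / load_gb_s        # ~70 s just to get ready

    tok_per_s   = 50.0    # steady-state decode speed (assumption)
    per_token_s = 1.0 / tok_per_s               # ~0.02 s per token

    print(f"load: {load_s:.0f}s, per token: {per_token_s*1000:.0f}ms "
          f"(~{load_s / per_token_s:.0f}x)")    # loading costs ~3500 tokens' worth of work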

If you build a workstation with enough accelerator-accessible memory to get "good" performance on a larger model, but only drive it with typical user access patterns, that hardware will sit idle the vast majority of the time. And if you switch between models for different tasks, each switch incurs a load penalty and may evict other models, which you then have to load in again.
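Here is a minimal sketch of that workstation situation, assuming a fixed pool of accelerator memory and simple LRU eviction; the model names, sizes, and load rates are invented for illustration:

    # Illustrative only: models as (name, GB) pairs, one machine, LRU eviction.
    from collections import OrderedDict

    MEM_GB        = 80            # accelerator-accessible memory (assumption)
    LOAD_S_PER_GB = 0.5           # load penalty per GB (assumption)

    resident = OrderedDict()      # name -> size_gb, kept in LRU order

    def use(model, size_gb):
        """Return seconds spent (re)loading before this request could run."""
        if model in resident:
            resident.move_to_end(model)
            return 0.0
        # Evict least-recently-used models until the new one fits.
        while resident and sum(resident.values()) + size_gb > MEM_GB:
            resident.popitem(last=False)
        resident[model] = size_gb
        return size_gb * LOAD_S_PER_GB

    # Switching between a coding model and a chat model all day:
    for m, gb in [("code", 60), ("chat", 60), ("code", 60), ("chat", 60)]:
        print(m, "load penalty:", use(m, gb), "s")   # every switch pays ~30 s again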

However, if you build an inference farm, you are likely working with only a few models (possibly with some dynamic weight shifting[1]), and there are already some number of ready instances of each, so the load cost is only incurred when scaling a given model up or down.
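The farm case looks more like this sketch: a warm pool of replicas per model, where requests land on already-loaded instances and the cold-start cost only shows up on the scaling path (again, all names and numbers invented):

    # Illustrative: warm replica pools per model.
    LOAD_S = 300                            # hypothetical cold-start cost per replica

    pools = {"model-a": 8, "model-b": 2}    # replicas already warm and serving

    def serve(model):
        # A ready replica handles the request; no load cost on this path.
        return 0

    def scale(model, delta):
        pools[model] = max(0, pools.get(model, 0) + delta)
        return LOAD_S * max(0, delta)       # the cold start is only paid when adding replicas

    print(serve("model-a"))       # 0 s of load cost per request
    print(scale("model-b", +2))   # 600 s, paid once, amortized over everything after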

I've had the pleasure of working with some folks on provisioning an FPGA+ASIC based appliance, and it can produce a mind-boggling number of tokens/sec, but it takes 30+ minutes to load a model.
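To see what a 30-minute load implies, a hypothetical amortization (the throughput and overhead target are made up): the box has to stay busy on that one model for a long time before the load becomes a rounding error.

    # Hypothetical amortization of a 30-minute model load.
    load_s          = 30 * 60      # 1800 s to load (from the anecdote above)
    tok_per_s       = 100_000      # made-up appliance throughput
    target_overhead = 0.01         # want loading to be <= 1% of total time

    busy_s = load_s * (1 - target_overhead) / target_overhead   # ~178,000 s
    print(f"{busy_s / 3600:.0f} hours of solid inference "
          f"(~{busy_s * tok_per_s:.2e} tokens) before the load is 1% overhead")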

[1] there was a neat paper at SC a few years ago about that, but I can't find it now