Comment by stefan_

18 hours ago

The primary (non-malicious, non-stupid) explanation given here is batching. But I think if you looked at large-scale inference you would find the batch sizes being run on any given rig are fairly static - for any given model part run individually there is a sweet spot between memory consumption and GPU utilization, and GPUs generally do badly at job parallelism.
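To make the "fairly static" point concrete, here is a rough back-of-envelope sketch (all numbers are illustrative assumptions I'm picking for the example, not measured figures): the KV cache grows linearly with the number of concurrent requests, so for a fixed model shape and a fixed amount of HBM the maximum batch is more or less pinned by the hardware.

    # Back-of-envelope sketch of why batch size on a fixed rig is roughly pinned:
    # KV cache grows linearly with concurrent requests, so HBM caps it.
    # All numbers below are illustrative assumptions, not real specs.

    def kv_cache_bytes_per_request(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
        # Keys + values, for every layer, for every token of the context window.
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    # Assumed model shape (vaguely 70B-class) on an assumed 80 GB accelerator,
    # with fp16 weights sharded 8 ways.
    per_request = kv_cache_bytes_per_request(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=8192)
    hbm_bytes = 80e9
    weight_bytes_per_shard = 70e9 * 2 / 8
    free_for_kv = hbm_bytes - weight_bytes_per_shard

    print(f"KV cache per request: {per_request / 1e9:.2f} GB")
    print(f"Max concurrent requests on this shard: {int(free_for_kv // per_request)}")

With these made-up numbers you land at a couple dozen concurrent requests per shard, and that ceiling doesn't move unless the hardware, model, or context length changes - which is the point.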

I think the more likely explanation lies, again, with the extremely heterogeneous compute platforms they run on.

Why do you think batching has anything to do with the model getting dumber? Do you know what batching means?

That's why I'd love to get stats on the load/hardware/location of where my inference is running. Looking at you, Trainium.