Comment by DoctorOetker
6 days ago
Due to batching, inference is profitable, very profitable.
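Back-of-envelope: decode is memory-bandwidth bound, since every token requires streaming the full weights from HBM once, and batching shares that sweep across all concurrent requests. A toy sketch of the amortization (all numbers are illustrative assumptions, not anyone's real figures):

    # Toy decode-cost model: one weight sweep per token is amortized over
    # the batch, while KV-cache reads stay per-request. All constants are
    # made-up assumptions for illustration.
    WEIGHT_BYTES = 140e9      # assumed ~70B-param model in fp16
    HBM_BANDWIDTH = 3.0e12    # assumed aggregate HBM bandwidth, bytes/s
    KV_BYTES_PER_REQ = 1e9    # assumed KV-cache read per request per token

    def cost_per_token(batch_size: int) -> float:
        weight_time = WEIGHT_BYTES / HBM_BANDWIDTH               # shared by batch
        kv_time = batch_size * KV_BYTES_PER_REQ / HBM_BANDWIDTH  # per-request
        return (weight_time + kv_time) / batch_size              # s per request

    for b in (1, 8, 64):
        print(f"batch={b}: {cost_per_token(b) * 1e3:.2f} ms/token/request")

With these toy numbers the per-request cost drops from ~47 ms at batch 1 to ~1 ms at batch 64, which is the whole profitability argument in miniature.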
Yet, undoubtedly, they are reporting what is declared a loss.
But is it really a loss?
If you buy an asset, is that automatically a loss, or is it an investment?
By "running at a loss", one can build a huge dataset to stay in the running.
How batched can it really be, though, if every request is personalised to the user with Memory?
Imagine pipelining lots of infra-scale GPUs. Naive inference would need all previous tokens to be shifted "left", i.e. from the append-head toward the end-of-memory "tail", on every step, which would mean a huge amount of data flow for the whole KV cache. Instead of permanently calling GPU 1 the end-of-memory and GPU N the append-head, you keep the data static and let the roles rotate like a circular buffer: on each new token inference round, the previous round's end-of-memory GPU becomes the new append-head GPU. The highest-bandwidth data movement is no movement at all: keep the data static.
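A minimal sketch of that rotating-role idea (hypothetical code, not any real serving stack): each "GPU" keeps its KV slice in place, and only a head pointer advances around the ring.

    # Hypothetical sketch: N pipeline GPUs each hold a fixed slice of the
    # KV cache; the "append-head" role rotates around the ring instead of
    # KV data shifting from head to tail every round.
    class KVRing:
        def __init__(self, num_gpus: int):
            self.num_gpus = num_gpus
            self.slices = [[] for _ in range(num_gpus)]  # per-GPU KV storage
            self.head = num_gpus - 1                     # current append-head

        def append(self, kv_entry) -> None:
            # The previous round's end-of-memory GPU becomes the new
            # append-head: advance the role pointer, write locally,
            # and move no existing data anywhere.
            self.head = (self.head + 1) % self.num_gpus
            self.slices[self.head].append(kv_entry)

        def ordered(self) -> list:
            # Recover arrival order: token t lives on GPU t % num_gpus
            # at local index t // num_gpus.
            out, t = [], 0
            while True:
                g, i = t % self.num_gpus, t // self.num_gpus
                if i >= len(self.slices[g]):
                    return out
                out.append(self.slices[g][i])
                t += 1

Attention still has to read the whole ring each round, but each GPU only reads its own local slice; the only thing that changes per round is which index counts as the head.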