Comment by DoctorOetker
3 days ago
Imagine pipelining lots of infra-scale GPUs: naive inference would need all previous tokens to be shifted "left", i.e. from the append-head toward the end-of-memory "tail", which would require a huge amount of data movement for the whole KV cache. Instead of fixing GPU 1 as the end-of-memory and GPU N as the append-head, you keep the data static and let the roles rotate like a circular buffer. So for each new token's inference round, the previous round's end-of-memory GPU becomes the new append-head GPU. The highest bandwidth is keeping data static.
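A minimal sketch of the circular-buffer idea, in plain Python rather than at GPU scale: entries stay in their slots and only a head index rotates, so appending a new token never moves existing data. The class and names here are illustrative, not from any actual inference stack.

```python
class RingKVCache:
    """Toy ring buffer: data stays put, the append-head role rotates."""

    def __init__(self, capacity):
        self.slots = [None] * capacity   # KV entries never move between slots
        self.head = 0                    # slot currently playing the append-head role
        self.count = 0

    def append(self, kv_entry):
        # Write into the current append-head slot; nothing is shifted.
        self.slots[self.head] = kv_entry
        # Rotate the role: the next slot becomes the new append-head.
        self.head = (self.head + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def in_order(self):
        # Recover logical (oldest -> newest) order from the rotated layout.
        start = (self.head - self.count) % len(self.slots)
        return [self.slots[(start + i) % len(self.slots)]
                for i in range(self.count)]


cache = RingKVCache(capacity=4)
for token in ["t0", "t1", "t2", "t3", "t4"]:
    cache.append(token)
print(cache.in_order())   # ['t1', 't2', 't3', 't4'] -- t0 overwritten in place
```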