Comment by DoctorOetker
3 days ago
Imagine pipelining lots of infra-scale GPUs: naive inference would need all previous tokens to be shifted "left", i.e. from the append-head toward the end-of-memory "tail", which would require a huge amount of data movement for the whole KV cache. Instead of fixing GPU 1 as the end-of-memory and GPU N as the append-head, you keep the data static and let the roles rotate like a circular buffer. So for each new token's inference round, the previous round's end-of-memory GPU becomes the new append-head GPU. The highest bandwidth is keeping data static.
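A minimal sketch of the circular-buffer idea, in plain Python rather than at GPU scale: entries stay in their slots and only a head index rotates, so appending a new token never moves existing data. The class and names here are illustrative, not from any actual inference stack.

```python
class RingKVCache:
    """Toy ring buffer: data stays put, the append-head role rotates."""

    def __init__(self, capacity):
        self.slots = [None] * capacity   # KV entries never move between slots
        self.head = 0                    # slot currently playing the append-head role
        self.count = 0

    def append(self, kv_entry):
        # Write into the current append-head slot; nothing is shifted.
        self.slots[self.head] = kv_entry
        # Rotate the role: the next slot becomes the new append-head.
        self.head = (self.head + 1) % len(self.slots)
        self.count = min(self.count + 1, len(self.slots))

    def in_order(self):
        # Recover logical (oldest -> newest) order from the rotated layout.
        start = (self.head - self.count) % len(self.slots)
        return [self.slots[(start + i) % len(self.slots)]
                for i in range(self.count)]


cache = RingKVCache(capacity=4)
for token in ["t0", "t1", "t2", "t3", "t4"]:
    cache.append(token)
print(cache.in_order())   # ['t1', 't2', 't3', 't4'] -- t0 overwritten in place
```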