Comment by gaeld
31 minutes ago
Totally, though DTP is not required for these kinds of speeds. Standard TP also works.
DTP is something we built for our roadmap to reach extremely high speeds (like 10k+ tokens/s). When the budget is under 10 µs per layer, every bit of overhead matters.
For 1k to 5k tokens/s, regular TP still works because we can get the inter-GPU all-reduce collectives under 3 µs, which lets us keep streaming model weights into shared memory, registers and caches while the GPUs exchange data.
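To make the budgets above concrete, here is a rough back-of-the-envelope sketch, not the actual scheduling model. It assumes single-stream decoding where layers run one after another, so the time per layer is simply the time per token divided by the layer count. The layer count (32) and the function names are hypothetical, chosen for illustration only; the 3 µs all-reduce figure is the one quoted above.

```python
def per_layer_budget_us(tokens_per_s: float, num_layers: int) -> float:
    """Time available per layer for one decoded token, in microseconds.

    Assumes layers execute sequentially for a single token stream.
    """
    time_per_token_us = 1e6 / tokens_per_s
    return time_per_token_us / num_layers


def allreduce_share(tokens_per_s: float, num_layers: int,
                    allreduce_us: float = 3.0) -> float:
    """Fraction of the per-layer budget taken by one all-reduce,
    if it were NOT overlapped with anything else."""
    return allreduce_us / per_layer_budget_us(tokens_per_s, num_layers)


if __name__ == "__main__":
    NUM_LAYERS = 32  # hypothetical model depth, not from the comment
    for rate in (1_000, 5_000, 10_000):
        budget = per_layer_budget_us(rate, NUM_LAYERS)
        share = allreduce_share(rate, NUM_LAYERS)
        print(f"{rate:>6} tok/s: {budget:6.3f} µs/layer, "
              f"one 3 µs all-reduce = {share:.0%} of the budget")
```

Under these assumptions the per-layer budget shrinks from about 31 µs at 1k tokens/s to about 3 µs at 10k tokens/s, which is why a fixed few-microsecond collective goes from negligible to dominant unless it is hidden behind weight streaming.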