Comment by ladberg
4 days ago
Given that the compiled version is slower than the eager version on A100, there's definitely something suboptimal happening there
4 days ago
Given that the compiled version is slower than the eager version on A100, there's definitely something suboptimal happening there
No, the compiled version is actually faster.
From that table, the A100 tok/sec (larger is faster) numbers are:
- Eager: 28
- Compiled: 128
And
- KV cache eager: 26
- KV cache compiled: 99
The KV cache is likely slower here because the code isn't GPU-optimized. On CPU, the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`ting them on the fly.
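As a rough sketch of what I mean by pre-allocating: the class name `PreallocatedKVCache`, the helper `cat_update`, and the `(batch, heads, seq_len, head_dim)` layout below are all illustrative assumptions, not the code behind the numbers above. The idea is just that the buffers are created once on the device and filled in place, whereas the `torch.cat` variant allocates a new, larger tensor and copies the whole cache on every step.

```python
import torch


class PreallocatedKVCache:
    """Hypothetical fixed-size key/value buffers that are filled in place."""

    def __init__(self, batch_size, num_heads, max_len, head_dim,
                 device="cpu", dtype=torch.float32):
        shape = (batch_size, num_heads, max_len, head_dim)
        # Allocate once, directly on the target device.
        self.k = torch.zeros(shape, device=device, dtype=dtype)
        self.v = torch.zeros(shape, device=device, dtype=dtype)
        self.pos = 0  # number of token positions filled so far

    def update(self, k_new, v_new):
        # k_new, v_new: (batch_size, num_heads, num_new_tokens, head_dim)
        end = self.pos + k_new.shape[2]
        if end > self.k.shape[2]:
            raise ValueError("KV cache is full")
        # Write into the existing buffers instead of allocating new ones.
        self.k[:, :, self.pos:end] = k_new
        self.v[:, :, self.pos:end] = v_new
        self.pos = end
        # Return views over the filled part of the cache.
        return self.k[:, :, :end], self.v[:, :, :end]


def cat_update(k_cache, v_cache, k_new, v_new):
    """The on-the-fly approach: each call allocates and copies a larger tensor."""
    if k_cache is None:
        return k_new, v_new
    return (torch.cat([k_cache, k_new], dim=2),
            torch.cat([v_cache, v_new], dim=2))
```

Both versions hand back the same keys and values after each step, so the pre-allocated one can be swapped in without changing the attention code. On a GPU you would pass `device="cuda"` when constructing the cache, and `max_len` has to cover the longest sequence you plan to generate.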
Ah yep read the labels backwards and meant that - ty for catching and for the explanation