Comment by ladberg

6 months ago

Given that the compiled version is slower than the eager version on A100, there's definitely something suboptimal happening there.

2 comments

ModelForge 6 months ago

No, the compiled version is actually faster.

From that table, the A100 tok/sec (larger is faster) numbers are:

- Eager: 28

- Compiled: 128

And

- KV cache eager: 26

- KV cache compiled: 99

The KV cache version is likely slower on the A100 because it's not GPU-optimized code. On CPU, the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`-ing them on the fly.
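To illustrate the pre-allocation idea, here is a minimal sketch, not the code from the benchmark. The class name `PreallocatedKVCache`, the `max_len` parameter, and the `(batch, heads, seq, head_dim)` layout are all assumptions made up for this example. It allocates the key/value buffers once and writes new tokens into slices, rather than growing the tensors with `torch.cat` at every step.

```python
import torch


class PreallocatedKVCache:
    """Hypothetical helper: fixed-size K/V buffers that are written in place."""

    def __init__(self, batch, heads, max_len, head_dim,
                 device="cpu", dtype=torch.float32):
        # Allocate the full buffers once, directly on the target device.
        shape = (batch, heads, max_len, head_dim)
        self.k = torch.zeros(shape, device=device, dtype=dtype)
        self.v = torch.zeros(shape, device=device, dtype=dtype)
        self.pos = 0  # number of token positions filled so far

    def update(self, k_new, v_new):
        # k_new, v_new are assumed to be (batch, heads, new_tokens, head_dim).
        n = k_new.shape[2]
        end = self.pos + n
        if end > self.k.shape[2]:
            raise ValueError("KV cache overflow: increase max_len")
        # In-place slice writes; no new tensor is allocated per step.
        self.k[:, :, self.pos:end] = k_new
        self.v[:, :, self.pos:end] = v_new
        self.pos = end
        # Return views over the filled part of the buffers.
        return self.k[:, :, :end], self.v[:, :, :end]
```

You would construct it with something like `device="cuda" if torch.cuda.is_available() else "cpu"` and call `update` once per decoding step. The returned tensors hold the same values a `torch.cat`-based cache would produce; whether this actually closes the gap on the A100 would still need to be measured.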

ladberg 6 months ago

Ah yep, I read the labels backwards and meant that - ty for catching it and for the explanation.