Comment by novaRom

1 day ago

> Do you have a guess why your code is so much slower than torch? I didn't look, but there must be no reason to have 2x slower code esp. for a simple grid of FMAs.

Yes: it runs many separate kernels instead of fusing them aggressively the way PyTorch does with torch.compile. Each pass (norm, matmul, residual, RoPE, etc.) launches its own kernel, which adds launch overhead and extra round trips to global memory for the intermediates. cuBLAS helps with the matmuls, but that is not enough to compensate.
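To make the fusion point concrete, here is a rough CPU-side sketch in plain Python, not the actual GPU code and not what torch.compile generates. It compares a residual add followed by RMSNorm done as two passes against one fused pass. The `Traffic` counter, the function names, and the "one element moved per read or write" cost model are all assumptions made up for illustration.

```python
import math


class Traffic:
    """Hypothetical cost model: counts simulated kernel launches and
    elements moved to or from global memory."""

    def __init__(self):
        self.launches = 0
        self.elements_moved = 0


def residual_add(x, r, t):
    # Separate kernel 1: read x, read r, write h.
    t.launches += 1
    t.elements_moved += 3 * len(x)
    return [a + b for a, b in zip(x, r)]


def rms_norm(h, w, t, eps=1e-6):
    # Separate kernel 2: read h back from memory, read w, write y.
    t.launches += 1
    t.elements_moved += 3 * len(h)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return [v * inv_rms * g for v, g in zip(h, w)]


def fused_add_rms_norm(x, r, w, t, eps=1e-6):
    # One fused kernel: read x, r, w; write h and y.
    # h is assumed to stay on-chip between the add and the norm,
    # so it is never read back from global memory.
    t.launches += 1
    t.elements_moved += 5 * len(x)
    h = [a + b for a, b in zip(x, r)]
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / len(h) + eps)
    y = [v * inv_rms * g for v, g in zip(h, w)]
    return h, y


x = [0.5, -1.0, 2.0, 3.5]
r = [1.0, 1.0, -0.5, 0.25]
w = [1.0, 0.5, 2.0, 1.0]

unfused = Traffic()
h1 = residual_add(x, r, unfused)
y1 = rms_norm(h1, w, unfused)

fused = Traffic()
h2, y2 = fused_add_rms_norm(x, r, w, fused)

print(unfused.launches, unfused.elements_moved)  # 2 24
print(fused.launches, fused.elements_moved)      # 1 20
```

Both paths produce the same values; the fused one just halves the launches and skips re-reading the intermediate. Under this toy model the saving for a single pair of ops is small, but it compounds when every norm, residual, and RoPE step in every layer is its own launch.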