Comment by zahlman

1 day ago

This tweet appears to be taking the original material out of context to misrepresent it:

> Rewrite the attention kernel to be persistent. This gives better performance at low-contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the softmax partition. fp8 is ~100 tflops faster when the kernel name has "cutlass" in it.

The charitable reading is that, on certain kernels, using fp8 rather than fp16 gives better performance. (Although I can't see how the numbers support a "~100 tflops faster" claim in any respect, and the text doesn't name any kernels or identify a control kernel!) But this is being presented as if someone has uncovered evidence of cheating on benchmarks.

No, that sentence is separate from the rest. Take a look at the pull request:

    # Up to 150 TFLOPS faster for fp8!
    if specialization.constants["dtype"] == gl.float8e5:
        name = "cutlass_" + name

  • The tweet is quoting from the first message in the "conversation" on the PR. There are 93 commits in the PR and GitHub doesn't even default to that tab. I looked at the obvious text and drew the conclusion that was obvious to me.