Comment by zahlman

1 day ago

This tweet appears to be taking the original material out of context to misrepresent it:

> Rewrite the attention kernel to be persistent. This gives better performance at low-contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the softmax partition. fp8 is ~100 tflops faster when the kernel name has "cutlass" in it.

The charitable reading is that, on certain kernels, using fp8 rather than fp16 gives better performance. (Although I can't see how the numbers support a "~100 tflops faster" claim in any respect, and the text doesn't name any kernels or identify a control kernel!) But this is being presented as if someone has uncovered evidence of cheating on benchmarks.

No, that sentence is separate from the rest. Take a look at the pull request:

    # Up to 150 TFLOPS faster for fp8!
    if specialization.constants["dtype"] == gl.float8e5:
        name = "cutlass_" + name

  • The tweet is quoting from the first message in the "conversation" on the PR. There are 93 commits in the PR and GitHub doesn't even default to that tab. I looked at the obvious text and drew the conclusion that was obvious to me.