Comment by zahlman
1 day ago
This tweet appears to be taking the original material out of context to misrepresent it:
> Rewrite the attention kernel to be persistent. This gives better performance at low-contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the softmax partition. fp8 is ~100 tflops faster when the kernel name has "cutlass" in it.
The charitable reading is that, on certain kernels, using fp8 rather than fp16 values gives better performance. (Although I can't even see how the quoted text supports a "~100 tflops faster" claim, nor does it list any kernel names or suggest a control kernel!) But this is being presented as if someone has uncovered evidence of cheating on benchmarks.
No, that sentence is separate from the rest. Take a look at the pull request:
The tweet is quoting from the first message in the "conversation" on the PR. There are 93 commits in the PR and GitHub doesn't even default to that tab. I looked at the obvious text and drew the conclusion that was obvious to me.
I think you're the one doing that to the tweet, actually.
What are you talking about? When I view the tweet, the only text I see is:
> > fp8 is 100 tflops faster when the kernel name has "cutlass" in it
> kms
And it includes a link to show that this is the context it came from.
https://github.com/triton-lang/triton/pull/7298/commits/a5e2...
It's literally in the code.
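To be concrete, the kind of check being described is a branch keyed on the kernel's name, something in the spirit of this sketch (the function and option names here are made up for illustration, not taken from the diff):

    # Illustrative sketch only -- not the actual Triton code from the linked commit.
    # The point is the shape of the check: behaviour changes purely because of
    # what the kernel happens to be called.

    def pick_schedule(kernel_name: str) -> dict:
        """Choose (hypothetical) scheduling options for a kernel by name."""
        options = {"alternate_softmax_partition": False}
        if "cutlass" in kernel_name:
            # Name-based special case: same kernel, different compilation
            # behaviour, keyed only on the substring "cutlass".
            options["alternate_softmax_partition"] = True
        return options

    print(pick_schedule("attn_fwd"))          # {'alternate_softmax_partition': False}
    print(pick_schedule("cutlass_attn_fwd"))  # {'alternate_softmax_partition': True}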
I already had to deal with Twitter and a link-shortening service just to get to GitHub, and even then it only pointed to the landing page of a 93-commit PR.