FP8 is ~100 tflops faster when the kernel name has "cutlass" in it

18 hours ago (twitter.com)

108 comments

limoce

In `libnvidia-nvvm.so` the string `cutlass` appears right after `Memory Dependence Analysis` and `memdep`. Perhaps it acts as an optimization attribute of some sort, where the compiler is allowed to make assumptions about the kernel's behavior that are not valid in general?

jdright 16 hours ago
yes, that is a very usual way (known practices) of vendors applying specific optimizations for known things.
It is also part of the benchmarks game they play against each other.
- MBCook 14 hours ago
  
  The link is long dead and the Wayback machine doesn’t have a copy.
  But in 2001 ATI was caught applying optimizations to Quake 3 when someone realized if you renamed the executable from “quake” to “quack” the score dropped a ton. It was a big scandal.
  I know that’s common now but that wasn’t a thing that was done at the time.
  
  34 replies →
- MichaelZuo 15 hours ago
  
  It’s really strange for established companies to waste their credibility on games like that…
  
  2 replies →
high_na_euv 17 hours ago

Thats very likely imo

KomoD 18 hours ago

actual link: https://github.com/triton-lang/triton/pull/7298

bede 17 hours ago

Thank you, perhaps the parent can be edited to use this URL instead

PLenz 17 hours ago

The Volkswagon emissions testing model

fambalamboni 13 hours ago

[dead]

spoaceman7777 13 hours ago

Seems this is likely due to ongoing work on FP8 support on nvidia/cutlass. From my reading, the alternative code path was likely added recently for testing by external contributors to the cutlass project, and other involved parties. (Rather than attempting to distribute custom packaged internal builds of cuda.)

This ticket is a good starting place to see the chain of issues around the ongoing work: https://github.com/NVIDIA/cutlass/pull/2037

Xss3 3 hours ago

The real answer

tempaway43563 17 hours ago

So, what is Cutlass, can someone explain whether checking for kernel names makes sense here or is a form of cheating?

https://docs.nvidia.com/cutlass/index.html

gpm 16 hours ago

Github version: https://github.com/NVIDIA/cutlass
I wonder if we search the comments if we can find something referencing this.
rurban 16 hours ago
That's strange because the cutlass docs explicitly does NOT mention fp8 support. So it looks like it can be used nevertheless with fp8 by using the name hack.
- mlazos 16 hours ago
  
  It supports e5m2 and e4m3 right in the doc linked.

high_na_euv 17 hours ago

I have small experience with compilers and llvm but youd be shocked how many things rely on names and parsing names

If you have hundreds of passes that are complex and rely on various "contracts" like type names or some shit, then really crazy things like this can happen unintentionally and not maliciously

diggan 17 hours ago
Web-developers are well aware of this too. Sincerely, Mozilla/5.0 (X11; Linux x86_64; rv:139.0) Gecko/20100101 Firefox/139.0
- bravesoul2 16 hours ago
  
  Funny we send a browser wars tombstone in every request!
  
  1 reply →
the8472 14 hours ago

Some names are standardized items, like memcpy. Matching those is ok, nothing sneaky going on there. Matching something vendor-specific in a general-purpose API is different story.
halJordan 14 hours ago
Why would i be shocked that a name is informative. Like... are you surprised that wrought iron is wrought? Or cast iron is made from a cast?
- IAmBroom 14 hours ago
  
  Dog piles are often neither composed of dogs, nor actual piles.
  Names can be both informative, and misdirecting, at the same time.

orlp 18 hours ago

GenuineIntel moment.

reitzensteinm 17 hours ago
Or maybe Quack III: Arena. https://m.slashdot.org/story/21054
- bayindirh 15 hours ago
  
  Ooh, I remember this, but actually the thing is older than it.
  First, nVidia and ATI used executable names for detecting games, then they started to add heuristics.
  If you think they stopped the practice, you're very mistaken. Every AMD and nVidia driver has game and app specific fixes and optimizations.
  nVidia cheated in 3D Mark that way, so they patched/changed their benchmark to prevent it. Also, again nVidia, patched their drivers so some of the more expensive but visually invisible calls like scene flushes in a particular game is batched (e.g. do all 50 flushes at the 50th call) to prevent the game becoming a slide show on expensive hardware.
  This is also why AMDs and Intel's open source drivers under Linux a success, because they are vanilla drivers written from scratch per spec, and if your code calls OpenGL/Vulkan to spec, then you're golden.
  Even some companies cross compile AMD's Linux drivers for windows on embedded systems since they're free from useless optimizations from them.
- dahauns 15 hours ago
  
  Aah, that brings back memories...
  Interestingly, most benchmark controversies back in the day are now expected behaviour, i.e. game-specific optimizations with no (well, in this age of upscalers and other lossy optimization techniques, probably even somewhat) visible image degradation. A gaming-specific driver with no game-specific improvements in its changelog would be considered strange, and it very much works with executable detection.
  Back in the day, there was still the argument that drivers should not optimize for benchmarks even when visually identical, because it wouldn't show the hardware's real world potential. Kinda cute from today's perspective. :)
  But of course there were the obvious cases...
  The Quack3 lowering filtering quality as shown above, of course (at least that one was put into the driver as a togglable setting later on).
  But the most cheeky one has to be nVidia's 3dmark03 "optimizations", where they blatantly put static clip planes into the scenes so that everything outside the predefined camera path from the benchmark sequence would simply be cut from the scene early (which e.g. fully broke the freelook patched into 3dmark and would generally break any interactive application)
  
  1 reply →
- 42lux 16 hours ago
  
  Now I want a Quake shooter but with ducks.
  
  3 replies →
- iforgotpassword 17 hours ago
  
  I think that was the first case (to go public), but I remember reading about this in game magazines a couple times after this, for both ATI and nvidia.
hofrogs 17 hours ago
I'm interested in that story, what are you referring to with "GenuineIntel"?
- orlp 17 hours ago
  
  Intel's C++ compiler is known to add branches in its generated code checking if the CPU is "GenuineIntel" and if not use a worse routine: https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler#Support....
  
  6 replies →
_zoltan_ 16 hours ago

[flagged]

giingyui 17 hours ago

And what’s the downside of using that kernel name? It can’t just be that it’s faster and nothing else. Unless they included lots of sleep(x) calls.

samus 16 hours ago
There might be optimizations that are only safe for the code that this was an intender for.
- bialpio 8 hours ago
  
  Seems like a bad idea to rely on a name for deciding this then, unless it's documented somewhere that using names containing certain substrings may trigger unsafe optimizations...

koakuma-chan 18 hours ago

is 100 tflops a lot?

saagarjha 17 hours ago
It's like 5-10% here
- irrelative 14 hours ago
  
  Correct, this is the actual headline too. 100 tflops sure seems like it'd be more than that, but here we are.
  If the headline was "FB8 is ~7% faster when kernel name has 'cutlass' in it...", it wouldn't seem sensational.
  
  1 reply →
progx 17 hours ago

5060 ti +~15%
brightmood 17 hours ago

yea
HideousKojima 15 hours ago
According to Terminator 3 Skynet used a mere 60 TFLOPS
- IAmBroom 14 hours ago
  
  How much is that in jiggawatts per parsec?

Arch-TK 17 hours ago

I wish people either learned how to use git or just wholesale stopped using it.

leoh 5 hours ago

Context?

rowanG077 17 hours ago

Let's hope for Nvidia this is an innocent optimization only valid for internal kernels that cannot be applied in general.

jagrsw 16 hours ago

In which case checking for a string inside arbitrary name is sloppy (a bug).

arzookanak 16 hours ago

[dead]

nolok 18 hours ago

Intel's quest to move from "trusted by default / the reference" to "check for scam" is getting worse every release. And it's 100% self inflicted. How weird.

aleph_minus_one 17 hours ago

In my understanding of the PR, it rather seems that it is NVidia is the company that is cheating. :-)
pkhuong 17 hours ago

NVIDIA-inflicted in this case.

zahlman 17 hours ago

This tweet appears to be taking the original material out of context to misrepresent it:

> Rewrite the attention kernel to be persistent. This gives better performance at low-contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the softmax partition. fp8 is ~100 tflops faster when the kernel name has "cutlass" in it.

The charitable reading is that, on certain kernels, using fp8 rather than fp16 values gives better performance. (Although I can't even see how the numbers relate to a "~100 tflops faster" claim in any respect, nor does it even list any kernel names or suggest a control kernel!) But this is being presented as if someone has uncovered evidence of cheating on benchmarks.

zettabomb 16 hours ago
No, that sentence is separate from the rest. Take a look at the pull request:
# Up to 150 TFLOPS faster for fp8! if specialization.constants["dtype"] == gl.float8e5: name = "cutlass_" + name
- zahlman 11 hours ago
  
  The tweet is quoting from the first message in the "conversation" on the PR. There are 93 commits in the PR and GitHub doesn't even default to that tab. I looked at the obvious text and drew the conclusion that was obvious to me.
saagarjha 16 hours ago
I think you're the one doing that to the tweet, actually.
- zahlman 11 hours ago
  
  What are you talking about? When I view the tweet, the only text I see is:
  > > fp8 is 100 tflops faster when the kernel name has "cutlass" in it
  > kms
  
  4 replies →
imtringued 16 hours ago
https://github.com/triton-lang/triton/pull/7298/commits/a5e2...
It's literally in the code.
- zahlman 11 hours ago
  
  I already had to deal with Twitter and a link shortening service just to get to GitHub and then it still only pointed to the facing page of a 93-commit PR.