Comment by rfoo

10 hours ago

I'd recommend having a "gemm with a twist" [0] example in the README.md instead of having an element-wise example. It's pretty hard to evaluate how helpful this is for AI otherwise.

[0] For example, gemm but the lhs is in fp8 e4m3 and rhs is in bf16 and we want fp32 accumulation, output to bf16 after applying GELU.
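To make the footnote concrete, here is a minimal CPU reference sketch in plain Python of what such a "gemm with a twist" computes. It is not CubeCL code and says nothing about how CubeCL would express it. All function names are hypothetical, and it assumes the OCP "E4M3" encoding (bias 7, no infinities, a single NaN pattern per sign), the exact erf-based GELU, and round-to-nearest-even when narrowing to bf16.

```python
import math
import struct


def fp8_e4m3_to_float(byte):
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # assumed OCP E4M3: only NaN, no infinities
    if exp == 0:
        return sign * man * 2.0 ** -9  # subnormal: (man / 8) * 2^-6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def round_to_f32(x):
    """Round a Python float (a double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_bf16(x):
    """Round to the nearest bf16 value (ties to even), returned as a float."""
    if math.isnan(x):
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round half to even on the low 16 bits
    bits &= 0xFFFF0000                   # keep the top 16 bits (the bf16 payload)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gelu(x):
    """Exact (erf-based) GELU; a real kernel might use the tanh approximation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gemm_fp8_bf16_gelu(lhs_fp8, rhs_bf16, m, k, n):
    """lhs_fp8: m*k raw E4M3 bytes, row-major.
    rhs_bf16: k rows of n floats, assumed already representable in bf16.
    Returns m rows of n floats, each rounded to bf16 after GELU."""
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0
            for p in range(k):
                a = fp8_e4m3_to_float(lhs_fp8[i * k + p])
                b = rhs_bf16[p][j]
                acc = round_to_f32(acc + a * b)  # emulate fp32 accumulation
            row.append(round_to_bf16(gelu(acc)))
        out.append(row)
    return out


# 1x2 lhs holding [1.0, 2.0] as E4M3 bytes, times a 2x1 bf16 rhs.
result = gemm_fp8_bf16_gelu(bytes([0x38, 0x40]), [[0.5], [0.25]], 1, 2, 1)
print(result)
```

A reference like this is only useful for checking a GPU kernel's output on small inputs. The interesting part for a README would be how the mixed-precision loads, the fp32 accumulator, and the fused GELU epilogue are written in the kernel language itself.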

Agreed! I was looking through the summation example <https://github.com/tracel-ai/cubecl/blob/main/examples/sum_t...> and it seems like the primary focus is on the more traditional pre-2018 GPU programming without explicit warp-level operations, asynchrony, atomics, barriers, or countless tensor-core operations.

The project feels very nice and it would be great to have more notes in the README on the excluded functionality to better scope its applicability in more advanced GPGPU scenarios.

  • We support warp operations, barriers for CUDA, atomics for most backends, and tensor core instructions as well. It's just not well documented in the README!

  • CubeCL is the computation backend for Burn (https://burn.dev/), an ML framework by the same team that does all the tensor magic like autodiff, op fusion, and dynamic graphs.

We don't yet support newer types like fp8 and fp4; that's actually my next project. I'm the only contributor with the hardware to actually use the new types, so it's a bit bottlenecked on a single person right now. But yes, the example is rather simplistic. I should probably work on that some time, once I'm done updating the feature set to Blackwell.

  • Isn't there a CPU-based "emulator" in Nvidia dev tools?

    • From what I can tell it's not accurate enough to catch a lot of errors in the real world. Maybe an illegal instruction, but not a race condition from a missing sync or a warp divergence on a uniform instruction or other potential issues like that.

One of the main authors here. The README isn't really up to date. We have our own gemm implementation based on CubeCL. It's still changing a lot, but we support tensor cores, use warp operations (Plane Operations in CubeCL), and we even added TMA instructions for CUDA.