Comment by ashvardanian
6 hours ago
Agreed! I was looking through the summation example (<https://github.com/tracel-ai/cubecl/blob/main/examples/sum_t...>) and it seems like the primary focus is on more traditional, pre-2018-style GPU programming, without explicit warp-level operations, asynchrony, atomics, barriers, or the many tensor-core operations.
The project looks very nice, and it would be great to have notes in the README on the excluded functionality, to better scope its applicability to more advanced GPGPU scenarios.
We support warp operations, barriers for CUDA, atomics for most backends, and tensor core instructions as well. It's just not well documented in the README!
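For a taste of the warp-level side, a plane reduction (CubeCL's term for warp/subgroup operations) looks roughly like this. This is a minimal sketch, untested here; I'm assuming the `#[cube(launch)]` macro, the `plane_sum` intrinsic, and the `ABSOLUTE_POS`/`UNIT_POS`/`CUBE_POS` indices from cubecl's prelude, in the spirit of the repo's examples:

```rust
use cubecl::prelude::*;

// Warp-level ("plane" in CubeCL terms) reduction: every unit contributes
// one element, and plane_sum combines them without shared memory.
#[cube(launch)]
fn plane_reduce<F: Float>(input: &Array<F>, output: &mut Array<F>) {
    let value = input[ABSOLUTE_POS];
    // plane_sum is the warp/subgroup reduction intrinsic.
    let total = plane_sum(value);
    // Only the first unit of each cube writes out the partial sum.
    if UNIT_POS == 0 {
        output[CUBE_POS] = total;
    }
}
```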
CubeCL is the computation backend for Burn (https://burn.dev/), an ML framework by the same team, which handles all the tensor magic like autodiff, op fusion, and dynamic graphs.
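To give a sense of that layering, here's a rough sketch of autodiff on top of the CubeCL-backed WGPU backend; the exact API may differ slightly from current Burn, so treat it as illustrative:

```rust
use burn::backend::{Autodiff, Wgpu};
use burn::tensor::Tensor;

// Wrap the CubeCL-backed WGPU backend with the autodiff decorator.
type B = Autodiff<Wgpu>;

fn main() {
    let device = Default::default();
    // y = sum(x^2), so dy/dx = 2x.
    let x = Tensor::<B, 1>::from_floats([1.0, 2.0, 3.0], &device).require_grad();
    let y = x.clone().powf_scalar(2.0).sum();
    let grads = y.backward();
    // Expect the gradient [2.0, 4.0, 6.0].
    println!("{}", x.grad(&grads).unwrap());
}
```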