
Comment by t55

10 months ago

Triton sits between CUDA and PyTorch in terms of abstraction level and is built to work smoothly within the PyTorch ecosystem. In CUDA, on the other hand, you can directly manipulate warp-level primitives and fine-tune memory prefetching to reduce latency in, e.g., attention algorithms, a level of control that Triton and PyTorch don't offer AFAIK.
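
To make "warp-level primitives" concrete: below is a minimal sketch (my own toy example, not taken from Triton, PyTorch, or any attention implementation) of a warp-wide sum reduction using the __shfl_down_sync intrinsic. This register-to-register exchange is the kind of thing you write directly in CUDA, whereas Triton's block-level programming model handles that layer for you.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Warp-wide sum reduction via shuffle intrinsics: the 32 lanes of a warp
    // exchange values register-to-register, no shared memory involved.
    __global__ void warp_sum(const float* in, float* out) {
        float v = in[threadIdx.x];
        // Tree reduction across the warp; after the loop, lane 0 holds the sum.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (threadIdx.x == 0) *out = v;
    }

    int main() {
        float h_in[32], h_out = 0.0f;
        for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
        float *d_in, *d_out;
        cudaMalloc(&d_in, sizeof(h_in));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        warp_sum<<<1, 32>>>(d_in, d_out);
        cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", h_out);  // prints 32.000000
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }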

MLIR's Python extensions do, though, as far as I could tell from the LLVM developer meeting.