Comment by DiabloD3
2 months ago
You're conflating two different things.
ROCm isn't part of AMD's drivers; it's a software library that helps you support legacy compute APIs and the BLAS/GEMM/LAPACK end of things.
The part of ROCm you're interested in is HIP; HIP is the part that does legacy CUDA emulation. HIP will never be complete, because Nvidia keeps adding new things and documenting things wrong, and because the "cool" stuff people do on Nvidia cards isn't CUDA at all but hand-written PTX; emulating PTX is out of scope for HIP, since PTX is strongly tied to how historical Nvidia architectures worked and would be entirely inappropriate for AMD architectures.
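To make the "CUDA emulation" point concrete: HIP is a source-level clone of the CUDA runtime API, not a PTX emulator. A minimal sketch of my own (not from ROCm's docs); everything here is the standard CUDA idiom with cuda* swapped for hip*:

    #include <hip/hip_runtime.h>

    // SAXPY kernel: character-for-character what the CUDA version looks like.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x = nullptr, *y = nullptr;
        hipMalloc(&x, n * sizeof(float));   // mirrors cudaMalloc
        hipMalloc(&y, n * sizeof(float));
        hipMemset(x, 0, n * sizeof(float)); // mirrors cudaMemset
        hipMemset(y, 0, n * sizeof(float));
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y); // same launch syntax
        hipDeviceSynchronize();
        hipFree(x);
        hipFree(y);
        return 0;
    }

That's the level HIP operates at; code that instead relies on inline PTX assembly has nothing for HIP to translate.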
The whole thing with Tinygrad's "driver" isn't a driver at all; it's the infrastructure to handle card-to-card ccNUMA on PCI-E-based systems, which AMD does not support: if you want that, you buy into the big-boy systems whose GPUs communicate over Infinity Fabric (which is itself the HyperTransport protocol run over a PCI-E PHY instead of a HyperTransport PHY; the plain PCI protocol over PCI-E has no meaningful way to handle ccNUMA).
Extremely few customers, AMD's or not, want to share VRAM directly over PCI-E across GPUs, since most PCI-E GPU customers are single-GPU. Customers with massive multi-GPU deployments have bought into the ecosystem of their preferred vendor (i.e., Nvidia's Mellanox-powered fabrics, or AMD's wall-to-wall Infinity Fabric).
That said, AMD does want to support it if they can, and Tinygrad isn't interested in waiting for an engineer at AMD to add it, so they're pushing ahead and adding it themselves.
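To be clear about the distinction: plain peer-to-peer VRAM mapping is already expressible at the API level, and that is not the hard part. A hypothetical HIP sketch (device numbers assumed) of the plain P2P case:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        // Ask whether GPU 0 can map GPU 1's VRAM over the bus at all.
        int canAccess = 0;
        hipDeviceCanAccessPeer(&canAccess, 0, 1);
        if (!canAccess) {
            printf("no peer access between GPU 0 and GPU 1\n");
            return 1;
        }
        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0); // flags must be 0
        // Kernels on GPU 0 can now dereference pointers into GPU 1's VRAM,
        // but every access travels over the bus and nothing keeps the two
        // cards' caches coherent; that coherence is the ccNUMA part that's
        // missing on PCI-E.
        return 0;
    }

The hard part is the cache-coherent NUMA semantics on top of that, which is what Tinygrad is building infrastructure for.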
Also, part of Tinygrad's problem is that they want this available from ROCm/HIP instead of a standards-compliant modern API. ROCm/HIP still has not been ported to the modern shader compiler the AMD driver uses (i.e., the one used for OpenGL, Vulkan, and the Direct3D family of APIs), since it originally came from an unrelated engineering team that isn't part of the driver team.
The big push inside AMD currently is to unify these efforts: massively simplify ROCm/HIP, axe all the redundant parts, and reduce it to purely a SPIR-V code generator or similar. This would probably help projects like Tinygrad someday, but not today.
> ROCm isn't part of AMD's drivers; it's a software library that helps you support legacy compute APIs and the BLAS/GEMM/LAPACK end of things.
AMD says otherwise:
> AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.
https://www.amd.com/en/products/software/rocm.html
The issues involving AMD hardware applied not only to the drivers, but also to the firmware below the drivers:
https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...
Tinygrad’s software looks like a userland driver:
https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...
It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.
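For a sense of what that looks like mechanically, a generic Linux sketch (mine, not tinygrad's code; the PCI address and register offset are placeholders): a sufficiently privileged process can mmap a GPU's PCI BAR through sysfs and read or write registers directly, with no kernel driver of its own.

    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        // Placeholder device path; resource0 is the card's first PCI BAR.
        const char* bar = "/sys/bus/pci/devices/0000:03:00.0/resource0";
        int fd = open(bar, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        const size_t len = 4096; // map one page of register space
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        volatile uint32_t* regs = static_cast<volatile uint32_t*>(p);

        uint32_t v = regs[0x10 / 4]; // read a register (placeholder offset)
        printf("reg 0x10 = 0x%08x\n", (unsigned)v);
        regs[0x10 / 4] = v;          // ...and write one back

        munmap(p, len);
        close(fd);
    }

Build firmware loading, memory management, and command submission on top of that, and you have most of what a kernel driver does, just running in a process.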
AMD is extremely bad at communications. The driver already contains everything ROCm requires to talk to the GPU, and ROCm itself is only an SDK containing runtimes, libraries, and compilers.
This part of TinyGrad is not a driver; however, it hijacks the process to do part of that task. You cannot boot the system with it, and it does not replace any part of the Mesa/DRI/DRM/KMS stack. It does reinitialize the hardware with different firmware, which might be why you think it is a driver.
I consider it to be a driver, or at least part of one. Userspace drivers exist. Graphics drivers were originally entirely in userspace, until portions of them were moved into the kernel for kernel mode setting and DRM. These days, graphics drivers have both kernel-mode and user-mode components; the shader compiler, for example, is a user-mode component.
https://community.amd.com/t5/ai/what-s-new-in-amd-rocm-6-4-b...
> ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from ROCm user space.
They were doing this before; the difference is that previously, the version of ROCm you used was locked to the driver versions it supported, which was a very narrow range. With this new arrangement, the backend API is formalized, making it easier to support a wider range of driver versions.