Comment by latchkey
2 months ago
AMD is consistently stacking more HBM.
H100 80GB HBM3
H200 141GB HBM3e
B200 192GB HBM3e
MI300x 192GB HBM3
MI325x 256GB HBM3e
MI355x 288GB HBM3e
This means that you can fit larger and larger models into a single node, without having to go out over the network. The memory bandwidth on AMD is also quite good.
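As a rough illustration of the "fits in a single node" point, here is a minimal sketch using the per-GPU capacities listed above. The 8-GPUs-per-node figure, the 2 bytes per parameter (16-bit weights), and the 1.2x overhead factor for KV cache and activations are all assumptions for illustration, not numbers from this thread.

```python
# HBM capacity per GPU in GB, taken from the list above.
HBM_GB = {
    "H100": 80,
    "H200": 141,
    "B200": 192,
    "MI300x": 192,
    "MI325x": 256,
    "MI355x": 288,
}


def fits_in_node(params_billion, bytes_per_param=2, gpus_per_node=8, overhead=1.2):
    """Return {gpu_name: True/False} for whether the model fits in one node.

    All defaults are assumptions: 16-bit weights, 8 GPUs per node, and a
    flat 1.2x multiplier standing in for KV cache and activation memory.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB = GB
    needed_gb = params_billion * bytes_per_param * overhead
    return {
        gpu: needed_gb <= per_gpu_gb * gpus_per_node
        for gpu, per_gpu_gb in HBM_GB.items()
    }


if __name__ == "__main__":
    for size in (405, 671):
        print(size, "B params:", fits_in_node(size))
```

Under those assumptions, a 405B-parameter model does not fit in an 8x H100 node but does fit in the others, and a 671B-parameter model only fits in the MI325x and MI355x nodes.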
It really does not matter how much memory AMD has if the drivers and firmware are unstable. To give one example from last year:
https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...
They are currently developing their own drivers for AMD hardware because of the headaches that they had with ROCm.
"driver" is such a generic word. tinygrad works on mi300x. If you want to use it, you can. Negates your point.
Additionally, ROCm is a giant collection of a whole bunch of libraries. Certainly there are issues, as with any large collection of software, but the critical thing is whether or not AMD is responsive towards getting things fixed.
In the past, it was a huge issue: AMD would routinely ignore developers, and bugs would never get fixed. But, after that SA article, Lisa lit a fire under Anush's butt and he's taking ownership. It is a major shift in the entire culture at the company. They are extremely responsive and getting things fixed. You can literally tweet your GH issue to him and someone will respond.
What was true a year ago isn't true today. If you're paying attention like I am, and experiencing it first hand, things are changing, fast.
I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that they are fixed, only for new evidence to come out that they are not. I have no reason to believe they are fixed given the history.
Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.
That was last year. MI300x firmware and software have gotten much better since then.
Unfortunately, AMD and ATI before it have had driver quality issues for decades, and both they and their fans have claimed every year since that the problems are solved.
Even if they have made progress, I doubt that they have reached parity with Nvidia. I have had enough false hope from them that I am convinced that the only way that they will ever improve their drivers is if they let another group write the drivers for them.
Coincidentally, Valve has been developing the Vulkan driver used by SteamOS and other Linux distributions, which is how SteamOS is so much better than Windows. If AMD could get someone else to work on improving their GPGPU support, we would likely see it become quite good too. Until then, I have very low expectations.
So the MI300x has 8 different memory domains, and although you can treat it as one flat memory space, if you want to reach their advertised peak memory bandwidth you have to work with it like an 8-socket board.
Here is a good article on it:
https://rocm.blogs.amd.com/software-tools-optimization/compu...
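To make the "work with it like an 8-socket board" idea concrete, here is a minimal sketch of the partitioning arithmetic only: splitting a buffer into one contiguous shard per memory domain so that each worker touches just its own shard. `shard_ranges` is a hypothetical helper, not a ROCm API, and how shards actually map onto physical memory domains depends on the hardware and driver configuration, which this sketch does not model.

```python
def shard_ranges(n_elements, n_domains=8):
    """Split [0, n_elements) into n_domains contiguous, near-equal ranges.

    Hypothetical helper for illustration. Returns a list of (start, end)
    pairs, end exclusive, one per memory domain.
    """
    base, extra = divmod(n_elements, n_domains)
    ranges = []
    start = 0
    for domain in range(n_domains):
        # The first `extra` domains take one additional element each.
        size = base + (1 if domain < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for domain, (lo, hi) in enumerate(shard_ranges(1_000_003)):
        print(f"domain {domain}: elements [{lo}, {hi})")
```

The same pattern is what you would do on a multi-socket NUMA machine: keep each worker's reads and writes inside its own range rather than striding across the whole buffer.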
MI355X isn't out yet, and the upcoming B300 also has 288GB HBM3e
June 12th.
B300 is Q4 2025.
Yes, they keep leapfrogging each other. AMD is still ahead in vram.