Comment by ryao

2 months ago

Ask Eric to consider writing a new blog post discussing the state of LLM training on AMD hardware. I would be very interested in reading what he has to say.

AMD catching up in CPUs required that they become competent at hardware development. AMD catching up in the GPGPU space would require that they become competent at software development. They have a long history of incompetence when it comes to software development. Here are a number of things Nvidia has done right contrasted with what AMD has done wrong:

  * Nvidia aggressively hires talent. It is known for hiring freshly minted PhDs in areas relevant to them. I heard this firsthand from a CS professor whose specialty was in compilers who had many former students working for Nvidia. AMD is not known for aggressive hiring. Thus, they have fewer software engineers to put on tasks.

  * Nvidia has a unified driver, which reduces duplication of effort, so their software engineers can focus on improving things. AMD maintains separate drivers for each platform. AMD tried partial unification with Vulkan, but it took too long to develop, so the Linux community wrote its own driver, and almost nobody uses AMD’s unified Vulkan driver on Linux. Instead of killing their effort and adopting the community driver for both Linux and Windows, they continued developing a driver that is now used mostly on Windows.

  * Nvidia has a unified architecture, which further deduplicates work. AMD split their architecture into RDNA and CDNA, and thus must implement the same things for each where the two overlap. They realized their mistake and are making UDNA, but the damage is done and they are behind because of their RDNA+CDNA misadventures. It will not be until 2026 that UDNA fixes this.

  * Nvidia proactively runs static analysis tools, such as Coverity, on their driver. This became public when Nvidia open-sourced the kernel part of their Linux driver. I recall a Linux kernel developer who works on static analysis begging the amdgpu kernel driver developers to run static analysis tools on their driver, since many obvious issues being caught by such tools were going unaddressed.
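
To illustrate the class of "obvious issue" such tools catch, here is a minimal, hypothetical C sketch (all function names are invented for illustration, not taken from any real driver): a Coverity-style null-return checker flags the unguarded dereference in `name_len_buggy`, while the guarded version passes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup that can fail: returns NULL for unknown ids.
 * (Invented for illustration; not from any real driver.) */
static const char *get_name(int id) {
    return (id == 0) ? "gpu0" : NULL;
}

/* A static analyzer (e.g. a null-return checker) would flag this:
 * get_name() may return NULL, but its result is dereferenced
 * without a check. */
size_t name_len_buggy(int id) {
    return strlen(get_name(id));
}

/* The guarded version the analyzer would accept. */
size_t name_len_fixed(int id) {
    const char *n = get_name(id);
    return n ? strlen(n) : 0;
}
```

Bugs like this compile cleanly and may work in testing (as long as `id` is valid), which is exactly why they linger unaddressed without static analysis.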

There are big differences between how Nvidia and AMD do engineering that make AMD’s chances of catching up slim. That is likely to be the case until they start behaving more like Nvidia in how they do engineering. They are slowly moving in that direction, but so far, it has been too little, too late.

By the way, AMD’s software development incompetence applies to the CPU side of their business too. They had numerous USB issues on the AM4 platform due to bugs in AGESA/UEFI. There were other glitches too, such as memory incompatibilities. End users generally had to put up with it, although AMD, in conjunction with some motherboard vendors, managed to fix some of the issues. I had an AM4 machine that would not boot reliably with 128GB of RAM; after years of suffering, the problem persisted until I replaced the motherboard with one of the last AM4 motherboards made. Then there was this incompetence, which even affected AM5:

https://blog.desdelinux.net/en/Entrysign-a-vulnerability-aff...

AMD needs to change a great deal before they have any hope of competing with Nvidia GPUs in HPC. The only thing going for them in HPC GPUs is relatively competent GPU hardware design. Everything else about their GPUs has been a disaster. I would not be surprised if Intel manages to become a major player in the GPU market before AMD manages to write good drivers. Intel, unlike AMD, has a history of competent software development. The major black mark on that history is the initial Windows Arc drivers, but they were able to fix a remarkable number of issues in the time since their discrete GPU launch and have fairly good drivers on Windows now. Unlike AMD, they did not have a history of incompetence, so the idea that they fixed the vast majority of issues is not hard to believe. Intel will likely continue to have good drivers once they have made competitive hardware to pair with them, provided that they have not laid off their driver developers.

I have more hope in Intel than in AMD, and I say that despite knowing how bad Intel is at doing anything other than CPUs. No matter how bad Intel is at branching into new areas, AMD is even worse at software development. On the bright side, Intel’s GPU IP has a dual role, since it is needed for their CPUs’ iGPUs, so Intel must do the one thing they almost never do when branching into new areas: iterate. The cost of R&D is thus mostly covered by their iGPUs, and they can continue iterating on their discrete graphics until it is a real contender in the market. I hope that they merge Gaudi into their GPU development effort, since iterating on Arc is the right way forward. I think Intel having an “AMD moment” in GPUs is less of a long shot than AMD’s recovery from the AM3 fiasco, and less of a long shot than AMD becoming competent at driver development before Intel either becomes good at GPGPU or goes out of business.

Trying to find fault over UDNA is hilarious, they literally can't win with you.

My business model is to support viable alternatives. If someone else comes along and develops something that looks viable and there is customer demand for it, I'll deploy it.

You totally lost me at having more hope with Intel. I'm not seeing it. Gaudi 3 release was a nothing burger and is only recently deployed on IBM Cloud. Software is the critical component and if developers can't get access to the hardware, nobody is going to write software for it.

  • I fixed some autocorrect typos that were in my comment. I do not find fault with UDNA and I have no idea why you think I do. I find fault with the CDNA/RDNA split. UDNA is what AMD should have done in the first place.

    As for Gaudi 3, I think it needs to be scrapped and used as an organ donor for Arc. In particular, the interconnect should be reused in Arc. That would be Intel’s best chance of becoming competitive with Nvidia.

    As for AMD becoming competitive with Nvidia, their incompetence at software engineering makes me skeptical. They do not have enough people. The people they do have are divided among too many redundant efforts. They do not have their people following good software engineering practices such as static analysis. They also work the people they do have long hours (or so I have read), which of course results in more bugs. They need a complete culture change to have any chance of catching up to Nvidia on the software side of things.

    As for Intel, they have a good software engineering culture. They just need to fix the hardware side of things, and I consider that much less of a stretch than AMD becoming good at software engineering. Their recent Battlematrix announcement is a step in the right direction. They just need to keep improving their GPUs and add an interconnect to fill the role of NVLink.