Comment by fancyfredbot
3 days ago
There is more than one way to answer this.
They have made an alternative to the CUDA language with HIP, which can do most of what CUDA can.
You could say that they haven't released supporting libraries like cuDNN, but they are making progress on this with AITER, for example.
You could say that they have fragmented their efforts across too many different paradigms, but I don't think that's it, because Nvidia also supports a lot of different programming models.
I think the reason is that they have not prioritised support for ROCm across all of their products. There are too many different architectures with varying levels of support. This isn't just historical. There is no ROCm support for their latest AI Max 395 APU. There is no nice cross architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, and so it's a second class citizen.
This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.
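To make the HIP point above concrete, here's a minimal sketch of a HIP vector add (illustrative only, not tested on any particular card); it's almost line-for-line what you'd write in CUDA, which is exactly what HIP is for:

    // Minimal HIP vector add -- swap hip* for cuda* and this is (almost) verbatim CUDA.
    // Device buffer initialization omitted for brevity.
    #include <hip/hip_runtime.h>

    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        hipMalloc(&a, n * sizeof(float)); // cudaMalloc in CUDA
        hipMalloc(&b, n * sizeof(float));
        hipMalloc(&c, n * sizeof(float));
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n); // same launch syntax as CUDA
        hipDeviceSynchronize();           // cudaDeviceSynchronize in CUDA
        hipFree(a); hipFree(b); hipFree(c);
        return 0;
    }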
This ticket, finally closed after being open for 2 years, is a pretty good microcosm of this problem:
https://github.com/ROCm/ROCm/issues/1714
Users complaining that the docs don't even specify which cards work.
But it goes deeper - a valid complaint is that "this only supports one or two consumer cards!" A common rebuttal is that it works fine on lots of AMD cards if you set some environment flag to force the GPU architecture selection. The fact that this is so close to working on a wide variety of hardware, and yet doesn't, is exactly the vibe you get with the whole ecosystem.
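(For reference, the environment flag people usually mean is HSA_OVERRIDE_GFX_VERSION, e.g. HSA_OVERRIDE_GFX_VERSION=10.3.0 to make an officially unsupported RDNA2 card report as gfx1030. A quick sketch, with the caveat that behavior varies across ROCm versions, of how you'd check which architecture the HIP runtime actually sees:)

    // Print what the HIP runtime reports for each visible device.
    // Run with e.g. HSA_OVERRIDE_GFX_VERSION=10.3.0 in the environment
    // to see the override take effect.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            std::printf("no HIP devices visible\n");
            return 1;
        }
        for (int d = 0; d < count; ++d) {
            hipDeviceProp_t prop;
            hipGetDeviceProperties(&prop, d);
            std::printf("device %d: %s (%s)\n", d, prop.name, prop.gcnArchName);
        }
        return 0;
    }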
What I don't get is why they don't at least assign a dev or two to make the poster child of this work: llama.cpp
It's the first thing anyone tries when dabbling in AI or compute on the GPU, yet it's a clusterfuck to get working. A few blessed cards work, with the proper drivers and kernel; others just crash, perform horribly slowly, or output GGGGGGGGGGGGGG to every input (I'm not making this up!). Then you LOL, dump it, go buy Nvidia, et voilà, stuff works on the first try.
It does work, I have it running on my Radeon VII Pro
6 replies →
I suspect part of it is also that Nvidia actually does a lot of things in firmware that can be upgraded. The new Nvidia Linux drivers (the "open" ones) support Turing cards from 2018. That means chips that old already do much of the processing in firmware.
AMD keeps having issues because their drivers talk to the hardware directly, which makes them massive, bloated messes, famous for pages of auto-generated register definitions. That likely makes it much more difficult to fix anything.
Having worked at both Nvidia and AMD I can assure you that they both feature lots of generated header files.
Hmm, that is interesting. Can you elaborate on what exactly is different between them?
I'm asking because I think firmware has to talk to hardware directly through the lower HAL (hardware abstraction layer), while customer-facing parts should be fairly isolated in the upper HAL. Some companies like to add direct HW access to the customer interface via more complex functions (often a recipe made out of lower-HAL functions), which I've always disliked. I prefer to isolate the lower-level functions and memory space from the user, roughly like the sketch below.
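A generic illustration of that layering (all names, addresses, and registers here are made up for the example, not any vendor's actual code):

    // Hypothetical layering sketch -- not any vendor's real API.
    #include <cstddef>
    #include <cstdint>

    // Lower HAL: the only code that touches hardware registers directly.
    namespace hal_lower {
        volatile std::uint32_t* const MMIO_BASE =
            reinterpret_cast<volatile std::uint32_t*>(0xF0000000); // made-up MMIO base
        inline void write_reg(std::size_t off, std::uint32_t v) { MMIO_BASE[off] = v; }
        inline std::uint32_t read_reg(std::size_t off) { return MMIO_BASE[off]; }
    }

    // Upper HAL: customer-facing recipes composed from lower-HAL calls;
    // no register offsets or raw memory escape this layer.
    namespace hal_upper {
        enum class ClockState { Idle, Full };
        inline void set_clock(ClockState s) {
            hal_lower::write_reg(0x10, s == ClockState::Full ? 0x1 : 0x0); // made-up register
        }
    }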
In any case, both Nvidia and AMD should have very similar FW capabilities. I don't know what I'm missing here.
2 replies →
I've thought about this myself and come to a conclusion that your link reinforces. As I understand it, most companies doing (EE) hardware design and production consider (CS) software to be a second-class citizen at the company. It looks like AMD, after all this time competing with NVIDIA, has not learned the lesson. That said, I have never worked in hardware, so I'm going on what I've heard from other people.
NVIDIA, while far from perfect, has easily kept its software quality ahead of AMD's for over 20 years, while AMD keeps falling on its face and getting egg all over itself, again and again, as far as software goes.
My guess is NVIDIA internally has found a way to keep the software people from feeling like they are "less than" the people designing the hardware.
Sounds easy, but apparently not. AKA a management problem.
This is correct, but one of the reasons is that the SWEs at HW companies live in their own bubble. They somehow don't follow developments in the rest of the SW world.
I'm a chip design engineer, and I get frustrated with the garbage the SW/FW teams come up with, to the extent that I write my own FW library for my blocks. While doing that I try to learn best practices and do quite a bit of research.
One other reason is that, until not long ago, SW meant only FW, which existed to serve the HW. So there was almost no input from SW into HW development. This is clearly changing, but some companies, like Nvidia, are ahead of the pack. Even Apple's SoC team is quite HW-centric compared to Nvidia's.
That reeks of gross incompetence somewhere in the organization. Like a hosting company whose customer, dealing with very poor performance, greatly overpays to avoid it, while the whole time nobody even thinks to check what the Linux swap file is doing.
This
I had a similar (I think) experience when building LLVM from source a few years ago.
I kept running into some problem with LLVM's support for HIP code, even though I had no interest in that functionality.
I realize this isn't exactly an AMD problem, but IIRC they were the ones who contributed the troublesome code to LLVM, and it remained unfixed.
Apologies if there's something unfair or uninformed in what I wrote, it's been a while.
Geez. If I were Berkshire Hathaway looking to invest in the GPU market, this would be a major red flag in my fundamentals analysis.
It's a little more complicated than ROCm simply not having support, because ROCm has at points claimed support and then had to walk it back, painfully, multiple times. It's not a driver issue, nor a hardware issue on AMD's side.
There has been a long-standing issue between AMD and its mainboard manufacturers. The issue has to do with features required for ROCm, namely PCIe Atomics. AMD has been unable or unwilling to hold the mainboard manufacturers to account for advertising features the mainboard does not support.
The CPU itself must support this feature, but the mainboard must as well (in firmware).
One of the reasons ROCm hasn't worked in the past is that mainboard manufacturers claimed and advertised support for PCIe Atomics, that claimed support has been shown to be false, and the software fails in non-deterministic ways when tested. This is nightmare fuel for the few AMD engineers tasked with ROCm.
PCIe Atomics require non-translated direct IO to operate correctly, and in order to support the same CPU models across multiple generations, manufacturers have translated these IO lines in firmware.
This has left most people who query their system seeing that PCIe Atomics are supported, while actual tests that rely on that support fail, in chaotic ways. The mainboard manufacturers provide no technical specification or advertising that shows whether this is supported. Even boards with multiple x16 slots and the related Crossfire/SLI/mGPU brandings don't necessarily indicate whether PCIe Atomics are properly supported.
In other words, the CPU is supported but the firmware/mainboard fails, with no way to differentiate between the two at the upper layers of abstraction.
All in all, you shouldn't be blaming AMD for this; you should be blaming the three mainboard manufacturers who chose to do this. Some of those manufacturers have upper-end boards where they actually did get this right; they just chose not to on any current-gen mainboard costing less than ~$300-500.
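(For context on where PCIe atomics actually bite: ROCm's fine-grained, coherent host memory lets a GPU kernel and the CPU synchronize through atomics on the same allocation. A sketch of that pattern, illustrative rather than a conformance test; on a board that mishandles PCIe atomics, this is exactly the kind of code that fails nondeterministically:)

    // GPU atomically bumps a flag that lives in coherent host memory;
    // on a discrete GPU this atomic crosses PCIe.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void signal(int* flag) {
        atomicAdd(flag, 1);
    }

    int main() {
        int* flag = nullptr;
        // Fine-grained, coherent host allocation visible to CPU and GPU.
        hipHostMalloc(&flag, sizeof(int), hipHostMallocCoherent);
        *flag = 0;
        signal<<<1, 1>>>(flag);
        hipDeviceSynchronize();
        std::printf("flag = %d (expect 1)\n", *flag);
        hipHostFree(flag);
        return 0;
    }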
Look, this sounds like a frustrating nightmare, but the way it seems to us consumers is that AMD chose to rely on poorly implemented and supported technology, and Nvidia didn't. I can't blame AMD for the poor support from motherboard manufacturers, but I can and will blame AMD for relying on it.
We won't know for sure unless someone from AMD comments on this, but in fairness there may not have been any other way.
Nvidia has a large number of GPU related patents.
The fact that AMD chose to design their system this way, in such a roundabout and brittle manner, contrary to how engineers approach things, may have been a direct result of being unable to design such systems any other way because of broad patents tied to the interface/GPU.
5 replies →
There are so many hardware certification programs out there, why doesn't AMD run one to fix this?
Create a "ROCm compatible" logo and a list of criteria. Motherboard manufacturers can send a pre-production sample to AMD along with a check for some token amount (let's say $1000). AMD runs a comprehensive test suite to check actual compatibility, if it passes the mainboard is allowed to be advertised and sold with the previously mentioned logo. Then just tell consumers to look for that logo if they want to use ROCm. If things go wrong on a mainboard without the certification, communicate that it's probably the mainboard's fault.
Maybe add some kind of versioning scheme to allow updating requirements in the future
How does NVIDIA manage this issue? I wonder whether they have a very different supply chain or just design software that puts less trust in the reliability of those advertised features.
I should point out here, if nobody has already: Nvidia's GPU designs are extremely complicated compared to what AMD and Apple ship. The "standard" is to ship a PCIe card with display-handling drivers and some streaming-multiprocessor hardware to process your framebuffers. Nvidia goes even further by adding additional accelerators (ALUs by way of CUDA cores and tensor cores), onboard RTOS management hardware (what Nvidia calls the GPU System Processor), and more complex userland drivers that might very well be able to manage atomics without relying on any PCIe standards.
This is also one of the reasons AMD and Apple can't simply turn their ship around right now. They've both invested heavily in simplifying their GPU and removing a lot of the creature-comforts people pay Nvidia for. 10 years ago we could at least all standardize on OpenCL, but these days it's all about proprietary frameworks and throwing competitors under the bus.
1 reply →
It's an open question they have never answered, AFAIK.
I would speculate that their design is self-contained in hardware.
AIUI, AMD documentation claims that the requirement for PCIe Atomics is due to ROCm being based on Heterogeneous System Architecture, https://en.wikipedia.org/wiki/Heterogeneous_System_Architect... which allows for a sort of "unified memory" (strictly speaking, a unified address space) across CPU and GPU RAM. Other compute APIs such as CUDA, OpenCL, SYCL, or Vulkan Compute don't have HSA as a strict requirement, but ROCm apparently does.
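A small sketch of what that unified address space buys in practice (assuming a ROCm setup where managed memory actually works): one pointer valid on both CPU and GPU, no explicit copies:

    // One managed allocation, touched by both CPU and GPU through the same pointer.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1024;
        float* x = nullptr;
        hipMallocManaged(&x, n * sizeof(float)); // single address space
        for (int i = 0; i < n; ++i) x[i] = 1.0f; // CPU writes...
        scale<<<(n + 255) / 256, 256>>>(x, n);   // ...GPU updates the same pointer
        hipDeviceSynchronize();
        std::printf("x[0] = %.1f (expect 2.0)\n", x[0]);
        hipFree(x);
        return 0;
    }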
So... how's Nvidia dealing with this? Or do they benefit from motherboard manufacturers doing preferential integration testing?
>This is a management and leadership problem.
It's easy (and mostly correct) to blame management for this, but it's such a foundational issue that even if everyone up to the CEO pivoted on every topic, it wouldn't change anything. They simply don't have the engineering talent to pull this off, because they somehow concluded that making stuff open source means someone else will magically do the work for you. Nvidia on the other hand has accrued top talent for more than a decade and carefully developed their ecosystem to reach this point. And there are only so many talented engineers on the planet. So even if AMD leadership wakes up tomorrow, they won't go anywhere for a looong time.
Even top tier engineers can be found eventually. The problem is if you never even start.
Of course the specific disciplines require quite an investment in their workers' knowledge, but it isn't anything insurmountable.
I wonder what would happen if they hired John Carmack to lead this effort.
He would probably be able to attract some really good hardware and driver talent.
Carmack has traditionally been anti-AMD and pro-Nvidia (at least regarding GPUs). I don't know if they could convince him even with all the money in the world, unless they fundamentally changed everything first.
> This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.
Yes. This kind of thing is unfortunately endemic in hardware companies, which don't "get" software. It's cultural and requires (a) a leader who does Get It and (b) one of those Amazon memos stating "anyone who does not Get With The Program will be fired".
I was able to compile ollama for AMD Radeon 780M GPUs, and I use it regularly on my AMD mini-PC, which cost me $500. It does require a bit more work. I get pretty decent performance with LLMs. That's just a qualitative statement, as I didn't do any formal testing, but the performance vibes are comparable to an NVIDIA 4050 GPU laptop I also use.
https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M...
Same here on a Lenovo ThinkPad 14s with an AMD Ryzen™ AI 7 PRO 360, which has a Radeon 880M iGPU. Works OK on Ubuntu.
Not saying it works everywhere, but it wasn't even that hard to set up, comparable to CUDA.
Hate the name though.
Nobody will come after you for omitting the tm
1 reply →
> There is no ROCm support for their latest AI Max 395 APU
Fire the CEO.