
Comment by latchkey

2 months ago

Exactly, it is not a driver.

Appreciate you diving more into my business. Yes, we are one of the few that publishes transparent pricing.

When we started, we got zero sales for a long time. Nobody knew whether these things performed or not. So we donated hardware, and people like ChipsAndCheese started to benchmark and write blog posts.

We knew the hardware was good, but the software sucked. 16 or so months later, things have improved enough that we are now at capacity. My deep involvement in this business is exactly how I know what’s going on.

Yes, I have a business to run, but at the same time, I was willing to take the risk when no one else would, and deploy this compute. To insinuate that I have some sort of conflict of interest is unfair, especially without knowing the full story.

At this juncture, I don’t know what point you’re trying to make. We agree the software sucked. Tinygrad now runs on MI300X. Whatever motivated George’s complaints a year ago no longer applies today.

If you feel ROCm sucks so badly, go the tinygrad route. Same if you don’t want to be tied to CUDA. Choice is a good thing. At the end of the day though, this isn’t a reflection on the hardware at all.
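
For a sense of what the tinygrad route looks like, here is a minimal sketch, assuming a recent tinygrad install on an AMD box (the device string it reports depends on your backend):

  # Minimal tinygrad smoke test; tinygrad picks the accelerator automatically.
  from tinygrad import Tensor, Device

  print(Device.DEFAULT)        # e.g. "AMD" on a ROCm-capable machine (backend-dependent)
  x = Tensor.rand(1024, 1024)  # random matrix on the default device
  y = (x @ x).relu()           # matmul + activation, compiled for that device
  print(y.numpy().sum())       # realizes the graph and copies the result to host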

I hope your business works out for you, and I am willing to believe that AMD has improved somewhat, but I do not believe AMD has improved enough to be worth people’s time when Nvidia is an option. I have heard too many nightmare stories, and it will take many people, including those who reported them, reporting improvements before I think otherwise. It is not just George Hotz who reported issues. Eric Hartford has been quiet lately, but one of the last comments he made on his blog was not very inspiring:

> Know that you are in for rough waters. And even when you arrive - There are lots of optimizations tailored for nVidia GPUs so, even though the hardware may be just as strong spec-wise, in my experience so far, it still may take 2-3 times as long to train on equivalent AMD hardware. (though if you are a super hacker maybe you can fix it!)

https://erichartford.com/from-zero-to-fineturning-with-axolo...

There has been no follow-up “it works great now”.

That said, as for the claim that you have a conflict of interest, let us consider what a conflict of interest is:

https://en.wikipedia.org/wiki/Conflict_of_interest

> A conflict of interest (COI) is a situation in which a person or organization is involved in multiple interests, financial or otherwise, and serving one interest could involve working against another.

You run a company whose business depends entirely on leasing AMD GPUs. Here, you want to say that AMD’s hardware is useful for that purpose and no longer has the deluge of problems others reported last year. If it has not improved, saying so could materially harm your business. That is, by definition, a conflict of interest.

That is quite a large conflict of interest, given that it involves your livelihood. You are incentivized to make things look better than they are, which affects your credibility when you say that things are fine after ample recent evidence that they have not been. In AMD’s case, poor driver quality is something they inherited from ATI, and the issues go back decades. While it is believable that AMD has improved their drivers, I find it difficult to believe, given that history, that they have improved them enough that things are fine now. Viewing your words as less credible because of this might be unfair, but plenty of people before you whose livelihoods depended on things working have outright lied about the fitness of products. They even lied when people’s lives were at risk:

https://hackaday.com/2015/10/26/killed-by-a-machine-the-ther...

You could be correct in everything you say, but I have good reason to be skeptical until others corroborate it. Blame my skepticism on all of the people in positions similar to yours who lied in the past. That said, I will keep my ears open for good news from others who use AMD hardware in this space, but I have low expectations given history.

  • Funny to see you quoting Eric; he’s a friend and was just running on one of our systems. AMD bought credits from us and donated compute time to him as part of the big internal changes they’re pushing. That kind of thing wouldn’t have happened a year ago. And from his experience, the software has come a long way. Things are moving so fast that you aren’t even keeping up, but I am the one driving it forward.

    https://news.ycombinator.com/item?id=44154174

    And sigh, here we are again with the conflict of interest comments, as if I don’t get it. As I said, you don’t know the full story, so let me spell it out. I’m not doing this for money, status, or fame. I’m fortunate enough that I don’t need a job; this isn’t about livelihood or personal gain.

    I’m doing this because I genuinely care about the future of this industry. I believe AI is as transformational as the early Internet. I’ve been online since 1991 (BBS before that), and I’ve seen how monopolies can strangle innovation. A world where one company controls all AI hardware and software is a terrible outcome. Imagine if Cisco made every router or Windows were the only OS. That’s where we’re headed with Nvidia, and I refuse to accept that.

    Look at my history and who my investor is; this isn’t some VC land grab. We truly care about decentralizing and democratizing compute. Our priority is getting this HPC compute, previously locked up in supercomputers, into the hands of as many developers as possible. My cofounder and I are lifelong nerds and developers, doing this because it matters.

    Right now, only two companies are truly competing in this space. You’ve fairly pointed out failures of Cerebras and Groq. AMD is the only one with a real shot at breaking the monopoly. They’re behind, yes. But they were behind in CPUs too, and look where that went. If AMD continues on the path they’re on now, they can absolutely become a viable alternative. Make no mistake, humanity needs an alternative and I'll do my best to make that a reality.

    • Ask Eric to consider writing a new blog post discussing the state of LLM training on AMD hardware. I would be very interested in reading what he has to say.

      AMD catching up in CPUs required that they become competent at hardware development. AMD catching up in the GPGPU space would require that they become competent at software development. They have a long history of incompetence when it comes to software development. Here are a number of things Nvidia has done right contrasted with what AMD has done wrong:

        * Nvidia aggressively hires talent. It is known for hiring freshly minted PhDs in areas relevant to it; I heard this firsthand from a CS professor specializing in compilers who had many former students working for Nvidia. AMD is not known for aggressive hiring. Thus, they have fewer software engineers to put on tasks.
      
        * Nvidia has a unified driver, which reduces duplication of effort, so its software engineers can focus on improving things. AMD maintains separate drivers for each platform. AMD tried partial unification with Vulkan, but it took too long to develop, so the Linux community wrote its own driver, and almost nobody uses AMD’s unified Vulkan driver on Linux. Instead of killing that effort and adopting the community driver for both Linux and Windows, AMD continued developing its driver, which is now mostly used only on Windows.
      
        * Nvidia has a unified architecture, which further deduplicates work. AMD split their architecture into RDNA and CDNA, and thus must implement the same things for each where the two overlap. They realized their mistake and are making UDNA, but the damage is done and they are behind because of their RDNA+CDNA misadventures. It will not be until 2026 that UDNA fixes this.
      
        * Nvidia proactively runs static analysis tools, such as Coverity, on its driver. This became public when Nvidia open sourced the kernel part of its Linux driver. I recall a Linux kernel developer who works on static analysis begging the amdgpu kernel driver developers to use static analysis tools on their driver, since many obvious issues caught by those tools were going unaddressed.
      

      There are big differences between how Nvidia and AMD do engineering that make AMD’s chances of catching up slim, and that is likely to remain the case until AMD starts engineering more like Nvidia. They are slowly moving in that direction, but so far, it has been too little, too late.

      By the way, AMD’s software development incompetence applies to the CPU side of their business too. They had numerous USB issues on the AM4 platform due to bugs in AGESA/UEFI. There were other glitches too, such as memory incompatibilities. End users generally had to put up with them, although AMD, in conjunction with some motherboard vendors, eventually managed to fix the issues. I had an AM4 machine that would not boot reliably with 128GB of RAM; after years of suffering, I finally replaced the motherboard with one of the last AM4 motherboards made. Then there was this incompetence, which even affected AM5:

      https://blog.desdelinux.net/en/Entrysign-a-vulnerability-aff...

      AMD needs to change a great deal before they have any hope of competing with Nvidia GPUs in HPC. The only thing going for them in HPC for GPUs is that they have relatively competent GPU hardware design. Everything else about their GPUs has been a disaster. I would not be surprised if Intel manages to become a major player in the GPU market before AMD manages to write good drivers. Intel, unlike AMD, has a history of competent software development. The major black mark on their record would be the initial Windows Arc drivers, but they were able to fix a remarkable number of issues in the time since their discrete GPU launch, and they have fairly good drivers on Windows now. Unlike AMD, they did not have a history of incompetence, so the idea that they fixed the vast majority of issues is not hard to believe. Intel will likely continue to have good drivers once they have made competitive hardware to pair with them, provided that they have not laid off their driver developers.

      I have more hope in Intel than I have in AMD, and I say that despite knowing how bad Intel is at doing anything other than CPUs. No matter how bad Intel is at branching into new areas, AMD is even worse at software development. On the bright side, Intel’s GPU IP has a dual role, since it is needed for their CPUs’ iGPUs, so Intel must do the one thing they almost never do when branching into new areas, which is to iterate. The cost of R&D is thus mostly covered by their iGPUs, and they can continue iterating on their discrete graphics until it is a real contender in the market. I hope they merge Gaudi into their GPU development effort, since iterating on Arc is the right way forward. I think Intel having an “AMD moment” in GPUs is less of a long shot than AMD’s recovery from the AM3 fiasco was, and less of a long shot than AMD becoming competent at driver development before Intel either becomes good at GPGPU or goes out of business.
