Comment by mattnewton

20 hours ago

People are trying, especially for inference. For training, I think the risk of tanking a run is just too high.

TPUs are at least dogfooded by Google DeepMind; AFAIK no team has gotten the AMD stack to train well.

Interesting. Why? My current mental model is that AMD chips are just a bit behind, so less efficient, but no biggie. Do labs even use CUDA?

  • This is somewhat out of date (Dec 2024), but gives you some idea of how far behind AMD was then: https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200...

    Pull quotes:

    > AMD’s software experience is riddled with bugs rendering out of the box training with AMD impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.

    [snip]

    > The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60 command Dockerfile that builds dependencies from source, hand crafted by an AMD principal engineer, was specifically provided for us

    [snip]

    > AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required by the end user.

    etc etc. The whole thing is worth reading.

    I'm sure it has improved since then (and will continue to). I hear good things about the Lemonade team (although I think that is mostly inference?)

    But the NVidia stack has improved too.
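
    The hipBLASLt complaint in the pull quote above is worth unpacking. Here is a hedged, toy sketch of the failure mode (none of this is AMD's actual code; the algorithm names and cost numbers are made up): a shape-based heuristic guesses a GEMM algorithm, while autotuning evaluates every candidate for the exact shape at hand.

```python
# Toy illustration (hypothetical, not AMD's actual code) of the hipBLASLt
# complaint: a heuristic guesses a GEMM algorithm from the shape, while
# autotuning evaluates every candidate for the exact shape at hand.

# Made-up per-algorithm cost models: estimated microseconds for an (m, n, k) GEMM.
ALGOS = {
    "split_k":  lambda m, n, k: 5 + 0.001 * m * n + 0.0001 * k,   # amortizes deep k
    "tile_128": lambda m, n, k: 2 + 0.002 * m * n * k / 1e3,      # good on small GEMMs
    "tile_256": lambda m, n, k: 20 + 0.0005 * m * n * k / 1e3,    # good on huge GEMMs
}

def heuristic_pick(m, n, k):
    """Shape-based guess, analogous to a library's built-in heuristic model."""
    return "tile_256" if m * n >= 1 << 20 else "tile_128"

def autotune_pick(m, n, k):
    """Try (here: evaluate the cost model of) every candidate and keep the best."""
    return min(ALGOS, key=lambda name: ALGOS[name](m, n, k))

# A skinny-but-deep GEMM: m*n is below the heuristic's threshold, but k is large.
shape = (512, 512, 8192)
print("heuristic:", heuristic_pick(*shape))  # picks tile_128
print("autotune: ", autotune_pick(*shape))   # picks split_k, the winner under these costs
```

    The fix the SemiAnalysis authors describe is exactly the second path: benchmark per shape offline and cache the winner, rather than trusting the heuristic.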

    • That’s insane. There should be a big team of people at AMD whose whole job is just to dogfood their stuff for training like this. Speaking of which, Amazon is in the same boat; I’m constantly surprised that Amazon is not treating improving Inferentia/Trainium software as an uber-priority. (I work at Amazon)

      6 replies →

    • Anecdotal, but over several years with an AMD GPU in my desktop, I've tried multiple times to do real AI work and given up on the AMD stack every time.

      1 reply →

    • Yet another reason to doubt claims that "software is solved".

      Anthropic did retire an interview take-home assignment involving optimising inference on exotic hardware, because Claude could one-shot a solution; but that was clearly a whiteboard hypothetical rather than a real system with warts, issues, and nuance.

  • This is what I've heard on the "street". Building a CUDA-compatible stack for AMD's hardware requires highly-paid SWEs. It's a very niche field, and talent is hard to come by.

    But AMD does not want to pay these specialized SWEs the market rate. Their existing SWEs would be up in arms saying, basically, "what are we, chopped liver??", or so the thinking goes.

    So AMD is stuck with a shitty software stack which cannot compete with CUDA.

    If I were making such decisions, I would cut the existing SWE headcount by 50% and double the pay for those remaining. And then go out and hire some top talent to build a good software stack.

  • i'm doing inference on a free mi300x instance from AMD right now. not sure if the software stack is just old or what, but here's what i've observed: it's stuck on an old version of vllm that predates Transformers 5 support. it lacks MoE support for qwen3 models. oss-120b is faaaar slower than it should be.

    int8 quantization seems close to supported, but not quite: speeds drop to a fraction of full-precision speed and the server intermittently seems to hang. int4 quantization is not supported. fp8 quantization is not supported.

    again, maybe AMD is just being lazy with what they've provided, but it's not a great look.

    right now the fastest smart model i can run is full-precision qwen3-32b. with 120 parallel requests (short context) i'm getting prompt processing (PP) @ 4500 tokens/sec and token generation (TG) @ 1300 tokens/sec.
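
    For scale, a quick back-of-envelope on those figures (just arithmetic on the numbers above): aggregate generation throughput split across the parallel streams gives the decode speed each individual request sees.

```python
# Back-of-envelope from the figures above: aggregate token-generation (TG)
# throughput divided across the parallel requests gives per-stream speed.
parallel_requests = 120
prefill_tok_s = 4500  # PP: prompt processing, aggregate
decode_tok_s = 1300   # TG: token generation, aggregate

per_stream = decode_tok_s / parallel_requests
print(f"~{per_stream:.1f} tokens/sec per request")  # ~10.8
```

    So each request decodes at roughly 10-11 tokens/sec, which is usable for batch workloads but slow for interactive chat.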

  • > Do labs even use CUDA?

    From the papers I've read and the labs I've worked in personally, I'd say most scientists developing deep learning solutions use CUDA for GPU acceleration.

  • I don’t know which is the chicken and which is the egg here. But ROCm support is often missing or experimental even in very basic foundational libraries. They need someone else to double down on using their chips and break the software support out of limbo.

  • AMD GPUs are competitive on compute, but they lack the interconnect; NVLink performance is a huge deal for training.

  • What I hear is that getting your network to work on AMD is a huge pain.

    • Yeah, historically it’s been software that’s limited AMD here. Not surprised to hear that may still be the issue. NVidia’s biggest edge was really CUDA.

      2 replies →