Comment by spps11

2 days ago

Thanks for sharing, enjoyed reading it!

I have a slightly tangential question: Do you have any insights into what exactly DeepSeek did by bypassing CUDA that made their run more efficient?

I always found it surprising that a core library like CUDA, developed over such a long time, still had room for improvement, especially to the extent that a seemingly new team of developers could bridge the gap on their own.

They didn’t. They used PTX, which is what CUDA C++ compiles down to and which is itself part of the CUDA toolchain. All major players have needed to do this, because the intrinsics for the latest accelerators are not actually exposed in the C++ API, which means using them requires inline PTX at the very minimum.
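
To make "inline PTX at the very minimum" concrete, here's a minimal sketch (my own example, nothing to do with DeepSeek's code): the hardware SM id lives in the special register `%smid`, which has no CUDA C++ intrinsic, so reading it takes a one-line `asm` statement inside an otherwise ordinary kernel:

```cuda
// Hypothetical illustration: reading %smid requires inline PTX,
// since CUDA C++ exposes no intrinsic for it.
__global__ void which_sm(unsigned int *out)
{
    unsigned int sm;
    // mov.u32 copies the special register %smid into a normal register.
    asm volatile("mov.u32 %0, %%smid;" : "=r"(sm));
    out[threadIdx.x] = sm;
}
```

You compile this with nvcc like any other kernel; the inline PTX is just spliced into the generated PTX, so it's still "using CUDA" in every meaningful sense.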

They basically ditched CUDA and went straight to writing in PTX, which is like GPU assembly, which let them repurpose some cores for communication to squeeze out extra performance. I believe that with better AI models and tools like Cursor, we will move to a world where you can mold code ever more specifically to your use case to make it more performant.

  • Are you sure they ditched CUDA? I keep hearing this, but it seems odd, because entirely ditching it would be a ton of extra work vs selectively employing some PTX in CUDA kernels, which is fairly straightforward.

    Their paper [1] only mentions using PTX in a few areas to optimize data transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs the H100 is reduced NVLink bandwidth, which would necessitate doing stuff like this that may not be common for others who have access to H100s.

    1. https://arxiv.org/abs/2412.19437
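
    For a sense of what that looks like, the kind of knob that's only reachable via PTX here is a cache-policy hint on memory operations. A hedged sketch (my own example, not from the paper):

    ```cuda
    // Hypothetical sketch: ld.global.cs ("cache streaming") marks the
    // loaded data as evict-first in L2, so one-shot transfer traffic
    // doesn't flush working-set data out of the cache.
    __global__ void copy_streaming(const float *__restrict__ src,
                                   float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v;
            asm volatile("ld.global.cs.f32 %0, [%1];"
                         : "=f"(v) : "l"(src + i));
            // Plain store; the point is the cache hint on the load.
            dst[i] = v;
        }
    }
    ```

    The rest of the kernel is completely normal CUDA C++, which supports your point: this is selective PTX inside CUDA, not a replacement for it.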

    • I should have been more precise, sorry. Didn't want to imply they entirely ditched CUDA but basically circumvented it in a few areas like you said.

  • Targeting PTX directly is perfectly regular CUDA, and it's used by many toolchains that target the ecosystem.

    CUDA is not only C++, as many mistakenly assume.

  • got it, thanks for explaining.

    > with better AI models and tools like Cursor, we will move to a world where you can mold code ever more specific to your use case to make it more performant

    what do you think the value of having the right abstraction will be in such a world?

    • I think that, at least for us dumb humans with limited memory, having good abstractions makes things much easier to understand

      5 replies →