Comment by 0manrho

3 days ago

A big part of it is people pulling their heads out of their asses as to how to actually deploy these systems at scale (AKA, to do this effectively you need to do more than just throw pallets of GPUs at it, such as properly considering the topologies of both NVMe-over-Fabrics and PCIe roots/lanes [0]), combined with advances in various technologies (e.g. RDMA, CXL, cuDF/BaM/GPUD2S, etc.) that meaningfully enhance how system RAM can be integrated and leveraged.

Also, we're hitting the point roughly five years after DDR5 became readily available, which means a lot of existing enterprise hardware that was on DDR4 is going EOL and being replaced with DDR5. Given that many platforms these days have more memory channels than before, that results in more DRAM being bought per node, and in total, than previously. A lot of enterprise was still buying new DDR4 into 2023, since it was a more affordable way to deploy systems with lots of PCIe lanes, which mattered more than the costs associated with the performance gain from DDR5 or the related CPUs. (Also, early-days DDR5 wasn't really any faster than DDR4, given how loose the timings were, unless you were willing to pay a BIG premium.)

Regarding the hype of the day, AI specifically: part of it is the rise of wrappers, agents, and inference in general that can run on CPUs and leverage system RAM. These use cases aren't as sensitive to latency as the training side of things: the network latency from the remote user to the datacenter means the latency hit from crossing the CPU interconnect (ring bus, Infinity Fabric, QPI, whatever you want to call it) is a much smaller share of the overall overhead. The cost/benefit/availability concerns there have also increased the demand for non-GPU AI compute and RAM.

I wouldn't rule out corruption/price fixing (they've done it before), but I have no evidence of it. It wouldn't surprise me, but I don't think this is it (unless the problem persists for several quarters/years).

There's some geopolitics, FOMO (corporate keeping up with the Joneses), and economics that go into this as well, but I can't really speculate on that specifically; it's not my area of expertise. Suffice to say, it's kind of like a bank run: it's not so much that demand itself hit the curve of the hockey stick, but that it was gradually increasing until it crossed a threshold where it started causing delays in deliveries/deployments. Given how important many companies consider being on the cutting edge here, this led to a sudden spike in volume customers willing to pay premiums for early delivery to hit deployment deadlines, artificially inflating demand and further constraining supply, which fed back into that loop and pushed transient demand even higher.
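The threshold-then-feedback dynamic described above can be sketched as a toy simulation. Every number here is invented purely for illustration (arbitrary capacity units, 5% baseline growth, panic orders proportional to backlog); the point is only the shape of the curve, not the magnitudes.

```python
# Toy model of the "bank run" feedback loop: demand grows gently until it
# crosses supply capacity, then delivery backlogs trigger panic over-ordering,
# which inflates demand further. All numbers are invented for illustration.
capacity = 100.0
demand = 80.0
history = []
for quarter in range(12):
    backlog = max(0.0, demand - capacity)  # unfilled orders -> delivery delays
    panic = 0.5 * backlog                  # delays trigger pull-forward over-ordering
    demand = demand * 1.05 + panic         # steady baseline growth plus panic orders
    history.append(round(demand, 1))
print(history)
```

Running it shows years of near-linear growth followed by a sharp hockey-stick once the capacity threshold is crossed, even though the underlying "real" growth rate never changed.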

0: Yes, NVMe NAND flash is different from DRAM, but the systems/clusters that host the NVMe JBODs tend to use lots of system RAM for their index/metadata/"superhot" data layer (think memcached, Redis, the MDS nodes for Lustre, etc.), and with the advent of CXL and SCM you can deploy even more DRAM to a cluster/fabric than what is strictly presented by the CPU/motherboard's memory controllers/channels. This isn't driving overall market volume, but it is a source of fierce competition for supply at the very "top" of the DRAM/flash market.

TL;DR: Convergence of a lot of things driving demand.

>People pulling their heads out of their ass as to how to actually deploy these systems at scale (AKA to do this effectively, you need to do more than just throw pallets of GPU's at it)

Yeah, something people don't understand is that the models have become so big it can take minutes to load them from SSDs. If you need to restart a CUDA process for whatever reason, you'd rather load the model files from RAM. That means for every GB of VRAM, you also need a GB of system RAM. Then there are things like prefix caching and multi-user KV caching. Users generally don't fire off all their requests one after another in a short window, so you're better off freeing up VRAM after a minute has passed; if the user sends another request, using system RAM as a cache is still more energy-, time-, and VRAM-efficient than recalculating. DDR5-based DRAM is incredibly cheap.
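The reload-time gap can be put in rough numbers. These figures are back-of-the-envelope assumptions, not measurements: a hypothetical 70B-parameter fp16 model, a mid-range NVMe SSD, and a practical PCIe 4.0 x16 copy rate.

```python
# Back-of-the-envelope model reload times; all figures are rough
# illustrative assumptions, not benchmarks.
model_gb = 140.0    # e.g. a 70B-parameter model in fp16 (2 bytes per parameter)
ssd_gbps = 2.0      # sustained sequential read of a mid-range NVMe SSD
pcie_gbps = 25.0    # practical host-RAM-to-GPU copy rate over PCIe 4.0 x16

load_from_ssd_s = model_gb / ssd_gbps
load_from_ram_s = model_gb / pcie_gbps
print(f"from SSD: {load_from_ssd_s:.0f}s, from RAM cache: {load_from_ram_s:.1f}s")
```

Even with generous SSD numbers, a warm copy of the weights in page cache or pinned host memory restarts the process an order of magnitude faster, which is exactly why VRAM capacity tends to pull system RAM capacity along with it.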

>Regarding the hype of the day: AI specifically, part of it is the rise of wrappers and agents and inference in general that can run on CPU's/leverage system ram.

It has more to do with the dominance of mixture-of-experts models. Due to expert sparsity, the required memory bandwidth drops quite significantly. It's possible to run gpt-oss-20b on a computer with 32GB of RAM; a segment that used to be reserved for enthusiasts and developers has now become the mainstream amount of RAM on desktop PCs and mini PCs.
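The bandwidth argument can be made concrete with rough arithmetic. The parameter counts below are the publicly reported figures for gpt-oss-20b (~21B total, ~3.6B active per token); the quantization and memory-bandwidth numbers are my own assumptions for a typical dual-channel DDR5 desktop.

```python
# Rough sketch of why expert sparsity makes CPU inference viable.
# Parameter counts are the reported gpt-oss-20b figures; bytes/param
# and bandwidth are assumed illustrative values.
total_params_b  = 21.0   # total parameters (billions)
active_params_b = 3.6    # parameters active per generated token (billions)
bytes_per_param = 0.5    # ~4-bit quantization
mem_bw_gbps     = 80.0   # assumed dual-channel DDR5 bandwidth, GB/s

# Each generated token must stream the active weights from memory once,
# so bandwidth / bytes-per-token bounds the decode rate.
moe_tok_s   = mem_bw_gbps / (active_params_b * bytes_per_param)
dense_tok_s = mem_bw_gbps / (total_params_b * bytes_per_param)
print(f"MoE: ~{moe_tok_s:.0f} tok/s vs dense: ~{dense_tok_s:.0f} tok/s")
```

Under these assumptions the sparse model decodes several times faster than an equally sized dense model on the same memory system, which is the difference between "usable on a desktop" and "not".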

Yeah, so if I had to summarize: the problem is that DDR4 is EOL, shifting demand to DDR5. AI demand means people want more than 16GiB of RAM (there's actually a flood of used 2x8GB kits in the laptop DDR5 market). DRAM manufacturers switched to supplying AI data centers and stopped resupplying retailers. Retailers are running out of inventory, leading to a sharp rise in prices. Early production of DDR6 will start in 2026, with consumer availability in 2027, so there is zero incentive to expand DDR5 production.

  • > DDR4 is EOL

    11 years after the specification was released.

    For five and a half years now, DDR4 has made it possible to build a 128GB AMD PC using 4x 32GB DDR4-3200 1.2V JEDEC modules. Only in the last half year has it finally become possible to build an AMD PC with 128GB of DDR5-5600, because the Ryzen DDR5 controller cannot run two sticks sharing the same channel at JEDEC speed/voltage.

    DDR4 just can't be EOL already, because of unbuffered ECC! Even today it's not possible to get 128GB from two DDR5 sticks on AMD; only 96GB.

    And actually, I'd expect 256GB (DDR5-8800 1.1V JEDEC) to be possible with DDR5. Five(!) years since the specification was published, and we've only gone from 4000 to 5600. What if AMD achieves this only a few months before DDR5 goes EOL?..

    > DDR6 will be out in 2026 with consumer availability in 2027

    ... or not at all?..