Comment by AnthonyMouse

21 hours ago

> The way to manufacture more efficient compute now is do things like put DRAM closer to the chip and even closer integration between CPU and GPU.

People have been hyping things like this for decades, but then it turns out that applications which need to frequently share data between a CPU and GPU faster than PCIe can handle are pretty uncommon. Meanwhile putting them closer together has some pretty significant real disadvantages, because then you're trying to deliver more power and dissipate more heat over a smaller area instead of putting more physical separation between the two largest loads in the machine.

Notice that high end PC GPUs are significantly faster than any of Apple's integrated GPUs, and that extra power and thermal headroom is why.

> There are also latency and bandwidth benefits how they setup their RAM just from pure physics.

Soldering RAM has a modest latency advantage over SODIMMs at the most extreme timings and CAMM turns even that into basically nothing.

> And chip manufacturing is moving towards chiplets where you have cores manufactured separately and then wired together at nanoscale level on top of a silicon interposer.

You're describing a move to less integration. They were originally on the same die, and the change has no real effect on modularity. The user doesn't even have to know that some Ryzen CPUs have a separate I/O die or more than one compute die; they all still fit into the same socket and are even interchangeable with the ones that have only a single die.

4 comments


LarsDu88 20 hours ago

- For high end AI inference chips, DRAM already goes onto the interposer right next to the GPU to bring the bandwidth as high as possible. Apple will eventually do this for the exact same reasons. It's not just soldering RAM to a PCB.
- The chiplet technique of putting everything on an interposer is less integrated from the perspective of the chip manufacturer, but for the consumer, the folks who are going to buy Framework laptops, this is a far more integrated package. CPU, GPU and RAM will sit on the same interposer and be purchased together as a unit with no upgrade or swap path for any component. This is not the same as simply soldering everything together on one PCB. The level of integration is far higher.

AnthonyMouse 20 hours ago

> For high end AI inference chips, DRAM already goes onto the interposer right next to the GPU to bring the bandwidth as high as possible.

The high end AI inference chips use HBM and cost tens of thousands of dollars. HBM uses 1024 data pins instead of 64, which is crazy expensive. To the extent that consumer devices get it at all, it would be in addition to rather than instead of ordinary DRAM, e.g. you might have 12GB of HBM on the CPU package but then 64GB of less expensive DRAM. Increasing the number of cache hierarchy levels is a long-term trend. HBM as L4 cache is pretty plausible for high end CPUs as a supplement rather than replacement for DRAM.

There are already servers that work like this, e.g. Xeon Max has 64GB of HBM but then further supports up to 4TB of DDR5.

Moreover, the AI inference hardware integrates the CPU into the GPU because it's really just a giant GPU. They're not getting some major advantage from that; they just know nobody is going to want to swap out the CPU on a system where the CPU is mostly irrelevant. If you wanted that level of inference performance on a normal PC which is used for other purposes where the CPU actually matters, then you would drop the AI accelerator with the HBM or GDDR into a PCIe slot.
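
A rough sketch of the arithmetic behind the pin-count point above: theoretical peak bandwidth is data pins times per-pin data rate, divided by 8 bits per byte. The 6.4 Gb/s per-pin rate is an illustrative assumption applied to both interfaces, not a figure from this thread, and the function name is made up for the example.

```python
def peak_bandwidth_gb_s(data_pins: int, gbit_per_s_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: pins * per-pin rate / 8 bits per byte."""
    return data_pins * gbit_per_s_per_pin / 8


# Assumption for illustration only: the same per-pin rate on both interfaces.
ASSUMED_GBIT_PER_PIN = 6.4

narrow = peak_bandwidth_gb_s(64, ASSUMED_GBIT_PER_PIN)    # 64 data pins
wide = peak_bandwidth_gb_s(1024, ASSUMED_GBIT_PER_PIN)    # 1024 data pins

print(f"64 pins:   {narrow:.1f} GB/s")
print(f"1024 pins: {wide:.1f} GB/s")
print(f"ratio:     {wide / narrow:.0f}x")
```

At an equal per-pin rate the gap is simply the pin ratio, 16x, which is why the wide interface needs an interposer rather than ordinary PCB traces and why it costs so much more.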
- LarsDu88 19 hours ago
  
  I think the long term trend is typically that the high end technology of today will be the mid to low tier technology of the future.
  If connecting 1024 data pins through a nanoscale manufactured silicon interposer seems complicated and expensive right now, that doesn't mean we won't see it in tomorrow's consumer devices. If anything we will be MORE likely to see this one day. Apple and other companies are gradually working towards moving AI models to be more local, which means memory bandwidth has a real killer app use case right now. Witness Liquid AI and their partnership with Mercedes-Benz to put 8B-parameter LLMs into vehicles.
  Both desktop PCs and the CPU are becoming less and less relevant as we move further into the decade, to be honest...
  