Comment by jmward01
2 days ago
The point is not how fast it is now. The point is that this opens new possibilities that can be built on. Potentially models get trained with slightly different architectures to optimize for this use case. Possibly others come along and improve this path. Possibly HW manufacturers make a few small adjustments that remove bottlenecks. Who knows, the next person may combine CPU compute with this memory sharing to get another token a second. Then the next person does predictive loading into memory to keep that bandwidth 100% maxed and usable. Then the next builds on that, and the next. Before you know it there is a real thing there that never existed.
This is a great project. I love the possibilities it hints at. Thanks for building it!
It’s architecturally not a good approach. System RAM is much slower, so you should put data that doesn’t need to be accessed often in it. That knowledge lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM-to-system-memory transfers with boring, normal code.
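Roughly what "boring normal code" looks like here, as a minimal sketch (the buffer names and the hot/cold split are made up for illustration): hot weights stay resident in VRAM, cold weights sit in pinned system RAM, and the application copies them in explicitly on a side stream when it knows they are about to be needed.

    #include <cuda_runtime.h>

    float *hot_w;        // resident in VRAM, accessed constantly
    float *cold_w_host;  // pinned system RAM, accessed rarely
    float *landing;      // VRAM buffer that cold data is staged into
    cudaStream_t copy_stream;

    void init(size_t hot_n, size_t cold_n) {
        cudaMalloc(&hot_w, hot_n * sizeof(float));
        cudaMallocHost(&cold_w_host, cold_n * sizeof(float));  // pinned, so async copies work
        cudaMalloc(&landing, cold_n * sizeof(float));
        cudaStreamCreate(&copy_stream);
    }

    // Only the application knows which data is needed often, so only
    // the application can schedule this copy at the right time.
    void stage_cold(size_t off, size_t n) {
        cudaMemcpyAsync(landing + off, cold_w_host + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, copy_stream);
    }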
Not true for unified systems. And on Strix Halo you need to dedicate a fixed amount of memory to the GPU, which is annoying.
You’re basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM, which is faster than the rest.
On some workloads, swapping is a bad idea.
The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.
This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.
In the long term, compute is probably going to move towards the memory.
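To make the anti-locality concrete, here is a schematic of the access pattern (a stand-in loop, not real model code):

    // Every decode step streams all layers' weights once, in order,
    // so the weights just read are the last thing needed again.
    void decode(const float *const *weights, int n_layers,
                size_t layer_size, float *act, int n_tokens) {
        for (int t = 0; t < n_tokens; ++t)
            for (int l = 0; l < n_layers; ++l)
                // weights[l] will not be touched again until the next
                // token, after every other layer has been swept through
                for (size_t i = 0; i < layer_size; ++i)
                    *act += weights[l][i];  // stand-in for the real matmul
    }

An LRU-style cache or swap in front of this pattern evicts, on every step, exactly the layer that is coming up next in the cycle, which is why the old approaches fail.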
> You’re basically stating that swapping is also a bad idea.
Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.
Strix Halo’s unified setup is pretty cool. On systems with 128 GB of memory, set the dedicated GPU memory in the BIOS to the smallest permitted value, and the drivers will use the whole main memory pool appropriately on both Linux and Windows.
It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.
It's pretty weird to insist on a counterargument that has no implications or consequences for the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL, so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have a coherent cache for system RAM when they don't is pretty suspect.
Some people are not concerned with having it run the fastest; just having it run at all may be enough.
From my experience, accessing system RAM from the GPU is so slow it might as well count as "does not work". It's orders of magnitude faster to memcpy large swaths of memory that you are going to use to the GPU, rather than accessing system mem from a kernel, which then takes ages waiting for that small block/page of memory, then waits again for the next small page/block, etc. Latency hiding doesn't work anymore if the latency is that large.
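As a sketch of why (sizes and function names assumed, nothing from a real project): mapped "zero-copy" memory makes every kernel access a PCIe round trip, while the up-front memcpy pays the bus cost once at full transfer bandwidth, after which all reads hit VRAM.

    #include <cuda_runtime.h>

    // Slow path: kernels dereference d_view, and each access stalls
    // on a bus round trip that is too long to hide.
    float *map_zero_copy(size_t n) {
        float *h, *d_view;
        cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_view, h, 0);
        return d_view;
    }

    // Fast path: one large transfer up front, then every read runs
    // at VRAM speed.
    float *copy_up_front(const float *h, size_t n) {
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        return d;
    }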
But then you can use CPU/RAM offload, which already allows you to offload without a kernel module.
> It’s architecturally not a good approach.
Yes, with current LLMs, current hardware, and current supporting software, this is a true statement. My point wasn't that this approach suddenly changes that; it was that it makes it easier to explore alternatives that might change it. Let's imagine some possibilities:
- Models that use a lot of weight reuse: if you strategically reuse layers 3-4x, that could give a lot of time for async loading of future weights (see the sketch after this list).
- Models that select experts for several layers at a time: same thing; while crunching on the current layer you have teed-up future layers that can be transferring in.
- HW makers start improving memory bandwidth: this is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, though still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine that could help; things like Intel's Optane come to mind. Start making mass storage as fast as system memory is now and the equation may change.
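For the first idea, a quick sketch of the double-buffering it would enable. Everything here is hypothetical: run_layer stands in for a kernel launcher, and host_layers is assumed to be pinned (cudaMallocHost) so the async copy actually overlaps with compute.

    #include <cuda_runtime.h>

    void run_layer(const float *w, cudaStream_t s);  // assumed kernel launcher

    void pipeline(float *const *host_layers, int n_layers, size_t layer_bytes) {
        // Two VRAM buffers: compute from one while filling the other.
        float *buf[2];
        cudaMalloc(&buf[0], layer_bytes);
        cudaMalloc(&buf[1], layer_bytes);
        cudaStream_t compute, copy;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);

        cudaMemcpy(buf[0], host_layers[0], layer_bytes, cudaMemcpyHostToDevice);
        for (int i = 0; i < n_layers; ++i) {
            if (i + 1 < n_layers)  // tee up layer i+1 while layer i runs
                cudaMemcpyAsync(buf[(i + 1) % 2], host_layers[i + 1],
                                layer_bytes, cudaMemcpyHostToDevice, copy);
            run_layer(buf[i % 2], compute);  // reusing this layer 3-4x
                                             // widens the prefetch window
            cudaStreamSynchronize(compute);
            cudaStreamSynchronize(copy);  // next layer's weights must have landed
        }
    }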
These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appear dead-end until the one change that makes them viable, at which point they take over. It may not happen. It may be a dead end. But by that logic we would never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy to try out ideas, to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than that it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case for GPUs that nobody realized existed and that has nothing to do with LLMs.