Comment by blindriver
2 months ago
I thought GPUs with a lot of extremely fast memory were required for inference. Are you saying that we can accomplish inference with just a large amount of system memory that is non-unified and no GPU? How is that possible?
Basically it comes down to server CPUs having decent memory bandwidth. A bit of an oversimplification, but: the model weights and the context have to be pulled through RAM (or VRAM) every time a new token is generated, so token generation is largely bandwidth-bound.

Server CPUs designed for lots of cores have decent bandwidth - roughly 460GB/s of theoretical peak with the EPYC 9004 series, which reads from 12 DDR5 channels simultaneously. So, in theory, they can pull around 460GB through the system every second.

GPUs are faster, but you also have to fit the entire model and context into VRAM, so for larger models they get extremely expensive: a decent consumer GPU only has 24GB of VRAM and costs silly money if you need 20 of them. Whereas you can get a lot of RDIMM RAM for a couple thousand bucks, so you can run bigger models, and that kind of bandwidth gives output faster than most people can read.
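Back-of-the-envelope version of that in Python, if it helps. The numbers below are illustrative assumptions (12-channel DDR5 box at ~460GB/s, a 70B model quantized to roughly 0.5 bytes/param), not benchmarks:

```python
# Rough ceiling on tokens/sec for memory-bandwidth-bound decoding:
# every generated token has to stream the (active) model weights plus
# the KV cache through memory at least once.

def max_tokens_per_sec(bandwidth_gb_s: float,
                       active_params_billions: float,
                       bytes_per_param: float,
                       kv_cache_gb: float = 0.0) -> float:
    """Theoretical upper bound: bandwidth divided by bytes moved per token."""
    bytes_per_token_gb = active_params_billions * bytes_per_param + kv_cache_gb
    return bandwidth_gb_s / bytes_per_token_gb

# Assumed example numbers, not measurements:
#  - 12-channel DDR5 server:  ~460 GB/s theoretical peak
#  - 70B dense model, 4-bit:  ~0.5 bytes/param -> ~35 GB of weights
print(max_tokens_per_sec(460, 70, 0.5))    # ~13 tokens/s ceiling on the CPU box
print(max_tokens_per_sec(1000, 70, 0.5))   # ~28 tokens/s on a ~1 TB/s GPU
```

Real numbers come in well under these ceilings, but the ratio is the point: it scales with memory bandwidth, not compute.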
I’m confused as to why you think a GPU is necessary. It’s just linear algebra.
Most likely he was referring to the fact that you need plenty of GPU-fast memory to hold the model, and GPU cards have it.
There is nothing magical about GPU memory though. It’s just faster. But people have been doing CPU inference since the first llama code came out.