Comment by djsjajah
20 hours ago
GPUs might not be bandwidth starved most of the time, but they absolutely are when generating text from an LLM. It’s the whole reason NVIDIA is pushing low-precision floating-point formats.
That's memory bandwidth, not I/O. Unless your LLM doesn't fit into VRAM.
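A rough back-of-the-envelope sketch of why that is: during single-stream decoding, every weight has to be read from VRAM once per generated token, so memory bandwidth divided by model size gives a ceiling on tokens per second. The function name and the numbers below (7B parameters, 1000 GB/s) are illustrative assumptions, not measurements of any particular GPU, and the estimate ignores KV cache traffic and batching.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed if every weight is read once per token.

    Hypothetical helper for illustration; ignores KV cache reads,
    batching, and compute time.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes


# Assumed example: a 7B-parameter model on a GPU with 1000 GB/s of memory bandwidth.
fp16 = max_tokens_per_second(7, 2, 1000)  # 16-bit weights, 2 bytes each
fp8 = max_tokens_per_second(7, 1, 1000)   # 8-bit weights, 1 byte each

print(f"FP16 ceiling: {fp16:.1f} tokens/s")
print(f"FP8 ceiling:  {fp8:.1f} tokens/s")
```

Halving the bytes per weight doubles the ceiling, which is the appeal of lower precision when memory bandwidth is the limit.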