Comment by seunosewa

1 day ago

"How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly."

What's he talking about? It's trivial to calculate that.

Isn't the ability to run it more dependent on (V)RAM, with TOPS just dictating the speed at which it runs?
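
To the "trivial to calculate" point, here's a rough back-of-envelope sketch; the model size, quantization, and target speed are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope sketch (all numbers are illustrative assumptions, not
# measurements): memory and compute needed to run a dense transformer.

params = 7e9            # assumed model size: 7B parameters
bytes_per_weight = 0.5  # assumed 4-bit quantization (0.5 bytes/weight)
target_tok_per_s = 20   # assumed desired generation speed

# Memory: roughly the weights, plus some headroom for KV cache and activations.
weights_gb = params * bytes_per_weight / 1e9
print(f"weights: ~{weights_gb:.1f} GB (plus KV cache/activations)")

# Compute: a dense forward pass costs about 2 FLOPs per parameter per token.
flops_per_token = 2 * params
tops_needed = flops_per_token * target_tok_per_s / 1e12
print(f"compute for {target_tok_per_s} tok/s: ~{tops_needed:.2f} TOPS")
```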

  • Strictly speaking, you don't need that much VRAM or even plain old RAM: just enough to hold your context and model activations, with the weights streamed from disk. It's just that as you run with less and less (V)RAM you start to bottleneck on things like SSD transfer bandwidth, and inference speed slows to a crawl. Even that may or may not be an issue depending on your exact requirements: perhaps you don't need the answer instantly and can let it compute in the background, or maybe you're running on the latest PCIe 5.0 storage, whose bandwidth is roughly comparable to DDR3/DDR4 memory.

  • A good rule of thumb is that PP (prompt processing) is compute bound, while TG (token generation) is (V)RAM bandwidth bound, since each generated token has to read every weight. The sketch after this list puts rough numbers on both.
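
A minimal sketch of that rule of thumb under stated assumptions; the bandwidths, model size, and TOPS figure are ballpark guesses, not benchmarks:

```python
# Rough sketch (bandwidths and TOPS are ballpark assumptions): TG reads every
# weight roughly once per token, so its ceiling is about bandwidth / model size;
# PP is compute bound, roughly TOPS / (2 * params) tokens per second.

model_gb = 3.5   # assumed 7B model at 4-bit quantization
tops = 50        # assumed accelerator throughput, trillions of ops/sec
params = 7e9

tiers_gb_per_s = {
    "PCIe 5.0 x4 NVMe": 14,    # streaming weights from SSD
    "DDR4 dual channel": 50,   # CPU RAM
    "GDDR6 VRAM": 900,         # discrete GPU
}

for name, bw in tiers_gb_per_s.items():
    tg = bw / model_gb                   # bandwidth-bound tokens/sec
    print(f"{name:18s} TG ceiling ~{tg:6.1f} tok/s")

pp = tops * 1e12 / (2 * params)          # compute-bound prompt tokens/sec
print(f"PP ceiling at {tops} TOPS: ~{pp:.0f} tok/s")
```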

It’s trivial to ask an AI to answer that. Well, I guess we know it’s not an AI-generated article!

> state-of-the-art models

> hundreds of millions of parameters

lol

lmao, even