Comment by bigyabai

16 hours ago

A lot of the TDP is reserved for running the shader units at full power. My RTX 3070 Ti only pulls ~110W of its 320W running CUDA inference on Gemma 26b and E4B.

8 comments

Scaevolus 16 hours ago

It's not that it's reserving power, but rather that you hit some bottleneck on a 3070 Ti before running into thermal limits. It's likely limited by either tensor core saturation or RAM throughput. Running the workload with Nvidia's profiling tools should make the bottleneck obvious.
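For a quick first look before reaching for a full profiler, here is a rough sketch that samples power draw and utilization through `nvidia-smi`. It is not a substitute for the profiling tools mentioned above, and the helper names (`parse_sample`, `read_sample`) are made up for illustration. Note that `utilization.memory` reports the share of time the memory controller was busy, which is only a coarse hint about bandwidth pressure.

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY_FIELDS = ["power.draw", "power.limit", "utilization.gpu", "utilization.memory"]


def parse_sample(line):
    """Parse one CSV line (noheader, nounits) into a dict of floats.

    Fields the driver cannot report (e.g. "[N/A]") become None.
    """
    sample = {}
    for name, raw in zip(QUERY_FIELDS, line.split(",")):
        try:
            sample[name] = float(raw.strip())
        except ValueError:
            sample[name] = None
    return sample


def read_sample(gpu_index=0):
    """Run nvidia-smi once and return a parsed sample for one GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.strip().splitlines()[0])
```

Calling `read_sample()` in a loop while generating tokens gives a rough picture: power draw far below the limit suggests the card is waiting on something other than raw compute, though only a real profile can confirm what.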

lambda 15 hours ago
Generally the bottleneck is RAM throughput. Inference, and in particular token generation for a single user, is not all that computationally complex: you're doing some fairly simple calculations for each parameter, so the time is dominated by just transferring each parameter from RAM to the cores. A 31B dense model like Gemma 4 has to transfer 31B parameters from RAM to the cores for every generated token (at 16 bits per parameter for the full model, though on consumer hardware people generally run 4-8 bit quantizations). That's a lot of memory transfer.
Prompt processing or parallel token generation can do a bit more work per memory transfer, as you can use the same weights for a few different calculations in parallel. Even so, memory bandwidth is a huge factor.
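To put rough numbers on that, here is a back-of-the-envelope sketch of the bandwidth ceiling on single-stream token generation. It assumes every parameter is read once per generated token and ignores KV cache traffic, compute time, and any overlap. The function name and the 600 GB/s figure are illustrative assumptions, not measurements of any particular card.

```python
def max_tokens_per_sec(params_billion, bits_per_param, bandwidth_gb_per_s):
    """Upper bound on tokens/s if weight reads were the only cost.

    Assumes a dense model whose weights are all streamed from memory
    once per generated token.
    """
    bytes_per_token = params_billion * 1e9 * bits_per_param / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token


# Hypothetical memory bandwidth, chosen only for illustration.
ASSUMED_BANDWIDTH_GB_PER_S = 600

for bits in (16, 8, 4):
    ceiling = max_tokens_per_sec(31, bits, ASSUMED_BANDWIDTH_GB_PER_S)
    print(f"31B params at {bits}-bit: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, halving the bits per parameter doubles the ceiling, which is one reason quantized models generate faster on the same hardware.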

ycui7 6 hours ago

B70 idles at 30W, while RTX PRO 4500 idles at 9W (measured to be 5W at wall).

B70 runs at 1/3 the token output rate of the RTX PRO 4500 and consumes 3x the idle power when doing nothing.

culopatin 8 hours ago

My 4070 Super and 5070 Super both max out their TDP when I use them with ollama. Is your usage different?

gambiting 14 hours ago

My 5090 runs at full TDP (pretty much exactly 575W) when running inference through LM Studio.

rao-v 11 hours ago
Cap the power to 400W; you won't see much impact.

gardnr 10 hours ago

Same throughput with much less heat. Not sure what that extra 175W is going towards, but it's diminishing returns.