Comment by p1esk

1 day ago

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35 GHz).
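
The arithmetic in parentheses can be reproduced with a minimal sketch. The function name `peak_tflops` is made up for illustration, and the inputs (192 cores, 64 FP32 FLOP/cycle/core, 3.35 GHz) are taken from the comment as given, not independently verified.

```python
def peak_tflops(cores: int, flop_per_cycle_per_core: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOP/s: cores x FLOP/cycle/core x clock in GHz.

    cores * flop_per_cycle * GHz gives GFLOP/s; divide by 1000 for TFLOP/s.
    """
    return cores * flop_per_cycle_per_core * clock_ghz / 1000.0


# Figures quoted in the comment (assumed, not measured).
epyc_9965_fp32 = peak_tflops(cores=192, flop_per_cycle_per_core=64, clock_ghz=3.35)
print(f"EPYC 9965 FP32 peak: {epyc_9965_fp32:.1f} TFLOP/s")
```

This is a theoretical ceiling only: it assumes every core sustains the quoted clock while issuing full-width FMAs every cycle.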


zzzoom 1 day ago

EPYC 9965: 614 GB/s from 12-channel DDR5-6400

A100: 1935 GB/s from HBM2e

Most of those FLOPS are constrained by memory bandwidth.
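
One way to put numbers on this is the roofline model: a kernel needs at least `peak FLOP/s / bandwidth` FLOP per byte moved before compute, not memory, becomes the limit. The sketch below uses the FP32 peaks and bandwidths quoted in this thread; the function names are hypothetical.

```python
def ridge_point(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """FLOP per byte at which a kernel stops being memory bound (roofline ridge)."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def attainable_tflops(peak_tflops: float, bandwidth_gb_s: float, flop_per_byte: float) -> float:
    """Roofline bound: min(compute peak, arithmetic intensity x memory bandwidth)."""
    memory_bound_tflops = flop_per_byte * bandwidth_gb_s / 1000.0
    return min(peak_tflops, memory_bound_tflops)


# Figures as quoted in the thread (FP32 peak, DRAM/HBM bandwidth).
a100 = {"peak_tflops": 19.5, "bandwidth_gb_s": 1935.0}
epyc = {"peak_tflops": 41.2, "bandwidth_gb_s": 614.0}

for name, chip in (("A100", a100), ("EPYC 9965", epyc)):
    print(f"{name}: ridge at {ridge_point(**chip):.1f} FLOP/byte, "
          f"bound at 1 FLOP/byte = {attainable_tflops(**chip, flop_per_byte=1.0):.2f} TFLOP/s")
```

The model ignores caches, so it describes traffic to main memory only.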

Const-me 21 hours ago
> Most of those FLOPS are constrained by memory bandwidth
I believe inference with large enough batch size is almost always compute bound, simply due to algorithmic complexity.
Each step of tiled matrix multiplication with square N x N tiles takes O(N^2) memory loads and O(N^3) compute operations. With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
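
The O(N^3) / O(N^2) ratio can be made concrete with a rough sketch. It assumes one tile step loads one N x N tile of each input matrix and performs N^3 multiply-adds (2 FLOP each); the function name and the choice to ignore output writes are assumptions for illustration.

```python
def tile_step_intensity(n: int, bytes_per_element: int) -> float:
    """FLOP per byte loaded for one step of tiled matmul with N x N tiles.

    Loads: two N x N input tiles. Compute: N^3 multiply-adds = 2 * N^3 FLOP.
    The ratio simplifies to N / bytes_per_element, so it grows linearly with N.
    """
    flop = 2 * n**3
    bytes_loaded = 2 * n**2 * bytes_per_element
    return flop / bytes_loaded


for n in (32, 64):
    print(f"N = {n}: {tile_step_intensity(n, 4):.0f} FLOP/byte in FP32, "
          f"{tile_step_intensity(n, 2):.0f} FLOP/byte in FP16")
```

Whether a given N is enough to saturate compute depends on the ratio of the device's peak FLOP/s to its memory bandwidth.
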
- zzzoom 17 hours ago
  
  Prefill (GEMM) is compute bound, decode (GEMV) is memory bound.
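
The GEMM versus GEMV distinction shows up directly in arithmetic intensity. The sketch below is an idealized count that assumes each matrix is read or written exactly once; the 4096 x 4096 FP16 weight matrix is an arbitrary example size, not a figure from the thread.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_element: int) -> float:
    """Ideal FLOP per byte for C[m, n] = A[m, k] @ B[k, n].

    Counts 2*m*n*k FLOP and one pass over A, B and C. Setting n = 1 gives GEMV.
    """
    flop = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flop / bytes_moved


d = 4096  # hypothetical hidden size
gemv = matmul_intensity(d, 1, d, bytes_per_element=2)  # decode: one token at a time
gemm = matmul_intensity(d, d, d, bytes_per_element=2)  # prefill: many tokens at once
print(f"GEMV: {gemv:.2f} FLOP/byte, GEMM: {gemm:.0f} FLOP/byte")
```

With n = 1 the weight matrix is read once per output vector, so intensity stays near 1 FLOP/byte regardless of matrix size, while for square GEMM it grows with the dimension.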

tosh 1 day ago

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smartphones!)

p1esk 1 day ago
Intel Xeon 6980P: 128 cores x 1024 FP16 FLOP/cycle/core x 3.2 GHz: 419 TFLOP/s
- tosh 1 day ago
  
  I'm not saying "GPU more brrt than CPU"
  I found the comparison interesting
  on Intel Xeon 6980P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:
  how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?
  no?

aesthesia 1 day ago

That's also a CPU that came out four years later than the A100. The contemporaneous B200 is not optimized for FP32 and does 74.45 TFLOP/s. For FP16 it's at ~2 PFLOP/s.

p1esk 21 hours ago

The point is that modern CPUs are not as slow as most DL people think. Roughly 10x slower but with a lot more memory.