
Comment by zozbot234

3 hours ago

A quick rule of thumb: the worst case for memory bandwidth is roughly one or two bytes per core per peak clock cycle (not unlike an old 8-bit or 16-bit machine!) when running highly multithreaded workloads that mostly hit main RAM rather than cache. So there's a lot of gain to be had before memory bandwidth is truly saturated, and even then one can plausibly move to GPU-based compute and speed things up further. (Unified memory plus HBM may add a 2x or 3x multiplier to this basic figure, but either way it's in the ballpark.)
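
To make the arithmetic concrete, here's a rough Python sketch of the rule of thumb. The function name and the machine figures (100 GB/s, 16 cores, 5 GHz) are made-up round numbers for illustration, not measurements of any real chip.

```python
def bytes_per_cycle_per_core(bandwidth_gb_s: float, cores: int, clock_ghz: float) -> float:
    """Main-memory bytes available to each core per peak clock cycle.

    bandwidth_gb_s: sustained main-memory bandwidth in GB/s (1e9 bytes/s)
    cores:          number of cores all streaming from RAM at once
    clock_ghz:      peak core clock in GHz (1e9 cycles/s)
    """
    if cores <= 0 or clock_ghz <= 0:
        raise ValueError("cores and clock_ghz must be positive")
    # GB/s divided by (cores * GHz): the 1e9 factors cancel,
    # leaving bytes per cycle per core.
    return bandwidth_gb_s / (cores * clock_ghz)


# Hypothetical machine: 100 GB/s shared by 16 cores at a 5 GHz peak clock.
example = bytes_per_cycle_per_core(100.0, 16, 5.0)
print(f"{example:.2f} bytes/cycle/core")  # prints "1.25 bytes/cycle/core"
```

Plug in your own bandwidth, core count, and clock. If the result lands around 1 to 2, you're in the regime described above when every core is streaming from RAM.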