Comment by tareqak
8 years ago
With the newer AMD processors having as many real cores as they do, does the cost-benefit analysis of HT/SMT change? I read in a comment here a few weeks ago that turning it off on the newer AMD CPUs can yield better performance because of improved cache-coherency on some workloads (My memory of what I read might be totally wrong).
Yes, but it's often a case where deploying finer-grained explicit parallelism works in favor, due to the much longer dependency chains that can be hidden by the second thread. There are architectures not impacted by this issue, mostly ones with explicit dependency tagging long enough to handle a dTLB fault, but you need closer to double the registers and interleaving by the compiler to get about the same performance as with SMT-2 (aka hyperthreading).
I'll defer to whatever the benchmarks say of course, but I don't see why HT would affect cache coherency for normal workloads. If you disable HT you'd still have the same number of threads/processes running on the system, so you still have to schedule the same amount of work and do the same number of context switches.
Suppose you have 8 runnable processes and a 4 core / 8 thread system.
With 4 execution units, 4 processes run at a time, and they all get swapped out every scheduling tick (losing all their cached lines). OTOH each process gets a full measure of cache to use during its slice.
With 8 execution units, 8 processes "run" at a time, they interleave based on stalls and CPU resources, and the OS doesn't need to reschedule anything every tick (so they hopefully keep their cache lines hot). But each process gets a half measure of cache to use.
In reality, code tuned to use a full measure of cache will be better off matching the number of processes to the number of execution units available, so you'd run half the number of processes with HT disabled. And cache-tuned code tends to fall off a performance cliff when it exceeds cache available, so it may easily run more than twice as fast, depending on the work.
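A toy model of the 4 core / 8 thread example, as a rough sketch only: the 512 KiB per-core cache and the 300 KiB working set are made-up numbers, the function names are hypothetical, and it assumes the two hardware threads of a core split that core's cache evenly (real sharing is messier). It tracks just two things: how many runnable processes have to wait for a scheduling tick, and whether a cache-tuned working set still fits.

```python
def waiting_processes(runnable: int, cores: int, threads_per_core: int) -> int:
    """Runnable processes that have no hardware thread and must wait to be scheduled."""
    return max(0, runnable - cores * threads_per_core)


def cache_share_kib(cache_per_core_kib: float, threads_per_core: int) -> float:
    """Cache each running process can count on, assuming an even split
    between the hardware threads of a core (a simplification)."""
    return cache_per_core_kib / threads_per_core


def fits_in_cache(working_set_kib: float, cache_per_core_kib: float,
                  threads_per_core: int) -> bool:
    """True if the working set fits in the process's share of the cache."""
    return working_set_kib <= cache_share_kib(cache_per_core_kib, threads_per_core)


if __name__ == "__main__":
    CACHE_PER_CORE_KIB = 512  # hypothetical per-core cache size
    WORKING_SET_KIB = 300     # hypothetical working set of cache-tuned code

    for label, threads_per_core in (("HT off", 1), ("HT on", 2)):
        print(
            label,
            "| waiting:", waiting_processes(8, 4, threads_per_core),
            "| cache share (KiB):", cache_share_kib(CACHE_PER_CORE_KIB, threads_per_core),
            "| fits:", fits_in_cache(WORKING_SET_KIB, CACHE_PER_CORE_KIB, threads_per_core),
        )
```

With these numbers, HT off leaves 4 of the 8 processes waiting but the working set fits; HT on leaves nobody waiting but the working set no longer fits, which is the cliff case for cache-tuned code.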
The win from HT depends on most code not being tuned to full measures of cache, and having a lot of memory stalls or other heterogeneous work that other work can fit into. And most code is like that. Cache tends to have a declining marginal return - you have to add exponentially more cache to avoid cache misses - https://en.wikipedia.org/wiki/Power_law_of_cache_misses .
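To put rough numbers on that: the power law in the linked article is usually written as M = M0 * (C / C0)^(-alpha), where alpha depends on the workload. The sketch below just evaluates that formula; alpha = 0.5 is an illustrative choice, not a measured value, and the function names are hypothetical.

```python
def miss_rate(base_miss_rate: float, base_cache: float, cache: float,
              alpha: float = 0.5) -> float:
    """Power law of cache misses: M = M0 * (C / C0) ** (-alpha)."""
    return base_miss_rate * (cache / base_cache) ** (-alpha)


def cache_growth_for(miss_reduction: float, alpha: float = 0.5) -> float:
    """Factor by which the cache must grow to cut the miss rate by `miss_reduction`x."""
    return miss_reduction ** (1.0 / alpha)


if __name__ == "__main__":
    # Halving the miss rate at alpha = 0.5 needs a 4x larger cache.
    print("cache growth to halve misses:", cache_growth_for(2.0))
    # Halving the cache (e.g. sharing it with an SMT sibling) raises misses by ~1.41x.
    print("relative miss rate at half the cache:", miss_rate(1.0, 1.0, 0.5))
```

Under this model, giving each thread half the cache costs about 41% more misses rather than twice as many, which is one way to see why HT can still win for code that is not tuned to a full measure of cache.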
The working set at any point in time might be larger with HT and less likely to fit in cache.
Some of the heavily multithreaded applications I use at work see up to a ~27% loss in performance by disabling SMT, others don't see much of a loss at all (on AMD EPYC).
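For scale, a quick conversion of that figure, assuming "loss" means throughput with SMT off is 27% lower than with SMT on (the comment doesn't say which metric was measured). The helper name is hypothetical.

```python
def smt_speedup_from_loss(loss_fraction: float) -> float:
    """Throughput with SMT on relative to SMT off, given the fractional
    throughput loss observed when SMT is disabled."""
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    return 1.0 / (1.0 - loss_fraction)


if __name__ == "__main__":
    print(round(smt_speedup_from_loss(0.27), 2))
```

Under that reading, a 27% loss from disabling SMT is the same as SMT giving roughly a 1.37x throughput gain on those applications.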