
Comment by barrkel

8 years ago

Suppose you have 8 runnable processes and a 4 core / 8 thread system.

With 4 execution units (HT disabled), 4 processes run at a time, and all of them get swapped out every scheduling tick, losing all their cached lines. OTOH each process gets a full measure of cache to use during its slice.

With 8 execution units (HT enabled), 8 processes "run" at a time, interleaving based on stalls and contended CPU resources, and the OS doesn't need to reschedule anything every tick (so they hopefully keep their cache lines hot). But each process only gets a half measure of cache to use.

In reality, code tuned to use a full measure of cache is better off matching the number of processes to the number of execution units available, so you'd run half as many processes with HT disabled. And cache-tuned code tends to fall off a performance cliff once its working set exceeds the available cache, so each process may easily run more than twice as fast, depending on the work.
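A rough way to see that cliff for yourself is a pointer-chasing microbenchmark: chase a random permutation through working sets of increasing size and watch the time per hop jump once you blow past the last-level cache. This sketch is hypothetical (the function name, sizes, and hop count are mine, not from the comment), and CPython's interpreter overhead blunts the effect badly compared to the equivalent C - it's the shape of the experiment that matters, not these numbers.

```python
# Hypothetical sketch of a cache-cliff microbenchmark: time a random
# pointer chase at several working-set sizes. A random permutation
# defeats the hardware prefetcher, so hop time tracks cache misses.
# Note: Python list "slots" are pointers to int objects, so the real
# working set is larger and fuzzier than n_slots * 8 bytes.
import random
import time

def chase_ns_per_hop(n_slots, hops=200_000):
    # Build a single random cycle over all slots: following
    # next_slot[] visits every slot in shuffled order.
    perm = list(range(n_slots))
    random.shuffle(perm)
    next_slot = [0] * n_slots
    cur = perm[0]
    for p in perm[1:]:
        next_slot[cur] = p
        cur = p
    next_slot[cur] = perm[0]  # close the cycle

    i = 0
    start = time.perf_counter()
    for _ in range(hops):
        i = next_slot[i]
    return (time.perf_counter() - start) / hops * 1e9

if __name__ == "__main__":
    for slots in (1 << 10, 1 << 15, 1 << 20):  # small -> larger than most LLCs
        print(f"{slots:>8} slots: {chase_ns_per_hop(slots):6.1f} ns/hop")
```

In C, with one array element per cache line, the same experiment shows hop time roughly flat within each cache level and stepping up sharply at the L1/L2/LLC boundaries - the cliff the comment describes.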

The win from HT depends on most code not being tuned to a full measure of cache, and on having a lot of memory stalls or other heterogeneous work that the sibling thread's work can fit into. And most code is like that. Cache tends to have declining marginal returns - you have to multiply the cache size several times over just to halve the miss rate - https://en.wikipedia.org/wiki/Power_law_of_cache_misses