← Back to context

Comment by adrian_b

11 hours ago

On my tests on an AMD Zen 3 (a 5900X), I have determined that with SMT disabled, the best performance is with N+1 threads, where N is the number of cores.

On the other hand, with SMT enabled, the best performance has been obtained with 2N threads, where N is the same as above, i.e. with the same number of threads as supported in hardware.

For example, on a 12C/24T 5900X, it is best to use "make -j13" if SMT is disabled, but "make -j24" if SMT is enabled.

For software project compilation, enabling SMT is always a win, which is not always the case for other applications, i.e. for those where the execution time is dominated by optimized loops.

Similarly, I expect that for the older Meteor Lake, Raptor Lake and Alder Lake CPUs the best compilation speed is achieved with 2 x #P + #E threads, even if this should improve the compilation time by only something like 10% over that achieved with #P + #E threads. At least the published compilation benchmarks are consistent with this expectation.

EDIT: I see from your other posting that you have used a notation that has confused me, i.e. that by #P you have meant the number of SMT threads that can be executed on P cores, not the number of P cores.

Therefore what I have written as 2 x #P + #E is the same as what you have written as #P + #E.

So your results are the same with what I have obtained on AMD CPUs. With SMT enabled, the optimal number of threads is the total number of threads supported by the CPU, where the SMT cores support multiple threads.

Only with SMT disabled an extra thread over the number of cores becomes beneficial.

The reason for this difference in behavior is that with SMT disabled, any thread that is stalled by waiting for I/O leaves a CPU core idle, slowing the progress. With SMT enabled, when any thread is stalled, the threads are redistributed to balance the workload and no core remains idle. Only 1 of the P cores runs a thread instead of 2, which reduces its throughput by only a few percent. Avoiding this small loss in throughput by running an extra thread does not provide enough additional performance to compensate the additional overhead caused by an extra thread.