Comment by rwmj

9 months ago

Slightly off topic, but if I'm aiming to get the fastest 'make -jN' for some random C project (such as the kernel) should I set N = #P [threads] + #E, or just the #P, or something else? Basically, is there a case where using the E cores slows a compile down? Or is power management a factor?

I timed it on the single Intel machine I have access to with E-cores and setting N = #P + #E was in fact the fastest, but I wonder if that's a general rule.

11 comments

rwmj

adrian_b 9 months ago

On my tests on an AMD Zen 3 (a 5900X), I have determined that with SMT disabled, the best performance is with N+1 threads, where N is the number of cores.

On the other hand, with SMT enabled, the best performance has been obtained with 2N threads, where N is the same as above, i.e. with the same number of threads as supported in hardware.

For example, on a 12C/24T 5900X, it is best to use "make -j13" if SMT is disabled, but "make -j24" if SMT is enabled.

For software project compilation, enabling SMT is always a win, which is not always the case for other applications, i.e. for those where the execution time is dominated by optimized loops.

Similarly, I expect that for the older Meteor Lake, Raptor Lake and Alder Lake CPUs the best compilation speed is achieved with 2 x #P + #E threads, even if this should improve the compilation time by only something like 10% over that achieved with #P + #E threads. At least the published compilation benchmarks are consistent with this expectation.

EDIT: I see from your other posting that you have used a notation that has confused me, i.e. that by #P you have meant the number of SMT threads that can be executed on P cores, not the number of P cores.

Therefore what I have written as 2 x #P + #E is the same as what you have written as #P + #E.

So your results are the same with what I have obtained on AMD CPUs. With SMT enabled, the optimal number of threads is the total number of threads supported by the CPU, where the SMT cores support multiple threads.

Only with SMT disabled an extra thread over the number of cores becomes beneficial.

The reason for this difference in behavior is that with SMT disabled, any thread that is stalled by waiting for I/O leaves a CPU core idle, slowing the progress. With SMT enabled, when any thread is stalled, the threads are redistributed to balance the workload and no core remains idle. Only 1 of the P cores runs a thread instead of 2, which reduces its throughput by only a few percent. Avoiding this small loss in throughput by running an extra thread does not provide enough additional performance to compensate the additional overhead caused by an extra thread.

wmf 9 months ago

Power management is a factor because the cores "steal" power from each other. However the E-cores are more efficient so slowing down P-cores and giving some of their power to the E-cores increases the efficiency and performance of the chip overall. In general you're better off using all the cores.

jeffbee 9 months ago

I suggest this depends on the exact model you are using. On Alder/Raptor Lake, the E-cores run at 4.5GHz which is completely futile, and in doing so they heat their adjacent P-cores, because 2x E-core clusters can easily draw 135W or more. This can significantly cut into the headroom of the nearby P. Arrow Lake-S has rearranged the E-cores.

saurik 9 months ago

Did you test at least +1 if not *1.5 or something? I would expect you to occasionally get blocked on disk I/O and would want some spare work sitting hot to switch in.

rwmj 9 months ago
Let me test that now. Note I only have 1 Intel machine so any results are very specific to this laptop.
-j time (mean ± σ) 12 (#P+#E) 130.889 s ± 4.072 s 13 (..+1) 135.049 s ± 2.270 s 4 (#P) 179.845 s ± 1.783 s 8 (#E) 141.669 s ± 3.441 s
Machine: 13th Gen Intel(R) Core(TM) i7-1365U; 2 x P-cores (4 threads), 8 x E-cores
- wtallis 9 months ago
  
  Your processor has two P cores, and ten cores total, not twelve. The HyperThreading (SMT) does not make the two P cores into four cores. Your experiment with 4 threads will most likely result in using both P cores and two E cores, as no sane OS would double up threads on the P cores before the E cores were full with one thread each.
  
  5 replies →