Skymont: Intel's E-Cores reach for the Sky

13 hours ago (chipsandcheese.com)

Triple decoder is one unique effect. The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO. Very well done.

Another unique effect is L2 shared between 4 cores. This means that thread communication across those 4 cores has much lower latency.

I've had lots of debates with people online about this design vs Hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).

Big cores (P-cores or AMD Zen 5) can obviously split into 2 hyperthreads, but what if that division is still too big? E-cores give you 4 threads of support in roughly the same space as 1 P-core.

This is because the L2 cache is shared/consolidated, and other resources (ROP buffers, register files, etc. etc.) are just all so much smaller on the E-core.

It's an interesting design. I'd still think that growing the cores to 4-way SMT (like Xeon Phi) or 8-way SMT (POWER10) would be a more conventional way to split up resources, though. But obviously I don't work at Intel, nor can I make these kinds of decisions.

  • > The fact that Intel managed to get them lined up for small loops to do 9x effective instruction issue is basically miraculous IMO

    Not just small loops. It can reach 9x instruction decode on almost any control flow pattern. It just looks at the next 3 branch targets and starts decoding at each of them. As long as there is a branch every 32ish instructions (presumably a taken branch?), Skymont can keep all three uop queues full and Rename/dispatch can consume uops at a sustained rate of 8 uops per cycle.

    And in typical code, blocks with more than 32 instructions between branches are somewhat rare.

    But Skymont has a brilliant trick for dealing with long runs of branchless code too: it simply inserts dummy branches into the branch predictor, breaking those runs into shorter blocks that fit into the 32-entry uop queues. The 3 decoders will start decoding the long block at three different positions, leap-frogging over each other until the entire block is decoded and stuffed into the queues.

    This design is absolutely brilliant. It seems to entirely solve the issue of decoding x86, with far fewer resources than a uop cache. I suspect the approach will scale to almost unlimited numbers of decoders running in parallel, shifting the bottlenecks to other parts of the design (branch prediction and everything post-decode).

    • Thanks for the explanation. I was wondering how the heck Intel managed to make 9-wide decode work on x86, in a low-power core of all things. Seems like an elegant approach.

      2 replies →

  • While the frontend of Intel Skymont, which includes instruction fetching and decoding, is very original and unlike that of any other CPU core, the backend of Skymont, which includes the execution units, is extremely similar to that of Arm Cortex-X4 (a.k.a. Neoverse V3 in its server variant and Neoverse V3AE in its automotive variant).

    The similarity is that Intel Skymont and Arm Cortex-X4 have the same number of execution units of each kind (and there are many kinds of execution units).

    Therefore it can be expected that for any application whose performance is limited by the CPU core backend, Intel Skymont and Arm Cortex-X4 (or Neoverse V3) should have very similar performance.

    Moreover, Intel Skymont and Arm Cortex-X4 have the same die area, i.e. around 1.7 square mm (in both cases including 1 MB of L2 cache in this area). Therefore the 2 cores should not only have about the same performance for backend-limited applications, but they also have the same cost.

    Before Skymont, all the older Intel Atom cores had been designed to compete with the medium-size Arm Cortex-A7xx cores, even if the Intel Atom cores have always lagged Cortex-A7xx in performance by a year or two. For instance, Intel Tremont had a very similar performance to Arm Cortex-A76, while Intel Gracemont and Crestmont have a core backend extremely similar to the Cortex-A78 through Cortex-A725 series (like Gracemont and Crestmont, the 5 cores Cortex-A78, Cortex-A710, Cortex-A715, Cortex-A720 and Cortex-A725 have only insignificant differences in their execution units).

    With Skymont, Intel has made a jump in E-core size, positioning it as a match for Cortex-X, not for Cortex-A7xx, like its predecessors.

  • Skymont is an improvement but...

    Skymont's area efficiency should be compared to Zen 5C on 3 nm, which has higher IPC, SMT with dual decoders (one for each thread), and full-rate AVX-512 execution.

    AMD didn't have major difficulties in scaling down their SMT cores to achieve similar performance per area. But Intel went with a different approach, at the cost of having different ISA support on each core type in consumer devices and having to produce an SMT version of their P-cores for servers anyway.

    • It should be noted that Intel Skymont has the same area as Arm Cortex-X4 (a.k.a. Neoverse V3) and should also have the same performance for any backend-limited application. Both use 1.7 square mm in the "3 nm" TSMC fabrication process, while a Zen 5 compact might have almost double that area in the less dense "4 nm" process with full vector pipelines, and a 3 square mm area with reduced vector pipelines in the same less dense process.

      Arm Cortex-X4 has the best performance per area among the cores designed by Arm. Cortex-X925 has double the area of Cortex-X4, which results in a much lower performance per area. Cortex-A725 is smaller in area, but the area ratio is likely to be smaller than the performance ratio (for most kinds of execution units Cortex-X4 has double the number, while for some it has only a 50% or 33% advantage), so the performance per area of Cortex-A725 is likely worse than that of Cortex-X4 and Skymont.

      For any programs that benefit from vector instructions, Zen 5 compact will have a much better performance per area than Intel Skymont and Arm Cortex-X4.

      For programs that execute mostly irregular integer and pointer operations, there are chances for Intel Skymont and Arm Cortex-X4 to achieve better performance per area, but this is uncertain.

      Intel Skymont and Arm Cortex-X4 have a greater number of integer/pointer execution units per area than Zen 5 compact, even if Zen 5 compact were made with a TSMC process equally dense, which is not the case today.

      Despite that, the execution units of Zen 5 compact will be busy a much higher percentage of the time, for several reasons: Zen 5 is better balanced, it has more resources for out-of-order and multithreaded execution, and it has better cache memories. All these factors result in a higher IPC for Zen 5.

      It is not clear whether the better IPC of Zen 5 is enough to compensate for its greater area when performing only irregular integer and pointer operations. Most likely, in such cases Intel Skymont and Arm Cortex-X4 keep a small advantage in performance per area, i.e. in performance per dollar, because the advantage in IPC of Zen 5 (when using SMT) may be in the range of 10% to 50%, while the advantage in area of Intel Skymont and Arm Cortex-X4 might be somewhere between 50% and 70%, had they been made with the same TSMC process.
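
      To make that arithmetic concrete, with purely illustrative values picked from the ranges above:

        perf/area (Skymont or Cortex-X4 relative to Zen 5 compact, integer code)
          ≈ (1 + area advantage) / (1 + Zen 5 IPC advantage)
          ≈ 1.6 / 1.3 ≈ 1.2   (mid-range values: 60% area, 30% IPC)
          ≈ 1.5 / 1.5 = 1.0   (50% area against 50% IPC: the advantage vanishes)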

      On the other hand, for any program that can be accelerated with vector instructions, Zen 5 compact will crush in performance per area (i.e. in performance per dollar) any core designed by Intel or Arm.

  • > It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).

    Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async

    • Not... quite. I think you've got the cause-and-effect backwards.

      Programmers who happen to write multithreaded programs don't need powerful cores; they want more cores. A Blender programmer calculating cloth physics would rather have 4x weaker cores than 1x P-core.

      Programmers who happen to write powerful single-threaded programs need powerful cores. For example, AMD's "X3D" line of CPUs famously has 96MB of L3 cache, and video games that run on these very powerful cores have much better performance.

      It's not "Programmers should change their code to fit the machine". From Intel's perspective, CPU designers should change their core designs to match the different kinds of programmers. Single-threaded (or low-thread) programmers... largely represented by video game programmers... want P-cores. But not very many of them.

      Multithreaded programmers... represented by graphics and a few others... want E-cores. Splitting a P-core into "only" 2 threads is not sufficient; they want 4x or even 8x more cores. Because there are multiple communities of programmers out there, dedicating design teams to creating entirely different cores is a worthwhile endeavor.

      --------

      > Does that mean if I can take a single-threaded program and split it into multiple threads, it might use less power? I have been telling myself that the only reason to use threads is to get more CPU power or to call blocking APIs. If they're actually more power-efficient, that would change how I weigh threads vs. async

      Power-efficiency is going to be incredibly difficult moving forward.

      It should be noted that E-cores are not very power-efficient though. They're area-efficient, i.e. cheaper for Intel to make. Intel can sell 4x as many E-cores for roughly the same price/area as 1x P-core.

      E-cores are cost-efficient cores. I think they happen to use slightly less power, but I'm not convinced that power-efficiency is their particular design goal.

      If your code benefits from cache (i.e. big cores), it's probable that the lowest power cost would be to run on large caches (like P-cores or Zen 5 or Zen 5 X3D). Communicating with RAM always costs more power than just communicating with caches, after all.

      If your code does NOT benefit from cache (e.g. Blender regularly has 100GB+ scenes for complex movies), then all of those spare resources on P-cores are useless, as nothing fits anyway and the core will be spending almost all of its time waiting on RAM to do anything. So the E-core will be more power-efficient in this case.

      13 replies →

  • I see "ROP" and immediately think of Return Oriented Programming and exploits...

    • Lulz, I got some wires crossed. The CPU resource I meant to say was ROB: Reorder Buffer.

      I don't know why I wrote ROP. You're right, ROP means return oriented programming. A completely different thing.

  • "Another unique effect is L2 shared between 4 cores. This means that thread communications across those 4 cores has much lower latencies."

    @dragontamer solid point. Consider an in-memory ring buffer shared between two threads. There's a huge difference in throughput and latency when the threads share L2 (on the same core) versus when they're on different cores, all down to the relative slowness of L3.
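
    Here's a rough sketch of that kind of measurement, for the curious: not an actual ring buffer, just two threads bouncing a counter through shared atomics, so the round trip is dominated by how far the cache line has to travel. The taskset CPU numbers are invented; the real topology differs per machine.

      // Ping-pong latency probe (build with: g++ -O2 -pthread pingpong.cpp).
      // Compare runs pinned to cores that share an L2 vs. cores that don't, e.g.
      //   taskset -c 0,1 ./a.out     vs.     taskset -c 0,8 ./a.out
      // (CPU numbers are made up; check your own topology first.)
      #include <atomic>
      #include <chrono>
      #include <cstdint>
      #include <cstdio>
      #include <thread>

      static std::atomic<uint64_t> ping{0}, pong{0};
      static const uint64_t kIters = 1000000;

      int main() {
          std::thread peer([] {
              for (uint64_t i = 1; i <= kIters; ++i) {
                  while (ping.load(std::memory_order_acquire) != i) {}  // wait for the counter
                  pong.store(i, std::memory_order_release);             // bounce it back
              }
          });
          auto t0 = std::chrono::steady_clock::now();
          for (uint64_t i = 1; i <= kIters; ++i) {
              ping.store(i, std::memory_order_release);
              while (pong.load(std::memory_order_acquire) != i) {}
          }
          auto t1 = std::chrono::steady_clock::now();
          peer.join();
          std::printf("round trip: %.1f ns\n",
                      std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters);
      }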

    Are there other CPUs (Arm, Graviton?) that have similarly shared L2 caches?

    • Hyperthreading actually shares the L1 cache between two threads (after all, the two threads are running on the same core and in the same L1 cache).

      I believe the SMT4 and SMT8 cores from IBM POWER10 also share their L1 caches (up to 8 threads on one core), and thus benefit from these communication speeds.

      But you're right in that this is a very obscure performance quirk. I'm honestly not aware of any practical code that takes advantage of this. E-cores are perhaps the most "natural" implementation of high-speed core-to-core communications though.

  • What we desperately need before we get too deep into this is stronger support in languages for heterogeneous cores, in an architecture-agnostic way. Some way to annotate that certain threads should run on certain types of cores (and close together in the memory hierarchy) without getting too deep into implementation details.

    • I don't think so. I don't trust software authors to make the right choice, and the most tilted examples, where a thread will usually need a bigger core, can afford to wait for the scheduler to figure it out.

      And if you want to be close together in the memory hierarchy, does that mean close to the RAM that you can easily get to? And you want allocations from there? If you really want that, you can use numa(3).

      > without getting too deep into implementation details.

      Every microarchitecture is a special case about what you win by being close to things, and how it plays with contention and distances to other things. You either don't care and trust the infrastructure, or you want to micromanage it all, IMO.
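
      For reference, the "micromanage it all" route today on Linux is plain CPU affinity. A minimal sketch, with invented CPU numbers (which CPUs are E-cores is itself machine-specific; the topology lives under /sys/devices/system/cpu/):

        // Pin the calling thread to an explicit CPU set (build with g++ -pthread).
        // There is no portable "run this on an efficiency core" annotation, so you
        // end up naming raw CPU numbers, which differ from machine to machine.
        #include <pthread.h>
        #include <sched.h>
        #include <vector>

        static bool pin_current_thread(const std::vector<int>& cpus) {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int cpu : cpus) CPU_SET(cpu, &set);
            return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
        }

        int main() {
            pin_current_thread({4, 5, 6, 7});  // assumption: CPUs 4..7 are one E-core cluster
            // ... background / throughput-oriented work goes here ...
        }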

  • > I've had lots of debates with people online about this design vs Hyperthreading. It seems like the overall discovery from Intel is that highly threaded tasks use less resources (cache, ROPs, etc. etc).

    AMD did something similar before. Does anyone remember the Bulldozer cores sharing resources between pairs of cores?

    • AMD's modern SMT implementation is just better than Bulldozer's decoder sharing.

      Modern Zen 5 has a very good SMT implementation. Ten years ago, people mostly talked about Bulldozer's design to make fun of it. It was always intriguing to me, but it just never seemed to be practical in any workload.

Intel advertising the fact that their schedulers can keep MS Teams confined to the efficiency cores... what a sad reflection of how bloated Teams is.

We make a single Electron-like app grow, cancer-like, to do everything from messaging and videoconferencing to shared drive browsing and editing, and as a result we have to contain it.

  • It can run in your browser too. The Electron part isn't the bloat but the web part. Web devs keep using frameworks on top of frameworks and the bloat is endless. Lack of good native UX kits forces devs to use web-based kits. Qt has a nice idea with QML but, aside from some limitations, it is mostly C++ (yes, PyQt etc. exist).

    Native UI kits should be able to do better than web-based kits. But I suspect just as with the web, the problem is consistency. The one thing the web does right is deliver a consistent UI experience across various hardware with less dev time. It all comes down to which method has the least friction for devs. Large tech companies spent a lot of time and money on dev tooling for their web services, so web-based approaches to solving problems inherently get taken even for trivial apps (not that Teams is one).

    Open-source native UX kits that work consistently across platforms and languages would solve much of this. Unfortunately, the open-source community is stuck on polishing GTK and Qt.

    • > The electron part isn't the bloat but the web part.

      The bloat part is the bloat. Web apps can be made to be perfectly performant if you are diligent, and native apps can be made to be bloated and slow if you're not.

    • > But I suspect just as with the web, the problem is consistency.

      That is indeed the "problem" at its core. People are lazy, operating systems aren't consistent enough.

      Operating systems came about to abstract all the differences of countless hardware away, but that is no longer good enough. Now people want to abstract away that abstraction: Chrome.

      Chrome is the abstraction layer to Windows, MacOS, iOS, Linux, Android, BSD, C++, HTML, PHP, Ruby, Rust, Python, desktops, laptops, smartphones, tablets, and all the other things. You develop for Chrome and everyone uses Chrome and everyone on both sides gets the same thing for a singular effort.

      If I were to step away from all I know and care about computers and look at it as an uncaring man, I have to admit: it makes perfect sense. Fuck all that noise. Code for Chrome and use it through Chrome. The world can't be simpler.

  • These days you need a CPU, a GPU, a NPU and a TPU (not Tensor, but Teams Processing Unit).

    In my case, the TPU is a separate Mac that also does Outlook, and the real work gets done on the Linux laptop. I refer to this as the airgap to protect my sanity.

  • Teams is technically WebView2 and not Electron… with a bunch of native platform code.

  • It’s amazing what computers can do today and how absolutely inefficiently they are used.

Slightly off topic, but if I'm aiming to get the fastest 'make -jN' for some random C project (such as the kernel), should I set N = #P [threads] + #E, or just #P, or something else? Basically, is there a case where using the E-cores slows a compile down? Or is power management a factor?

I timed it on the single Intel machine I have access to with E-cores and setting N = #P + #E was in fact the fastest, but I wonder if that's a general rule.

  • In my tests on an AMD Zen 3 (a 5900X), I have determined that with SMT disabled, the best performance is with N+1 threads, where N is the number of cores.

    On the other hand, with SMT enabled, the best performance has been obtained with 2N threads, where N is the same as above, i.e. with the same number of threads as supported in hardware.

    For example, on a 12C/24T 5900X, it is best to use "make -j13" if SMT is disabled, but "make -j24" if SMT is enabled.

    For software project compilation, enabling SMT is always a win, which is not always the case for other applications, e.g. those where the execution time is dominated by optimized loops.

    Similarly, I expect that for the older Meteor Lake, Raptor Lake and Alder Lake CPUs the best compilation speed is achieved with 2 x #P + #E threads, even if this should improve the compilation time by only something like 10% over that achieved with #P + #E threads. At least the published compilation benchmarks are consistent with this expectation.

    EDIT: I see from your other posting that you have used a notation that has confused me, i.e. that by #P you have meant the number of SMT threads that can be executed on P cores, not the number of P cores.

    Therefore what I have written as 2 x #P + #E is the same as what you have written as #P + #E.

    So your results are the same as what I have obtained on AMD CPUs. With SMT enabled, the optimal number of threads is the total number of threads supported by the CPU, counting each SMT core as multiple threads.

    Only with SMT disabled does an extra thread over the number of cores become beneficial.

    The reason for this difference in behavior is that with SMT disabled, any thread that is stalled waiting for I/O leaves a CPU core idle, slowing progress. With SMT enabled, when any thread is stalled, the threads are redistributed to balance the workload and no core remains idle. Only 1 of the P cores runs a single thread instead of 2, which reduces its throughput by only a few percent. Avoiding this small loss in throughput by running an extra thread does not provide enough additional performance to compensate for the additional overhead caused by an extra thread.
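
    As a small illustration of that rule of thumb: C++'s std::thread::hardware_concurrency() already reports the total hardware thread count, SMT included, so it is the natural default job count with SMT enabled. This is only a sketch; in practice make is usually just invoked with -j$(nproc).

      // Sketch: derive the "one job per hardware thread" default discussed above.
      // hardware_concurrency() counts SMT threads (24 on a 12C/24T Zen 3, 12 on the
      // 2P(+HT) + 8E laptop from the sibling comment); it may return 0 if unknown.
      #include <algorithm>
      #include <cstdio>
      #include <thread>

      int main() {
          unsigned jobs = std::max(1u, std::thread::hardware_concurrency());
          std::printf("make -j%u\n", jobs);
      }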

  • Power management is a factor because the cores "steal" power from each other. However, the E-cores are more efficient, so slowing down the P-cores and giving some of their power to the E-cores increases the efficiency and performance of the chip overall. In general you're better off using all the cores.

    • I'd suggest this depends on the exact model you are using. On Alder/Raptor Lake, the E-cores run at 4.5 GHz, which is completely futile, and in doing so they heat their adjacent P-cores, because 2x E-core clusters can easily draw 135 W or more. This can significantly cut into the headroom of the nearby P-cores. Arrow Lake-S has rearranged the E-cores.

  • Did you test at least +1 if not *1.5 or something? I would expect you to occasionally get blocked on disk I/O and would want some spare work sitting hot to switch in.

    • Let me test that now. Note I only have 1 Intel machine so any results are very specific to this laptop.

        -j           time (mean ± σ)
        12 (#P+#E)   130.889 s ±  4.072 s
        13 (..+1)    135.049 s ±  2.270 s
         4 (#P)      179.845 s ±  1.783 s
         8 (#E)      141.669 s ±  3.441 s
      

      Machine: 13th Gen Intel(R) Core(TM) i7-1365U; 2 x P-cores (4 threads), 8 x E-cores

      5 replies →

Sounds like a nice core, but Intel's gone from "fries itself" broken in Raptor Lake to merely a broken cache/memory architecture in Lunar Lake.

None of the integer improvements make it onto the screen, as it were.

edit: I got Lunar and Arrow Lake's issues mixed up, but still there should be a lot more integer uplift on paper.

  • What do you consider broken about Lunar Lake? It looks good to me. The E-cores are on a separate island to allow the ring to power off, and that does lead to lower performance, but it's still good enough IMO.

I still don’t quite get how the CPU knows what is low priority or background. Or is that steered at the OS level, a bit like CPU pinning?

  • When the P/E model was introduced by Intel, there was a fairly long transition period where both Windows and Linux performed unpredictably poorly for most compute-intensive workloads, to the point where the advice was to disable the E-cores entirely if you were gaming or doing anything remotely CPU-intensive, or if your OS was never going to be updated (Win 7/8, many LTS Linux distros).

    It's not entirely clear to me why it took a while to add support on Linux because the kernel already supported big.LITTLE and from the scheduler's point of view it's the same thing as Intel's P/E cores. I guess the patch must've been simple but it just took very long to trickle down to common distributions?

    • Not very surprising, but IME when running VMs you still want to pin (at least on Linux).