Comment by rayiner
11 hours ago
Thanks for the explanation. I was wondering how the heck Intel managed to build a 9-way decode x86 core, and a low-power core of all things. Seems like an elegant approach.
The important bit: Intel E-cores now have 3 decoders, each capable of 3-wide decode. When they work as a team, they can perform 9 decodes per clock tick (which then bottlenecks to 8 renamed uops in the best case, and more likely ~3 or ~4 uops in the typical case).
3-4 uops per cycle is more of an average throughput than a typical throughput.
The average is dragged down by the many cycles that don't decode or rename any uops: either the frontend is waiting for bytes to decode (icache miss, etc.), or rename is blocked because the ROB is full (probably stalled on a dcache miss).
So you want a quite wide frontend so that whenever you are unblocked, you can drag the average up again.
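
To put rough numbers on that, here is a toy sketch rather than a model of any real core. It assumes every unstalled cycle delivers the full frontend width, which is optimistic, and the 10-cycle stall trace and the function name are made up for illustration.

```python
def average_uops_per_cycle(width, stall_trace):
    """Average uops per cycle, assuming each unstalled cycle delivers `width` uops.

    stall_trace is a list of booleans: True means the cycle delivered nothing
    (e.g. icache miss or ROB full), False means the frontend ran at full width.
    """
    delivered = sum(0 if stalled else width for stalled in stall_trace)
    return delivered / len(stall_trace)


# Hypothetical trace: 6 of 10 cycles are stalled.
trace = [True, True, False, False, True, True, True, False, True, False]

for width in (4, 8):
    print(f"{width}-wide: {average_uops_per_cycle(width, trace)} uops/cycle on average")
```

With this made-up trace, the 4-wide case averages 1.6 uops per cycle and the 8-wide case averages 3.2, so an average in the 3-4 range is consistent with a frontend that runs much wider in the cycles where it actually makes progress.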