Comment by dragontamer

8 hours ago

The important bit: Intel E-cores now have three decode clusters, each capable of 3-wide decode. Working as a team, they can perform 9 decodes per clock tick (which then bottlenecks at 8 renamed uops in the best-case scenario, with ~3 or ~4 uops being more typical).

That 3-4 uops per cycle is an average throughput, not a typical per-cycle throughput.

The average is dragged down by the many cycles that don't decode/rename any uops at all: either the frontend is waiting for bytes to decode (icache miss, etc.) or rename is blocked because the ROB is full (probably stalled on a dcache miss).

So you want quite a wide frontend, so that whenever you are unblocked you can drag the average back up again.
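
To make the average-vs-peak point concrete, here's a rough back-of-the-envelope sketch (my own illustrative numbers and stall fraction, not figures from the comment): if a sizable fraction of cycles deliver zero uops, the average throughput sits well below the peak decode width, and a wider frontend raises how fast you claw back the average in the unblocked cycles.

    # Toy model: `stalled_fraction` of cycles deliver 0 uops (icache miss,
    # ROB full, etc.); the rest deliver min(frontend_width, rename_width).
    # The 60% stall fraction below is an illustrative assumption.
    def average_uops_per_cycle(frontend_width, stalled_fraction, rename_width=8):
        burst = min(frontend_width, rename_width)
        return (1.0 - stalled_fraction) * burst

    for width in (3, 6, 9):
        avg = average_uops_per_cycle(width, 0.60)
        print(f"{width}-wide decode -> avg {avg:.1f} uops/cycle")
    # 3-wide decode -> avg 1.2 uops/cycle
    # 6-wide decode -> avg 2.4 uops/cycle
    # 9-wide decode -> avg 3.2 uops/cycle  (burst capped by 8-wide rename)

Under those assumed numbers, the 9-wide frontend only averages ~3-4 uops per cycle, which matches the gap between peak decode width and typical sustained throughput described above.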