
Comment by flanked-evergl

1 year ago

Is there a RISC dream? I think there is an efficiency "dream", there is a performance "dream", there is a cost "dream" — there are even low-complexity relative to cost, performance and efficiency "dreams" — but a RISC dream? Who cares more about RISC than cost, performance, efficiency and simplicity?

There was such a dream. It was about building a mind-bogglingly simple CPU, putting caches into the now-empty space where all the control logic used to be, clocking it up the wazoo, and letting the software deal with load/branch delays, efficiently using all 64 registers, etc. That would beat the hell out of those silly CISC architectures on performance, and at a fraction of the design and production costs!

This didn't work out, for two main reasons. First, just being able to crank the clocks hella high is still not enough to get great performance: you really do want your CPU to be super-scalar, out-of-order, and with a great branch predictor if you need amazing performance. But once you do all that, the simplicity of RISC decoding stops mattering all that much, as the Pentium II demonstrated when it equalled DEC Alpha on performance while still having practically useful things like byte loads/stores. Yes, it's RISC-like instructions under the hood, but that's an implementation detail; there's no reason to expose it to the user in the ISA, just as you don't have to expose branch delay slots in your ISA, because doing so is a bad idea: MIPS II added one additional pipeline stage, and suddenly two branch/load delay slots were needed. Whoops! So they added interlocks anyway (MIPS originally stood for "Microprocessor without Interlocked Pipeline Stages", ha-ha) and got rid of the load delays; they still left one branch delay slot exposed for backwards compatibility, and the circuitry required to support it was arguably silly.
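
A back-of-the-envelope sketch of that delay-slot arithmetic (a toy Python model with illustrative stage numbers, not the real MIPS pipelines): in a simple in-order, one-instruction-per-cycle pipeline, the number of delay slots the ISA has to expose is just the distance, in stages, between fetching a branch and resolving it.

    # Toy model: one instruction enters the pipeline per cycle, so every
    # stage between fetch and branch resolution is another instruction
    # already in flight by the time the branch outcome is known.
    def exposed_delay_slots(fetch_stage: int, resolve_stage: int) -> int:
        return resolve_stage - fetch_stage

    # A classic 5-stage design that resolves the branch right after fetch, in decode:
    print(exposed_delay_slots(fetch_stage=1, resolve_stage=2))  # 1 delay slot

    # Push branch resolution one stage later (a deeper pipeline):
    print(exposed_delay_slots(fetch_stage=1, resolve_stage=3))  # 2 delay slots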

The second reason was that the software (or compilers, to be more precise) can't really deal very well with all that stuff from the first paragraph. That's what sank Itanium. That's why nobody makes CPUs with register windows any more. And static instruction scheduling in the compilers still can't beat dynamic instruction reordering.

  • Great post, as it also directly invalidates the myth that the ARM instruction set somehow makes the whole CPU better than analogous x86 silicon. It might be true and responsible for something like 0.1% (guesstimate) of the total advantage; it's all RISC under the hood anyway, and both ISAs need decoders. x86 might need a slightly bigger one, which amounts to accounting noise in terms of die area.

    c.f. https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

  • > This didn't work out

    ... except it did.

    You had literal students design chips that outperformed industry cores that took huge teams and huge investment.

    Acorn had a team of just a few people build a core that outperformed an i460 with likely 1/100th of the investment. Not to mention the even more expensive VAX chips.

    Can you imagine how fucking baffled the DEC engineers at the time were when their absurdly complex and absurdly expensive VAX chip was smoked by a bunch of first-time chip designers?

    > as Pentium II demonstrated

    That chip came out in 1997. The original RISC chip research happened in the early '80s or even earlier. It did work; it's just that x86 was bound to the PC market and Intel had the finances to have huge teams hammer away at the problem. x86 was able to overtake Alpha because DEC was not doing well and couldn't invest the required amount.

    > no reason to expose it to the user in the ISA

    Except that hiding the implementation is costly.

    If you give 2 equal teams the same amount of money, which results in a faster chip: a team that does a simple RISC instruction set, or a team that does a complex CISC instruction set and transforms it into an underlying simpler one?

    Now of course Intel had backward compatibility to maintain, so they had to do what they had to do. They were just lucky they were able to invest so much more than all the other competitors.

    • > You had literal students design chips that outperformed industry cores that took huge teams and huge investment

      Everyone remember to thank our trans heroine Sophie Wilson (CBE).

    • > If you give 2 equal teams the same amount of money, what results in a faster chip.

      Depends on the amount of money. If it's less than a certain amount, the RISC design will be faster. If it's above it, both designs will perform about the same.

      I mean, look at ARM: they too decode their instructions into micro-ops and cache those in their high-performance models. What RISC buys you is the ability to be competitive at the low end of the market, with simplistic implementations. That's why we won't ever see e.g. a stack-like machine (no exposed general-purpose registers, but with flexible addressing modes for the stack, even something like [SP+[SP+12]], with the stack mirrored onto a hidden register file used as an "L0" cache, which neatly solves the problem that register windows were supposed to solve). Such a design can be made as fast as server-grade x86 or ARM, but only by throwing billions of dollars and several man-millennia at it; if you try to do it cheaper and quicker, its performance will absolutely suck. That's why e.g. System/360 didn't make that design choice even though IBM seriously considered it for half a year: they found out that the low-end machines would be unacceptably slow, so they went with the "registers plus base-plus-offset addressed memory" design instead.

  • To add on to what the sibling said (and setting aside that CISC chips have a separate frontend to break complex instructions down into an internal RISC-like instruction set, which blurs the difference), more RISC-like instruction sets do tend to win on performance and power, mainly because the instruction set has a fixed width. With 4-byte instructions you can fetch a line of cache and start decoding 32 instructions in parallel, whereas x86's variable length makes it harder to keep the superscalar pipeline full (its decoder is significantly more complex in order to still extract parallelism, which further slows it down). This is a bit more complicated on ARM (and maybe RISC-V?) where you have two widths, but even then it's easier in practice to extract performance, because x86 instructions can be anywhere from 1-4 bytes (or 1-8? Can't remember), which makes it hard to find instruction boundaries in parallel.

    There’s a reason that Apple is whooping AMD and Intel on performance/watt and it’s not solely because they’re on a newer fab process (it’s also why AMD and Intel utterly failed to get mobile CPU variants of their chips off the ground).
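
    A rough sketch of that decode-parallelism point (a toy Python model assuming a 64-byte cache line and made-up instruction lengths, not any real decoder): with a fixed width, every instruction boundary in a fetched line is known up front, while with variable lengths each boundary depends on decoding the previous instruction first.

        # Toy comparison: finding instruction boundaries in one fetched line.
        CACHE_LINE = 64  # bytes per fetch in this made-up example

        def fixed_width_boundaries(width=4):
            # Every boundary is known immediately: just multiples of the width,
            # so all slots can be handed to decoders at once.
            return list(range(0, CACHE_LINE, width))

        def variable_width_boundaries(length_at):
            # Each boundary depends on the previous instruction's length,
            # so a naive decoder walks the bytes one instruction at a time.
            boundaries, offset = [], 0
            while offset < CACHE_LINE:
                boundaries.append(offset)
                offset += length_at(offset)  # must partially decode to learn the length
            return boundaries

        def fake_length(offset):
            # Deterministic stand-in for x86-style lengths (1..15 bytes).
            return (offset * 7) % 15 + 1

        print(len(fixed_width_boundaries()), "fixed-width slots, all known at once")
        print(len(variable_width_boundaries(fake_length)), "variable-width slots, found sequentially")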

    • x86 instruction lengths range from 1 to 15 bytes.

      > a line of cache and 4 byte instructions you could start decoding 32 instructions in parallel

      In practice, ARM processors decode up to 4 instructions in parallel; so do Intel and AMD.

      3 replies →

But couldn't we define the RISC dream as a dream that efficiency, performance and low cost could be achieved by cores with very small instruction sets?

  • Not small instruction sets, simplified instruction sets. RISC’s main trick is to reduce the number of addressing modes (e.g., no memory-indirect instructions) and reduce the number of memory operands per instruction to 0 or 1. Use the instruction encoding space for more registers instead.

    The surviving CISCs, x86 and IBM's z/Architecture, are the least CISCy CISCs. The surviving RISCs, ARM and POWER, are the least RISCy RISCs.

    RISC-V is a weird throwback in some aspects of its instruction set design.

  • If adding more instructions negatively impacted efficiency, performance, cost and complexity, nobody would do it.

    • Probably true now, but in ye olde days, some instructions existed primarily to make assembly programming more convenient.

      Assembly programming is a real pain in the RISCiest of RISC architectures, like SPARC. Here's an example from https://www.cs.clemson.edu/course/cpsc827/material/Code%20Ge...:

      • All branches (including the one caused by CALL, below) take place after execution of the following instruction.

      • The position immediately after a branch is the “delay slot” and the instruction found there is the “delay instruction”.

      • If possible, place a useful instruction in the delay slot (one which can safely be done whether or not a conditional branch is taken).

      • If not, place a NOP in the delay slot.

      • Never place any other branch instruction in a delay slot.

      • Do not use SET in a delay slot (only half of it is really there).
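
      A toy sketch of what an assembler or compiler pass following those rules might do (a hypothetical Python representation of instructions, not a real SPARC tool): try to hoist a safe earlier instruction into the delay slot, and fall back to a NOP otherwise. An instruction taken from before the branch is safe on both paths, since it would have executed either way.

          # Toy delay-slot filler for a basic block ending in a branch.
          # Instructions are dicts with an opcode plus the registers they
          # define and use; purely illustrative, not a real assembler pass.

          def is_branch(ins):
              return ins["op"] in {"b", "be", "bne", "ble", "call"}

          def can_move_past(cand, later):
              # The candidate will now execute after 'later': it must not
              # write anything they read or write, nor read anything they write.
              for ins in later:
                  if set(cand["defs"]) & (set(ins["uses"]) | set(ins["defs"])):
                      return False
                  if set(cand["uses"]) & set(ins["defs"]):
                      return False
              return True

          def fill_delay_slot(block):
              branch = block[-1]
              if not is_branch(branch):
                  return block
              for i in range(len(block) - 2, -1, -1):
                  cand = block[i]
                  if is_branch(cand) or cand["op"] == "set":
                      continue  # never another branch or a SET in the delay slot
                  if can_move_past(cand, block[i + 1:]):
                      return block[:i] + block[i + 1:-1] + [branch, cand]
              # Nothing safe to move: pad the delay slot with a NOP.
              return block + [{"op": "nop", "defs": [], "uses": []}]

          block = [
              {"op": "mov", "defs": ["r5"], "uses": ["r6"]},   # unrelated to the branch
              {"op": "cmp", "defs": ["cc"], "uses": ["r1", "r2"]},
              {"op": "be",  "defs": [],     "uses": ["cc"]},
          ]
          print([ins["op"] for ins in fill_delay_slot(block)])  # ['cmp', 'be', 'mov']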

      1 reply →