Comment by lebuffon

2 days ago

I quickly skimmed the instruction set and did not see anything resembling a sub-routine call or branch and link instruction.

Did I miss it?

The "R exp" instruction is the subroutine call (it saves the return address to register B00), and I believe "J Bjk" is the subroutine return.

The Cray-1 didn't have a hardware stack, so subroutine call is basically just jump there and back, using a register for the return address rather than pushing/popping it to/from the stack.

Another oddity of the instruction set that stands out (since I'm in the process of defining a VM ISA for a hobby project) is that the branch instructions test a register (A0 or S0) rather than look at status flags. In a modern CPU a conditional branch such as if (x < y) is implemented as a compare followed by a branch, where the compare instruction sets flags as if it had done a subtraction but doesn't actually modify the accumulator. On the Cray this is evidently done by performing an actual subtraction, leaving the result in A0, then branching on the value of A0 (vs. looking at flags set by CMP).

Gemini explains this as being to help pipelining.
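
To make the difference between the two branch styles concrete, here is a rough Python sketch. It is a toy model only: the function names, the single N flag, and the dictionary "registers" are made up for illustration and do not reflect real Cray-1 or x86 encodings, and overflow is ignored.

```python
# Toy model of two ways to implement "if (x < y) branch".
# All names here are hypothetical; overflow and word width are ignored.

def branch_flags_style(x, y):
    """Flag style: a compare sets a negative flag from x - y and discards the difference."""
    flags = {"N": (x - y) < 0}   # CMP x, y: only the flag survives
    return flags["N"]            # branch is taken when N is set

def branch_register_style(x, y):
    """Register style: a real subtraction lands in A0 and the branch tests A0 itself."""
    regs = {"A0": x - y}         # A0 = x - y: the result is kept in the register
    return regs["A0"] < 0        # branch is taken when A0 is negative

# Both styles make the same decision; they differ in where the state lives.
for x, y in [(1, 2), (2, 1), (3, 3)]:
    assert branch_flags_style(x, y) == branch_register_style(x, y)
```

The visible cost of the register style is that A0 is overwritten by the difference, so a value you still need has to live somewhere else.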

  • I recall when reading TAOCP that Knuth's MIX assembly supported subroutines by requiring the caller to modify the RET call to its own address (obviously not re-entrant!). This sort of thing was common when Knuth started in the early '60s, and may still have been around by the time of the Cray.

    • This Cray version isn't so bad - it just requires that if the callee itself calls other subroutines, then it has to save/restore this B00 return address register. You could still support re-entrant routines with this as long as you saved/restored the return address to a software stack rather than a fixed location, but I wonder if Cray compilers typically supported that?

      Apparently the reason for using a register vs stack for return address was because memory access (stack) was so much slower.

      I'm kinda tempted to use this for the VM I'm designing, which will run on the 6502 8-bit micro. The 6502 doesn't have any 16-bit operations, so pushing and popping using a software defined 16-bit stack pointer is slow, and saving return address to zero page would certainly be faster. It'd mean that rather than always pushing/popping the SP, you only do it in the callee if needed. It's an interesting idea!

    • Cray's design for Control Data before starting his own company was interesting. You were required to start each subroutine with a jump instruction, and the subroutine call instruction would modify the memory at that location to a jump back to the caller. To return from a subroutine you would just branch back to its beginning.
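
The patched-entry scheme described in the last reply can be sketched as a toy interpreter. The memory layout, the tuple encoding, and the `call` helper are all invented for illustration and are not the actual Control Data instruction format.

```python
# Toy model of "the call patches the subroutine's first word with a jump back".
# Instruction tuples and addresses are hypothetical.

memory = {
    10: ("JUMP", None),    # entry word: overwritten by every call
    11: ("WORK", "body"),  # the subroutine body
    12: ("JUMP", 10),      # "return": branch back to the entry word
}

def call(memory, entry, return_addr):
    """Patch the entry word, run from entry + 1, and report where control ends up."""
    memory[entry] = ("JUMP", return_addr)  # store a jump back to the caller
    pc = entry + 1
    trace = []
    while True:
        op, arg = memory[pc]
        if op == "WORK":
            trace.append(arg)
            pc += 1
        elif pc == entry:
            return arg, trace              # the patched jump leaves the subroutine
        else:
            pc = arg                       # ordinary jump inside the subroutine

first = call(memory, 10, 500)
second = call(memory, 10, 700)
print(first, second)  # (500, ['body']) (700, ['body'])
```

Because each call overwrites the same entry word, a nested or recursive call to the same routine would lose the outer return address, which is why this style is not re-entrant.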

Subroutine calls are for the weak :)

There's some more detail here: https://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM...

The following quote gives some sense of how "manual" this was:

> "On execution of the return jump instruction (007), register B00 is set to the next instruction parcel address (P) and a branch to an address specified by ijkm occurs. Upon receiving control, the called routine will conventionally save (B00) so that the B00 register will be free for the called routine to initiate return jumps of its own. When a called routine wishes to return to its caller, it restores the saved address and executes a 005 instruction. This instruction, which is a branch to (Bjk), causes the address saved in Bjk to be entered into P as the address of the next instruction parcel to be executed."

Details were up to the compiler that produced the machine code.
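
As a rough illustration of the convention in that quote, here is a small Python sketch. `ToyMachine`, `return_jump`, and the software stack are assumptions made for the example; the real machine works on parcel addresses and leaves the save location up to the software.

```python
# Toy model of a return-address register (loosely modelled on B00).
# Class and method names are hypothetical, not Cray terminology.

class ToyMachine:
    def __init__(self):
        self.b00 = None    # return-address register
        self.stack = []    # software stack, used only when a routine needs it
        self.log = []

    def return_jump(self, routine, next_addr):
        """Like the 007 return jump: B00 <- next address, then branch to the routine."""
        self.b00 = next_addr
        return routine(self)   # the routine returns the address it branches back to

def leaf(m):
    m.log.append("leaf")
    return m.b00               # leaf routine: B00 is untouched, branch straight back

def non_leaf(m):
    m.stack.append(m.b00)      # save B00 before making a call of our own
    m.return_jump(leaf, next_addr=200)
    m.log.append("non_leaf")
    m.b00 = m.stack.pop()      # restore the saved return address
    return m.b00               # branch to (B00), like the 005 instruction

m = ToyMachine()
resume = m.return_jump(non_leaf, next_addr=100)
print(resume, m.log)  # 100 ['leaf', 'non_leaf']
```

If `non_leaf` skipped the save and restore, it would branch to 200 (the address left behind by its own call to `leaf`) instead of 100.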

  • Essentially, the B00 register is a Top Of (Return) Stack or TOS register. It’s great for leaf routines.

    You have to push it to your preferred stack before the next operation. You do the cycle-counting to decide if it’s a good ISA for your implementation, or not.

    Obviously, ISAs with a JSR that pushes to stack are always using an extra ALU cycle for the SP math, then a memory write.

    Doing it with a (maybe costless) register transfer followed by (only sometimes) a stack PUSH can work out to the same number of cycles.

    With improvements in memory speed or CPU speed, that decision can flip.

    Consider that in this era, your ALU also had the job of incrementing the PC during an idle pipeline stage (maybe the instruction decode). Doing a SP increment for a PUSH might compete with that, so separating the two might make the pipeline more uniform. I don’t know any of the Cray ISAs so this is just a guess.