Comment by nocoiner
1 day ago
To this day, “Sun E10000 Starfire” is basically synonymous in my head with “top-of-the-line, bad-ass computer system.” What a damn cool name. It made a big impression on an impressionable youth, I guess!
Somebody gave a talk about the E10K at one of the early DefCon conferences and I was blown away. Having only worked with x86 architecture servers I couldn't believe the kind of "magic" dynamic reconfiguration enabled. I'm sad I never got to work with one.
I agree on all counts, but the installation I had at my job at the time regularly needed repairs..! Hopefully this was an exceptional case, but it gave me the impression of “redundancy added too much complexity to make the whole reliable.”
ETA: particularly because the redundancy was supposed to make it super reliable
We got the first E15000 outside of Sun when I was at SDSC; engineers from down the street at Towne Centre Drive came by to set it up...It was running Solaris 8 w/ a very specific kernel patch to make it boot; and the driver for the chassis fan control had never been completed, so they were running at 100% once the system powered on. It was like standing next to a Harrier doing a VTOL takeoff.
Also, when the system disk on the boot drawer failed, I discovered that it wasn't a standard Sun FCAL or SCA-80 HDD, but a 68-pin SCSI drive mounted to what appeared to be a custom-made drive cage that was unlike anything else we had on the floor. It was a real factory prototype.
I worry about this sometimes. There is this long tail of "reliability" you can chase: redundant systems, processes, voting, failover, "shoot the other node in the head" scripts, etc. But everything adds complexity; now it has more moving parts, more things that can go wrong in weird ways. I wonder if the system would be more reliable if it were a lot simpler and stupider: a single box that can be rebooted if needed.
It reminds me of the lesson of the Apollo computers. The AGC was the more famous computer, probably rightfully so, but there were actually two computers. The other was the LVDC, made by IBM for controlling the Saturn V during launch. Now that was a proper aerospace computer: redundant everything, a cannot-fail architecture, etc. In contrast, the AGC was a toy. However, this let the AGC be much faster and smaller; instead of reliability they made it reboot well, and instead of automatic redundancy they just put in two of them.
https://en.wikipedia.org/wiki/Launch_Vehicle_Digital_Compute...
There is something to be learned here, I am not exactly sure what it is. Worse is better?
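A rough way to put numbers on that trade-off, as a back-of-the-envelope sketch only: treat the failover machinery as one more component that has to work. The availability figures below are made up for illustration, and modelling failover as a simple series element is an assumption, not how any real system (Starfire, LVDC, or otherwise) was analysed.

```python
def series(*availabilities):
    """Availability when every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability when any one of the redundant components suffices."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def redundant_pair(box, failover):
    """Two identical boxes plus failover logic that must itself work.

    Assumption: the failover machinery sits in series with the pair.
    """
    return series(parallel(box, box), failover)


# Hypothetical figures, chosen only to illustrate the point.
BOX = 0.99               # one plain box that you reboot when it breaks
FLAKY_FAILOVER = 0.98    # complex, rarely exercised failover logic
SOLID_FAILOVER = 0.999   # well-tested failover logic

single_box = BOX
pair_with_flaky_failover = redundant_pair(BOX, FLAKY_FAILOVER)
pair_with_solid_failover = redundant_pair(BOX, SOLID_FAILOVER)

# Failover availability the pair needs just to match the single box.
break_even = BOX / parallel(BOX, BOX)

if __name__ == "__main__":
    print(f"single box:               {single_box:.4f}")
    print(f"pair, flaky failover:     {pair_with_flaky_failover:.4f}")
    print(f"pair, solid failover:     {pair_with_solid_failover:.4f}")
    print(f"break-even failover need: {break_even:.4f}")
```

Under these assumed numbers, the redundant pair only beats the single box if the added machinery is itself more reliable than the thing it protects, which is roughly the worry above.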
No, I think that was typical. Nostalgia tends to gloss over the reality of how dodgy the old unix systems were. The Sun guy had to show up at my site with system boards for the SPARCcenter pretty regularly.