How NASA built Artemis II’s fault-tolerant computer

5 days ago (cacm.acm.org)

The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.

  • During the time of the first Apollo missions, a dominant portion of computing research was funded by the defense department and related arms of government, making this kind of deterministic, WCET (worst-case execution time) analysis a dominant computing paradigm. Now that we have a huge free market for things like online shopping and social media, this is a bit of a neglected field and suffers from poor investment and mindshare, but I think it's still a fascinating field with some really interesting algorithms -- check out the work of Frank Mueller or Johann Blieberger.

    • It still lives on as a bit of a hard skill in automotive/robotics. As someone who crosses the divide between enterprise web software and hacking about with embedded automotive bits, I don't really lament that we're not using WCET and Real-Time OSes in web applications!

      12 replies →

    • > making this kind of deterministic, WCET (worst-case execution time) analysis a dominant computing paradigm.

      Oh wow, really? I never knew that. huh.

      I feel like as I grow older, the more I start to appreciate history. Curse my naive younger self! (Well, to be fair, I don't know if I would've learned history like that in school...)

    • Contrary to propaganda from the likes of Ludwig von Mises, the free market is not some kind of optimal solution to all of our problems. And it certainly does not produce excellent software.

      34 replies →

  • Time-triggered Ethernet is part of certified aircraft data buses and has a deep, decades-long history. I believe INRIA did work on this, feeding Airbus maybe. It makes perfect sense when you can design for it. An aircraft is a bounded problem space of inputs and outputs which can have deterministic required minima, and then you can build for it, and hopefully even have headroom for extras.

    Ethernet is such a misnomer for something which now is innately about a switching core ASIC or special purpose hardware, and direct (optical even) connects to a device.

    I'm sure there are also buses, dual redundant, master/slave failover, you name it. And given it's air or space probably a clockwork backup with a squirrel.

    • Aircraft software and components also form a proclaimed "working" ecosystem in lockstep: a baseline. This is why there are paper "additions" when a bug is discovered, until the bug is patched and the whole ecosystem of devices is lifted to the next "baseline".

  • You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.

    That alone is worth my tax dollars.

  • Agile is not meant to make solid, robust products. It’s so you can make product fragments/iterations quickly, with okay quality and out to the customer asap to maximize profits.

    • “Agile” doesn’t mean that you release the first iteration, it’s just a methodology that emphasizes short iteration loops. You can definitely develop reliable real-time systems with Agile.

      7 replies →

    • You can absolutely build robust products using agile. Apart from some of the human benefits of any kind of incremental/iterative development, the big win with Agile is a realistic way to elicit requirements from normal people.

    • You hopefully know that's not true. But it's a matter of quality goals. Need absolute robustness? Prioritize it and build it. Need speed and to be first to market? Prioritize and build it. You can do both in an agile way. Many would argue that you won't be as fast in a non-agile way. There is no bullet point in the Agile Manifesto saying to build unreliable software.

      1 reply →

    • The generous way of seeing it is that you don't know what the customer wants, and the customer doesn't know all that well what they want either, and certainly not how to express it to you. So you try something, and improve it from there.

      But for aerospace, the customer probably knows pretty well what they want.

    • The manifesto refers to “working software”. It does not say anything about “okay quality”.

    • ... and it mechanically promotes planned obsolescence by its nature (likely to be of disastrous quality). The perfect mur... errr... the perfect fraud.

  • Some of us still work on embedded systems with real-time guarantees.

    Believe it or not, at least some of those modern practices (unit testing, CI, etc) do make a big (positive) difference there.

    • The depressing part is that these "modern practices" were essentially invented in the 1960s by defense and aerospace projects like the NTDS, LLRV/LLTV, and Digital Fly-by-Wire to produce safety-critical software, and the rest of the software industry simply ignored them until the last couple of decades.

  • Microsoft fired all QA people ten or fifteen years ago. I'd imagine it's a similar story: boxed software needed much higher guarantees of correctness. Digital delivery leaves much more room for error, because it leaves room for easier, cheaper fixes.

  • > “Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline,” Riley explained. “As a result, technical debt accumulates, and maintainability and system resiliency suffer.”

    Not sure I agree with the premise that "doing agile" implies decision making at odds with architecture: you can still iterate on architecture. Terraform etc. make that very easy. Sure, tech debt accumulates naturally as a byproduct, but every team I've been on regularly does dedicated tech debt sprints.

    I don't think the average CRUD API or app needs "perfect determinism", as long as modifications are idempotent.

    • In theory, yes you could iterate on architecture and potentially even come up with better one with agile approach.

      In practice, so many aspects follow from it that it’s not practical to iterate with today’s tools.

    • Agile is like communism. Whenever something bad happens to people who practice agile, the explanation is that they did agile wrong; had they been doing true agile, the problem would've been totally avoided.

      In reality, agile doesn't mean anything. Anyone can claim to do agile. Anyone can be blamed for only pretending to do agile. There's no yardstick.

      But it's also easy to understand what the author was trying to say, if we don't try to defend or blame a particular fashionable ideology. I've worked on projects that required high quality of code and product reliability and those that had no such requirement. There is, indeed, a very big difference in approach to the development process. Things that are often associated with agile and DevOps are bad for developing high-quality reliable programs. Here's why:

      The development process before DevOps looked like this:

          1. Planning
          2. Programming
          3. QA
          4. If QA found problems, goto 2
          5. Release
      

      The "smart" idea behind DevOps, or, as it used to be called at the time "shift left" was to start QA before the whole of programming was done, in parallel with the development process, so that the testers wouldn't be idling for a year waiting for the developers to deliver the product to testers and the developers would have faster feedback to the changes they make. Iterating on this idea was the concept of "continuous delivery" (and that's where DevOps came into play: they are the ones, fundamentally, responsible to make this happen). Continuous delivery observed that since developers are getting feedback sooner in the development process, the release, too, may be "shifted left", thus starting the marketing and sales earlier.

      Back in those days, however, it was common to expect that testers would be conducting a kind of double-blind experiment. I.e., testers intentionally weren't supposed to know the ins and outs of the code, so that they wouldn't inadvertently side with the developers on whatever issues they discovered -- something that today, perhaps, would be called "black-box testing". This became impossible with CD because testers would be incrementally exposed to the decisions governing the internal workings of the product.

      Another aspect of the more rigorous testing is the "mileage". Critical systems, normally, aren't released w/o being run intensively for a very long time, typically orders of magnitude longer than a single QA cycle (let's say, if the QA gets a day of computer time to run their tests, then the mileage needs to be a month or so). This is a very inconvenient time for development, as feature freeze and code freeze are still in effect, so the coding can only happen in the next version of the product (provided it's even planned). But the incremental approach used by CD managed to sell a lie that says "we've run the program for a substantial amount of time during all the increments we've made so far, therefore we don't need to collect more mileage". This, of course, overlooks the fact that changes in the program don't contribute proportionally to the program's quality or performance.

      In other words, what I'm trying to say is that agile and DevOps practices made the development process cheaper by making it faster while still maintaining some degree of quality control; however, they are inadequate for products with high quality requirements because they don't address the worst-case scenarios.

  • As a '70s child who was there when the whole agile thing took over and systems engineers got rebranded as devops, I fully agree with them.

    Add TDD, XP and mob programming as well.

    While in some ways better than pure waterfall, most companies never adopted them fully, and in some scenarios they are better suited to the Silicon Valley TV show than anything else.

  • If you look at code as art, where its value is a measure of the effort it takes to make, sure.

    • If your implication is that stencil art does not take effort then perhaps you may not fully appreciate Banksy. Works like Gaza Kitty or Flower Thrower don’t just appear haphazardly without effort.

  • It's not like the approach they took is any different. Just slapped 8x the number of computers on it, calculating the same thing and waiting to see if they disagree. Not the pinnacle of engineering. The equivalent of throwing money at the problem.

    • >Just slapped 8x the number of computers on it

      ‘Just’ is not an appropriate word in this context. Much of the article is about the difficulty of synchronization, recovery from faults, and the redundant backup and recovery systems.

  • I take the opposite message from that line - out of touch teams working on something so over budget and so overdue, and so bureaucratic, and with such an insanely poor history of success, and they talk as if they have cured cancer.

    This is the equivalent of Altavista touting how amazing their custom server racks are when Google just starts up on a rack of naked motherboards and eats their lunch and then the world.

    Let's at least wait till the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.

    • You mean like this?

      "With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."

      https://blog.codinghorror.com/building-a-computer-the-google...

      9 replies →

    • No, space is just hard.

      Everything is bespoke.

      You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.

      People died on the Apollo missions.

      It just costs that much.

      7 replies →

    • Modern software development is a fucking joke. I’m sorry if that offends you. Somehow despite Moore’s law, the industry has figured out how to actually regress on quality.

      6 replies →

    • One simply does not [“provision” more hardware|(reboot systems)|(redeploy software)] in space.

    • What would you suggest? Vibe coding a react app that runs on a Mac mini to control trajectory? What happens when that Mac mini gets hit with an SEU or even a SEGR? Guess everyone just dies?

      7 replies →

    • > ...they talk as if they have cured cancer.

      I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.

      2 replies →

> Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

> “A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results.

> Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed-silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.
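
As a rough sketch, the priority-ordered selection described there might look something like this (structure and names are my guesses, not the actual flight code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define NUM_FCM 4

    typedef struct {
        bool   healthy;  /* false once the self-checking pair has failed silent */
        double output;   /* latest command computed by this FCM                 */
    } fcm_channel;

    /* Pick the output of the first healthy FCM in priority order.
     * Returns false only if every channel has gone silent.         */
    static bool select_source(const fcm_channel ch[NUM_FCM], double *out)
    {
        for (size_t i = 0; i < NUM_FCM; i++) {
            if (ch[i].healthy) {
                *out = ch[i].output;
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        /* Channel 0 has failed silent; the selector falls through to channel 1. */
        fcm_channel ch[NUM_FCM] = {
            { false, 0.0 }, { true, 1.23 }, { true, 1.23 }, { true, 1.23 }
        };
        double cmd;
        if (select_source(ch, &cmd))
            printf("selected command: %f\n", cmd);
        else
            printf("all channels silent\n");
        return 0;
    }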

One part that seems omitted in the explanation is what happens if both CPUs in a pair for whatever reason perform an erroneous calculation and their results match: how would that source be silenced without comparing its results with other sources?

  • These CPUs are typically implemented as lockstep pairs on the same die. In a lockstep architecture, both CPUs execute the same operations simultaneously and their outputs are continuously compared. As a result, the failure rate associated with an undetected erroneous calculation is significantly lower than the FIT rate of an individual CPU.

    Put another way, the FIT (Failure in Time) value for the condition in which both CPUs in a lockstep pair perform the same erroneous calculation and still produce matching results is extremely small. That is why we selected and accepted this lockstep CPU design.
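
    A back-of-the-envelope way to see the scale (illustrative numbers only, and assuming independent faults, which a real design has to justify):

        #include <stdio.h>

        int main(void)
        {
            /* Illustrative only: assume each CPU of a lockstep pair sees
             * ~1000 FIT of soft errors (1000 failures per 1e9 device-hours),
             * i.e. roughly this probability of a fault in any given hour: */
            double p_single = 1000.0 * 1e-9;        /* ~1e-6 */

            /* Probability both CPUs fault within that same hour, assuming
             * independence, before even requiring them to produce identical
             * wrong outputs inside the same microsecond compare window: */
            double p_both = p_single * p_single;    /* ~1e-12 */

            printf("single CPU: %.1e /h, undetected pair fault: < %.1e /h\n",
                   p_single, p_both);
            return 0;
        }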

  • The probability of a simultaneous cosmic-ray bit-flip in 2 CPUs, in the same bit, is ridiculously low; they're probably more likely to get hit by a stray asteroid propelled by a solar flare.

    But still, Murphy's law applies really well in space, so who knows.

  • In the Shuttle they would use command averaging. All four computers would get access to an actuator which would tie into a manifold which delivered power to the flight control surface. If one disagreed then you'd get 25% less command authority to that element.
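
    As a toy illustration of that force-summing idea (numbers made up):

        #include <stdio.h>

        int main(void)
        {
            /* Four channels each command 1.0 "unit" of surface deflection;
             * one faulty channel commands 0.0. Summing the four servo
             * contributions is effectively averaging the commands. */
            double cmd[4] = { 1.0, 1.0, 1.0, 0.0 };
            double sum = 0.0;
            for (int i = 0; i < 4; i++)
                sum += cmd[i];
            printf("authority delivered: %.0f%%\n", 100.0 * sum / 4.0);  /* 75% */
            return 0;
        }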

    • > In the Shuttle they would use command averaging

      I think the Shuttle, operating only in LEO, had more margin for error. Averaging a deep-space burn calculation is basically the same as killing the crew.

      5 replies →

  • I wondered about this as well.

    OTOH, consider that the "pick the majority from 3 CPUs" approach that seems to have been used in earlier missions (as mentioned in the article) would fail the same way if two CPUs computed the same erroneous result.

  • I initially found this odd too. However, I think the catastrophic failure probability is the same as the prior system, and presumably this new design offers improvements elsewhere.

    Under the 3-voting scheme, if 2 machines have the same identical failure -- catastrophe. Under the 4 distinct systems sampled from a priority queue, if the 2 machines in the sampled system have the same identical failure -- catastrophe. In either case the odds are roughly P(bit-flip) * P(exact same bit-flip).

    The article only hints at the improvements of such a system with the phrasing: " simplifies the complex task", and I'm guessing this may reduce synchronization overhead or improve parallelizability. But this is a pretty big guess to be fair.

  • Indeed. It seems like systems 1 and 2 could fail identically while 3, 4, 5, 6, 7, and 8 are all correct, and as described the wrong answer from 1 and 2 would be chosen (with a "25% majority"??).

Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?

I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.

  • NASA cFS is written in plain C (trying to follow MISRA C, etc.). It's open on GitHub and used by many companies. It's typically run over FreeRTOS or RTEMS, not sure here.

    Personally I find the project extremely messy, and kinda hate working with it.

    • It's most likely using VxWorks for its OS, since I believe it's one of the only ARINC 653 OSes fully certified for human flight. It's used in most aircraft and space missions.

      1 reply →

  • Not sure about the primary FSW but the BFS uses cFS[0]. As the sibling comment mentions, you can check it out on GitHub. Sadly I believe NASA keeps most of their best code private, probably siloed into mission-specific codebases. Still, the cFS repo is an awesome crash course on old-school Flight Software techniques.

    [0] https://youtu.be/4doI2iQe4Jk?si=ucMoIdw7x_QgZR32

    • Helpful video, thanks!

      At about 1:20, the presenter says the BFS uses a different OS and hardware (not sure if that means a different instance, or a different class, so to speak).

When I was first starting out as a professional developer 25 years ago doing web development, I had a friend who had retired from NASA and had worked on Apollo.

I asked him “how did you deal with bugs”? He chuckled and said “we didn’t have them”.

The average modern AI-prompting, React-using web developer could not fathom making software that killed people if it failed. We’ve normalized things not working well.

  • There's a different level of 'good enough' in each industry, and that's normal. When the worst damage from a bad site is reduced revenue (or even just a missed free user), you have less motivation to do it right compared to a living human coming back in one piece.

    • Yes, of course, but a culture of “good enough” can go too far. One may work in a lower-risk context, but we can still learn a lot from robust architectural thinking. Edge cases, security, and more.

      Low quality for a shopping cart feels fine until someone steals all the credit card numbers.

      2 replies →

NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.

I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used.

  • One of the things I loved about the Shuttle is that all five computers were mounted not only in different locations but in different orientations in the shuttle, providing some additional hardening against radiation by presenting different cross sections to any incident event.

  • NASA actually publishes these things on their NTRS page. The Primary flight controller is rocking Green Hills INTEGRITY RTOS on BAE RAD750s in a quad redundant config, with a VxWorks backup on a Frontgrade Gaisler LEON4 (SPARC V8). This allowed for parts of the ARINC653 spec regarding time and space partitioning of the RTOS scheduler to be used.

    You can read more about it below (when the server isn't throwing errors): https://ntrs.nasa.gov/api/citations/20190000011/downloads/20... https://ntrs.nasa.gov/api/citations/20230002185/downloads/FS...

  • I read, for probe missions, that one technique is to get a bunch of consumer chips and irradiate the hell out of them. Now take the winner model and get a bunch of those. Irradiate them. The winner goes to Mars.

    The claim was that some plain old chips are exquisitely radiation resistant, and it's not clear why.

Some related good books I have been studying the past few years or so. The Spark book is written by people who've worked on Cube sats:

  * Logical Foundations of Cyber-Physical Systems

  * Building High Integrity Applications with SPARK 

  * Analysable Real-Time Systems: Programmed in Ada

  * Control Systems Safety Evaluation and Reliability (William M. Goble)

I am developing a high-integrity controls system for a prototype hoist to be certified for overhead hoisting with the highest safety standards and targeting aerospace, construction, entertainment, and defense.

I'm curious: In the current moon flyby, how often did some of these fallback methods get active? Was the BFS ever in control at any point? How many bitflips were there during the flight so far?

  • The same question I wanted to ask. I'd be very curious to learn about their post-mission analysis to find out how many bit flips occurred and how many times this redundant system prevented the mistakes from causing issues.

The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; ARINC goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.

Their redundancy architecture is interesting. I'd be curious of what innovations went into rad-hard fabrication, too. Sandia Secure Processor (aka Score) was a neat example of rad-hard, secure processors.

Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.

I did VOS and database performance stuff at Stratus from 1989-95. Stratus was the hardware fault tolerant company. Tandem, our arch rivals, did software fault tolerance. Our architecture was “pair and spare”. Each board had redundant everything and was paired with a second board. Every pin out was compared on every tick. Boards that could not reset called home. The switch from Motorola 68K to Intel was a nightmare for the hardware group because some instructions had unused pins that could float.

  • > Stratus was the hardware fault tolerant company. Tandem, our arch rivals, did software fault tolerance. Our architecture was “pair and spare”.

      To expand on this, when Tandem switched to MIPS from their proprietary processors, the CPUs were duplicated on a board and compared, and if they disagreed, the logical CPU would halt, similar to Stratus. The software-pair backup processes in a different logical CPU would then take over.

So, an honest and perhaps a bit stupid question.

Astronauts have actual phones with them - iPhone 17s, I think? And a regular ThinkPad that they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips be causing constant crashes? Or is it simply that these errors happen but nothing really detects them, so the execution continues unhindered?

  • They’re not mission-critical equipment. If they fail, nobody dies.

    They’re not radiation hardened, so given enough time, they’d be expected to fail. Rebooting them might clear the issue or it might not (soft vs hard faults).

    Also impossible to predict when a failure would happen, but NASA, ESA and others have data somewhere that makes them believe the risk is high enough that mission critical systems need this level of redundancy.

    • >>They’re not mission-critical equipment. If they fail, nobody dies.

      Yes, for sure, but that's not my question - it's not a "why is this allowed" but "why isn't this causing more visible problems with the iphones themselves".

      Like, do they need constant rebooting? Does this cause any noticeable problems with their operation? Realistically, when would you expect a consumer-grade phone to fail in these conditions?

      2 replies →

I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?

  • Not a direct answer but probably as good information as you can get: https://static.googleusercontent.com/media/research.google.c...

    Basically, yes, radiation does cause bit flips, more often than you might expect (but still a rare event in the grand scheme of things, but enough to matter).

    And radiation in space is much “worse” (in quotes because that word is glossing over a huge number of different problems, not just intensity).

  • IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failure in Time, i.e. 1E-9 failures/hour).

    That was in the 2000s though, and for embedded memory above 65nm.

    And obviously on earth.
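
    To get a feel for what that number means, here is the unit conversion from FIT (failures per 1e9 device-hours) to per-year figures (whether the rate is per device or per Mbit depends on the reference):

        #include <stdio.h>

        int main(void)
        {
            /* FIT = failures per 1e9 device-hours; convert to per-year. */
            const double hours_per_year = 24.0 * 365.0;
            double fit_lo = 700.0, fit_hi = 1200.0;

            printf("%.4f to %.4f soft errors per year\n",
                   fit_lo * 1e-9 * hours_per_year,   /* ~0.006 */
                   fit_hi * 1e-9 * hours_per_year);  /* ~0.011 */
            /* i.e. roughly one soft error every 100-160 years at ground
             * level; rates in space are far higher. */
            return 0;
        }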

Raft consensus with pairs? I smell bulls*t there. Even when they say it's 8, it boils down to pair-wise checks, without any consensus. Just the consensus of wrong.

Also https://en.wikipedia.org/wiki/TTEthernet looks like bolting time-guaranteed switching networks onto randomizing ethernet hardware. Sounds incredibly cheap and stupid. Either stay with guaranteed real-time switching, or give up on hard real-time guarantees and favor performance, simplicity and cheap stock hardware.

Monkeys in space.

  • Extremely bold take to insist you're smarter than people who literally just flew to the actual moon and back.

"High-performance supercomputers are used for large-scale fault injection, emulating entire flight timelines where catastrophic hardware failures are introduced to see if the software can successfully ‘fail silent’ and recover."

I assume this means they are using a digital twin simulation inside the HPC?

Some people are claiming it's the good old RAD750 variant. Is there anything that talks about the actual computer architecture? The linked article is desperately void of technical details.

  • It's a new (2002) variant of the same RAD750 architecture.

      CPUs:  IBM PowerPC 750FX (Single-core,  900 MHz, 32-bit, radiation hardened) 
      RAM:  256 MB (per processor)
      OS: VxWorks (Real-time OS)
      Network: TTEthernet (Time-Triggered Ethernet) at 1 Gbps
      Programming: MISRA C++, flight control laws from Simulink and MATLAB.

How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything

  • Orbital datacenters are a hypothetical infinite money glitch that could exist between the times:

    - after a general solution to the extra-terrestrial manufacturing bootstrap problem is found, and
    - before the economy patches the exploit that a scalable commodity with near-zero cost and non-zero value can exist.

    It'll also destroy the commercial launch market, because anything of size you want can be made in space, leaving only tiny settler transports and sovereign government launches as viable, so I'm not sure why commercial space people find it to be a commercially lucrative thing. The time frame within which this IMG can exist can also be zero or negative.

    The assumption is also, like, that they'll find a way to rent out some rocks for cash, so anyone with access to rocks will be doing it as it becomes viable, and so I'm not even sure the "space" part of space datacenters even matters. Earth is kinda space too in this context.

  • They don't go into it here, but I thought that NASA also used like 250nm chips in space for radiation resistance. Are there even any radiation-resistant GPUs out there?

  • You don't need 4x redundancy for everything. If no humans are aboard, you have 2x redundancy and immediately reboot if there is a disagreement.

> “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”

It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!

Multiple and dissimilar redundancy is nice and all that, but is there a manual override? Apollo could be overridden manually (and on Apollo 11 and 13 it had to be), but is this still possible and feasible? I'd guess so, as it's still manned by (former) test pilots, much like Apollo.

NASA describes some impressive work for runtime integrity, but the lack of mention of build-time security is surprising.

I would expect to see multi-party-signed deterministic builds etc. Anyone have any insight here?

    • What would the threat profile be here to require that? Regardless, I'd be a little surprised if they didn't have anything like that; provenance is very important in aerospace, with hardware tracked to the point that NTSB investigators looking at a crash can tell what ingot a bolt was made from.

    • In my experience government just uses RedHat which is -not- reproducible and -not- full source bootstrapped so a single person in the supply chain could maliciously or accidentally backdoor everything. Maybe the goal of the supply chain attacker is just embarrassing the Americans at best or cause a material loss of life at worst.

      I would -hope- NASA does not trust their OS supply chains to a single person for high risk applications, but given even major companies I audit do this with billions of dollars on the line, it would not shock me if NASA has the same stance which worries me a bit.

      They would need to be using something like heavily customized buildroot or stagex to produce deterministic OS images.

The part about triple-redundant voting systems genuinely blew my mind — it's such a different world from how most of us write software day to day, and honestly kind of humbling.

It would be nice to see some of the software source. I’m super interested, and I think I helped pay for it.

The Artemis computer handles way more flight functions than Apollo did. What are the practical benefits of that?

This electrify & integrate playbook has brought benefits to many industries, usually where better coordination unlocks efficiencies. Sometimes the smarts just add new failure modes and predatory vendor relationships. It’s showing up in space as more modular spacecraft, lower costs, and more mission flexibility. But how is this playing out in manned spacecraft?

Typo in the first sentence of the first paragraph is oddly comforting since AI wouldn't make such a typo, heh.

Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.

  • I think everyone should now make mistakes so we can distinguish human vs. AI.

[flagged]

  • > Dissimilar redundancy eliminates that risk. A completely different OS, different codebase, different development team.

    Not entirely true. During my uni years I heard of a case where two independent teams used the same textbook, which had an error, to implement a feature, thus resulting in the same failure mode.

[flagged]

  • This is similar to the difference between using error-correcting codes and using erasure codes combined with error-detecting codes.

    The latter choice is frequently simpler and more reliable for preventing data corruption. (An erasure code can be as simple as having multiple copies and using the first good copy.)
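
    A minimal sketch of the "multiple copies plus an error-detecting check" idea (the XOR checksum is just a stand-in for a real CRC):

        #include <stdio.h>

        #define COPIES 3
        #define LEN    8

        /* Toy checksum standing in for a real CRC / error-detecting code. */
        static unsigned char checksum(const char *d, size_t n)
        {
            unsigned char s = 0;
            for (size_t i = 0; i < n; i++)
                s ^= (unsigned char)d[i];
            return s;
        }

        int main(void)
        {
            char copies[COPIES][LEN] = { "DATA-OK", "DATA-OK", "DATA-OK" };
            unsigned char good_sum = checksum(copies[0], LEN);

            copies[0][2] ^= 0x40;              /* corrupt the first copy */

            /* Recovery is just "use the first copy whose check passes". */
            for (int i = 0; i < COPIES; i++) {
                if (checksum(copies[i], LEN) == good_sum) {
                    printf("using copy %d: %s\n", i, copies[i]);
                    return 0;
                }
            }
            printf("all copies bad\n");
            return 1;
        }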

  • > make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness

    Does this mean you have to trust the already compromised system?

  • How can you remove a component from the decision set if it is the only component in the whole decision set?