Comment by chadd

6 days ago

I'm guest lecturing at a Harvard class tomorrow on systemic failures in decision making, using the Columbia and Challenger disasters as case studies, and last night I changed my slides to include Artemis II, because it could literally happen again.

This broken safety culture has been around since the beginning of the Shuttle program.

In 1980, Gregg Easterbrook published "Goodbye, Columbia" in The Washington Monthly [1], warning that NASA's "success-oriented planning" and political pressure were creating the conditions for catastrophe. He essentially predicted Columbia's heat-shield failure a year before the shuttle's first flight.

Then Challenger happened in 1986, and the Rogers Commission identified hierarchy, communication failures, and management overriding engineering judgment as causes.

Then Columbia happened in 2003. The CAIB found NASA had not implemented the 1986 recommendations [2].

Now Charles Camarda (who flew the first shuttle mission after Columbia and is literally a heat shield expert!) is saying it's happening again.

[1] https://www.iasa-intl.com/folders/shuttle/GoodbyeColumbia.ht...

[2] Columbia Accident Investigation Board Report, Chapter 8: https://www.nasa.gov/columbia/caib/html/start.html

> This broken safety culture has been around since the beginning of the Shuttle program.

It's broken everywhere. I have worked in some dysfunctional shops, and the problem I see time and time again is that the people who make it into management are often egoists who care about nothing but the financial compensation and clout the job title bestows upon them. That, or they think management is the same as being a shotgun-toting sheriff overseeing a chain gang working in the summer heat in the Deep South.

I've worked with managers who would argue with you even when they knew they were wrong, because they were incapable of accepting humiliation. I've worked with managers who were wallflowers, so afraid of confrontation or negative emotions that they covered up every issue they could to avoid any potential negative interaction with their superiors. That manager was also bullied by other managers and even some employees.

A lot of it is ego, along with a heavy dose of machismo, depending on the shop. I've seen managers let safety go right down the tubes because "don't be such a pussy." It's a bad culture that has to go away.

  • A simplistic answer would be to ensure that incentives are aligned with safety and success. Then that leads to the evergreen problem of Goodhart’s Law (when a measure becomes a target, it ceases to be a good measure).

    Even if it can't ever be truly fixed, at least recognizing the issues and shining daylight on decisions for some form of accountability should be a base-level approach. (A toy simulation of the Goodhart effect follows below.)
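
    Here's that toy simulation - purely illustrative, with invented numbers, showing only the "regressional" flavor of Goodhart's Law, where selecting hard on a proxy metric breaks its link to the thing you actually care about:

      # Toy Goodhart simulation: a proxy metric tracks true quality across
      # the whole population, but among the top 1% selected *on the proxy*,
      # the correlation largely evaporates. All numbers are invented.
      import random

      random.seed(0)
      pop = []
      for _ in range(100_000):
          quality = random.gauss(0, 1)                      # what we actually want
          proxy = 0.7 * quality + 0.7 * random.gauss(0, 1)  # what we can measure
          pop.append((quality, proxy))

      def corr(pairs):
          n = len(pairs)
          mq = sum(q for q, _ in pairs) / n
          mp = sum(p for _, p in pairs) / n
          cov = sum((q - mq) * (p - mp) for q, p in pairs) / n
          vq = sum((q - mq) ** 2 for q, _ in pairs) / n
          vp = sum((p - mp) ** 2 for _, p in pairs) / n
          return cov / (vq * vp) ** 0.5

      print(f"whole population: r = {corr(pop):.2f}")  # strong (~0.7)
      top = sorted(pop, key=lambda t: t[1])[-1000:]    # top 1% by the proxy
      print(f"top 1% by proxy:  r = {corr(top):.2f}")  # much weaker

    The same mechanism bites safety metrics: reward "days without a reported incident" and you select for non-reporting as much as for safety.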

  • > people who make it into management are often egoists

    > they were incapable of accepting humiliation

    I agree mostly but here is a different take on it: I think these are normal human feelings and behaviors - not the best of us, but not unusual either. If we want to get good things done, we need to work with and through human nature. Power corrupts everyone and shame is generally the most painful thing for humans.

    Putting people in a position where they need to treat their power with absolute humility or accept humiliation (and a major blow to their careers) in order to do the right thing is going to fail 99% of the time. (I'm not saying people can't do those things or that we shouldn't work hard and aspire to them, but it won't happen reliably with any but a few people.) That expectation itself is a cultural, organizational, and managerial failure. If you see a system in which so many fail, then the problem is the system.

    And when I say 'managerial' failure, I include leadership by everyone and also 'managing up'. We're all responsible for and agents of the team's results, and whatever our role we need to prevent those situations. One important tactic is to anticipate that problem and get ahead of it, putting the team in a position where the risk is proactively addressed and/or they have the flexibility to change course without 'humiliation'. We're all responsible for the team's culture.

    I think many who blame others underestimate their own human nature, the effect of power on them, and their willingness to endure things like humiliation. Rather than criticising others, I keep my attention on the one in the mirror and on strategies to avoid situations equally dangerous to my own character; otherwise I'll end up doing the same very human things.

    EDIT: While I still agree with everything I wrote above, there is an exceptional cultural problem here, one which you'll recognize and which is common to many SV leaders, the Trump administration, and others you're familiar with (and which needs a name ...). From the document referenced in the OP by "heat shield expert and Shuttle astronaut Charles Camarda, the former Director of Engineering at Johnson Space Center."

    "Instead, the meeting started with his [Jared Isaacman, the new NASA Administrator's] declaration that the decision was final. We would launch Artemis II with a crew, even though the uncrewed Artemis I mission around the Moon returned with a seriously damaged heat shield, a failure in my opinion. I was not going to be allowed to present my position on why the decision was flawed. Instead, the public would hear, through the two reporters allowed to attend, the Artemis Program narrative, only one side of the story. They would be bombarded with technical information which they would have very little time to understand ...

    Jared could claim transparency because the only thermal protection expert and public dissenter, me, was present. ...

    I was allowed only one-day to review some of the technical documents which were not open to the public and which were classified Controlled Unclassified Information/International Traffic and Arms Regulations (CUI/ITAR) prior to the Jan.8th meeting. ..."

    https://docs.google.com/document/d/1ddi792xdfNXcBwF8qpDUxmZz...

    • > Putting people in a position where they need to treat their power with absolute humility or accept humiliation (and a major blow to their careers) in order to do the right thing is going to fail 99% of the time.

      I don't know... we select those people. Usually not for their ability to treat their power with humility, though.

      That's my argument in favour of quotas (e.g. for women): the way we select people in power now, we tend to end up with old white males who have the kind of relationship with power that we all know.

      By deciding to select someone different (e.g. a woman), we may realise that not all humans are... well, old white males. Not that we should select someone incompetent! But when we put someone in a position of power, I am convinced that many of the competing candidates are competent. We just tend to choose "the most competent" (for some definition of "the most"), which may not mean anything. For those positions, maybe it's more that either you are competent or you are not.

      Say that, from all the "competent" candidates, we systematically selected women for a while. We would end up with profiles that are not "old white males", and we might realise that it works just as well. Or even better. And that maybe some humans can treat power with humility.

      And if that got us to accept that those are desirable traits for people in power, it may serve men as well: plenty of men are generally not selected for positions of power. Forcing us to realise this by having quotas of minorities (say women) may actually help "white old males who can treat their power with humility" get recognised eventually.


    • > If you see a system in which so many fail, then the problem is the system.

      Nope, it starts with individuals, often way up above. Calling a system problematic is essentially saying no one is responsible.


The most frustrating part of the whole thing is that when you read Charles Camarda’s thoughts after his meeting with NASA in January, it could have been written in 1986 or in 2003.

https://docs.google.com/document/u/1/d/1ddi792xdfNXcBwF8qpDU...

It’s pretty clear at this point that the shuttle was broken from the design stage. But seeing the same powder keg of safety, budget, and immovable time constraints applied to a totally different platform decades later feels like sitting through a bad movie for the third time.

What strikes me is not the systemic failures but the intense culture of secrecy.

Reports are heavily redacted. They aren't shared. Failures aren't acknowledged. Engineering models aren't released. That secrecy eventually causes what we see today.

It’s fundamentally a human coordination problem that cannot be solved.

The more populated and complex an organization gets, the more impossible it becomes to maintain a singular value vector ("get these people around the Moon safely").

Everyone finds meta-vectors (keep my job, reduce my own accountability) that maintain their own individual stability, such that if the whole thing fails they won't feel liable.

  • It can't be solved 100%, but it can be _mostly_ solved with systemic buy-in to the safety culture. Commercial aviation is a great example IMO.

    We've spent the last several decades making sure that every single person trained to participate in commercial aviation (maintenance, pilots, attendants, ATC, ground crew) knows their role in the safety culture, and that each of them not only has the power but the _responsibility_ to act to prevent possible accidents.

    The Swiss Cheese Model [1] does a great job of illustrating this principle and imparting the importance of each person's role in safety culture (a toy numerical sketch follows the link below).

    A big missing piece with manned space flight, IMO, is the lack of decision-making authority granted to lower-level staff. A junior pilot acting as first officer on their very first commercial flight with real passengers has the authority to call a go-around even if a seasoned captain is flying the plane. AFAIK no such 'anyone can call a no-go' authority exists within NASA.

    [1] https://en.wikipedia.org/wiki/Swiss_cheese_model
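
    To make the model concrete, here's a minimal numerical sketch (my own illustration; the per-layer odds are made up, not NASA or FAA figures, and the layers are assumed independent). An accident requires the holes in every layer to line up, so removing even one layer multiplies the accident rate:

      # Toy Swiss Cheese Model: independent layers, invented numbers.
      hole_prob = [0.10, 0.20, 0.05, 0.15]  # P(layer misses the hazard)

      p_accident = 1.0
      for p in hole_prob:
          p_accident *= p                   # all layers must fail at once

      print(f"all layers in place: {p_accident:.6f}")                  # 0.000150
      # Drop one layer, e.g. the first officer's authority to call a go-around:
      print(f"one layer removed:   {p_accident / hole_prob[-1]:.6f}")  # 0.001000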

    • Safety culture requires the ability to learn from mistakes, the capability to ground planes (without that turning into a political problem), and someone to foot the bill. (Which did not always happen: Boeing's MCAS shipped with a single-point-of-failure AoA sensor and no retraining, the end of a chain of cost-cutting decisions. And of course there were the usual problems with market-distorting subsidies to both Boeing and Airbus.)

      NASA's missions are way too big, because the science payloads are unique, so they "can't do" launch early, launch often. And then things sit in storage for years, waiting for budget. (And manned flights are in an even worse situation of course, because they are two-way.)

      And there's too much sequential dependency in the marquee projects (without enough slack to absorb problems if some earlier outcome is unfavorable); in other words, because of time and cost constraints, the projects did not include enough development, testing, and verification.

      NASA is doing too many things, and too much of it is politics. It should be more like a grant organization, rewarding cost-efficient scientific (and engineering) progress, in a specific broad area ("spaaace!"), like the NIH (but hopefully not like the NIH).


    • No, CRM is a disaster; you clearly are not in aviation. The reliability in aviation came from incredibly strict regulation and engineering improvements, NOT from structural alignment of parties. They were forced to get safer by the government, if you can believe there was a time when the government did anything useful at all.

      I could go off for literally hours on this topic, but suffice it to say I've done an unbelievable amount of CRM as an officer in the United States Air Force who flew on and executed hundreds of combat missions in Iraq.

      My friends from Shell 77 are all dead because of CRM failures.

      Sounds like you need to watch The Rehearsal.


    • Yes, and... NASA space programs (doing rare, unknown things) are different from commercial aviation (doing a frequent, known thing with high safety). Best be careful applying solutions from the latter to the former.

      Adding more safety layers on top of a fundamentally misaligned organizational process also generally balloons costs and delivery timelines (see: NASA).

      The smarter play is to better align all stakeholders' incentives, from the top (including the president and Congress) to the bottom, to the desired outcome.

      Right now most parties are working towards very different goals.

  • 100% agree. I definitely see this in the tech industry, and it all begins and ends with psychological safety. Right now there's job pressure in tech, which creates this toxic sense coming from management that they can fire anyone at any time because they don't like you. It essentially fosters a culture of not rocking the boat or "pissing off the wrong person." The result is, you keep your mouth shut or significantly risk being penalized on your annual performance review. Add inflation and the ever-rising cost of living, and for an individual contributor or even front-line management, the choice is very clear. This is obviously a recipe for catastrophe when you're dealing with human lives.

    When you’re a rocket scientist at NASA, you also have relatively few alternatives other than SpaceX or Boeing.

  • There's a deeper problem behind a lot of this:

    The problem is that people's jobs are treated as "do X." What happens when someone says that, given the constraints, it's impossible to safely do X? The naysayers get replaced, and X gets done anyway.

    Actual safety only comes when there is an external agency that monitors safety, and accomplishing X is not part of its objective.

In the 80s, I guess, there was pressure to one-up the Soviets, so everything had to be done yesterday. But Artemis has existed for most of my adult life at various levels of maturity (Orion and its predecessors certainly have), and more time has now been spent on it than passed between that famous Kennedy speech and the actual Moon landing (where there was apparently no issue with safety culture).

Considering how much humanity has allegedly advanced since then, I don't understand what we are gaining that's caused us to have to abandon safety.

As an aside, do you have any suggestions for "state of the art" reading on safety culture?

  • Although not especially “current,” Normal Accidents: Living with High-Risk Technologies is a 1984 book by Yale sociologist Charles Perrow, which analyses complex systems from a sociological perspective. Perrow argues that multiple, unexpected failures are built into society's complex and tightly coupled systems, and that accidents are unavoidable and cannot be designed around. Several historical disasters are analysed. I read a newer edition published in 1999, to which the author had added a chapter on Chernobyl, which turned out to be a textbook example of Perrow's theory (in particular, that adding fail-safes also adds complexity, and thus does not necessarily make for more safety; the Chernobyl disaster was precipitated at least in part because they were on a tight schedule to test a fail-safe system). The book is fascinating and a real page-turner, hard to put down. Perrow's book is best combined with a reading of The Doomsday Machine: Confessions of a Nuclear War Planner, by Daniel Ellsberg.

    • I'm a retired neurosurgical anesthesiologist (38 years in practice). I read Perrow's book several years after it was published. I was struck by how relevant his points of failure were to the practice of anesthesiology, in particular the concept of the danger of tight coupling. I referred to this book over subsequent decades in my Grand Rounds presentations, but to my knowledge none of the residents or other attendings ever read it.

      Read a sample here: https://www.amazon.com/Normal-Accidents-Living-High-Risk-Tec...

  • Other books I’ve much enjoyed, when your interest is in structural or other failures:

    Why Buildings Fall Down: How Structures Fail by Matthys Levy and Mario Salvadori, a wide ranging history of structural failures of various kinds, and their causes.

    Ignition!: An Informal History of Liquid Rocket Propellants by John Drury Clark, a personal memoir from a senior researcher with many decades' experience developing rocket fuels - he is the proverbial Rocket Scientist. Most interesting, and amusing (in a morbid way), is the quite different culture of safety "back in the day" of this somewhat esoteric engineering/chemistry field.

    (okay, I'll stop now!)

  • I just had a conversation about engineers not understanding the need for grounding.

    I'm wondering if every generation has to relearn the basics for themselves through experience.

    Each generation has to make the same mistakes. Because book learning doesn't seem to do it for some things.

    • Sure. Even a history of safety success contributes to this. We haven't had an accident in 3000 days - what was dangerous about this job again? Also, what's this stupid policy for anyway? I've never seen anybody even come close to (non-dangerous-sounding fate) while working here.

      But probably the policy is in place because it used to happen before the policy was in place. It's just not obvious to people who have never seen the consequences before.


  • Learn about failures.

    Inviting Disaster: Lessons From the Edge of Technology was one of the texts for an aerospace class that I didn't take but friends did; honestly, you can just read the book.

    There are lots of frameworks for teaching safety and programs for compliance and such, but they are far too easy to cargo-cult if you don't appreciate safety and the need for safety culture and UNDERSTAND what failures look like.

    And when you really understand the need and how significant failures happened... "state of the art" tools and practices take a back seat, they can be useful but they're just tools. What you need is people developing the appropriate vision, and with that the right things tend to follow.

"It is difficult to get a man to understand something, when his salary depends on his not understanding it."

  • Isn't NASA run by the government? Why not pay people to do their job correctly?

    • The word "government" doesn't magically erase all the same individual & institutional incentives, ambitions, biases, & flaws that exist elsewhere.

      And sometimes, the extant magical belief that "government" is different & immune lets those same human factors be ignored until they feed bigger, slower disasters that everyone is afraid to admit, because (ostensibly) "we all did this together".

    • The role of for-profit companies and 'shareholder value' in explaining corporate bad behavior is highly overstated. The only profit that matters is the one at the individual level, i.e., compensation, which is a form of profit for the individual.

      Government employee or private corporation, it doesn't matter. To the actual humans they are the same, in that each provides a particular compensation tied to their decisions.

    • Is "Why not pay people to do their jobs correctly?" a way of voicing frustration with massive gov't incompetence? Or a way of saying that organizational incompetence is top-down?


    • Just because you pay people doesn't mean they do their job correctly.

      It just gives you the option of not paying them if they don't do their job correctly.

    • Because the grifters are running the show. The point is not to fly to orbit/moon/mars/whatever, but shovel taxpayer money to politically well connected large aerospace contractors.

[flagged]

  • What you have written here is pretty much exactly the contents of the article we are all commenting on.

  • Didn't they have a crash dummy in it last time? The data from Buster should be able to tell us whether the parachute worked or not.

  • Those astronauts don’t have anyone that loves them at home because no way in hell would any of my loved ones let me be a sacrificial turkey in a fully automated oven.

    • They do, but they are not in a position to judge. Same as with the Challenger crew, despite NASA and the astronauts saying, "we would not fly a vehicle we did not believe to be safe enough."

It is bound to happen again and again considering humans are so oblivious to safety.

  • > humans are so oblivious to safety

    It seems that in modern times, humans focus on safety almost to the exclusion of everything else. Alongside the more traditional salutations "godspeed" or "have a nice day", we're now even more likely to hear "drive safe" or "have a safe trip" or "be safe".

    We're very nearly paralyzed by insisting that everything must be maximally safe. Surely you've heard the mantra "...if it saves just one life...".

    The optimal amount of tragedy is not zero. It's correct that we should accept some risk. We just need to be up-front and recognize what the safety margins really are.

    • > We're very nearly paralyzed by insisting that everything must be maximally safe.

      Are we? People saying "have a safe trip" is pretty weak evidence.

      The counter evidence is just about everything else going on, at least in the US. Relaxed worker safety standards, weakened environmental protections, and generally moving as fast as possible.


    • Considering that driving (at least in the US) is a relatively unsafe means of travel compared to the alternatives, I can understand imploring someone to drive safe.


    • Our internal emotional thinking doesn't work very well with probabilities, so trying to reduce a probability all the way to zero is a very common fallacy, even when it is completely irrational.

    • I feel like all the responses to your comment sort of prove its point.

      As I was reading the post I was wondering along the same lines, whether this is different from before. Going to space is an inherently risky activity. It's always going to be easy to write the "this is not safe" think piece, where you can either say "I told you so" or "Whew, thankfully we made it this time!" afterwards. Things like this only happen when you accept some risk and people say "yes" and press forward.

      All that said, not all risk is equal, and I'm trying to understand if NASA is uniquely dysfunctional now and taking needless, incidental risks.

    • America has been craving safety since 9/11, and it has made cowards of everybody, so in some sense I would agree.

      But taking a risk regarding an unknown or to expand knowledge or actually accomplish something is one thing. Ignoring known and mitigable risks just to save money, save face, meet a deadline or please a bureaucrat is another.

      Anyway these clowns even fail your criterion, because by covering up the results of the first launch/experiment, they are not being up front about a risk.

      In my opinion this is a top-down, human hierarchy thing. CEOs and agency administrators create and set an organization's culture and expectations.

      The irony is that a faulty heat shield is an engineering challenge that real engineers would love to tackle; all you have to do is turn them loose on the problem and let them fix it. They live for that. I find it actually aesthetically offensive that the organization and its culture have instead taught them venal, circumspect careerism, which is cowardice of a different kind.

  • Maybe not so much "oblivious to safety" as "oblivious to probable risk." We worry too much about low-risk events (like airline flights) and don't worry enough about higher-risk events (like trips and falls, driving a car, poor diet...).

  • I wouldn't say humans are oblivious to safety. The Apollo program was very successful, as long as you're not related to Gus Grissom, Ed White, or Roger Chaffee. Those three (preventable) deaths aside, Apollo solved some huge problems.

    If you’re interested in a heck of a good read, the Columbia Accident Investigation Report is a good place to start:

    https://ehss.energy.gov/deprep/archive/documents/0308_caib_r...

    It looks at the safety culture at NASA and at how that safety culture ran into budget issues, time pressure, and a sense that 'it's always been okay'. But people were aware of the problems.

    There's a really frustrating example from Columbia where engineers on the ground badly wanted to inspect the shuttle's left wing using ground-based telescopes or any other available assets. There's an email that circulated in which an engineer all but begged anyone to take a look with anything. That request was not approved - they never looked.

    Realistically, there's a point to be made that NASA wasn't capable of saving those astronauts at that point. But they had another shuttle almost ready to go; they could have jettisoned its science load and possibly had a rescue of some sort available. They never looked, though, even as alarm bells were ringing.

    It's more accurate to say people are highly aware of safety, but when you get a bunch of us together and add in cognitive biases and promotion bands, we can get stuck in unsafe ruts.

    • I'd say it's more accurate to say the people who are actually smart work as engineers. Leadership is generally engineers who were better at office politics than engineering, or just business majors etc.

      So you have a group of really talented people using their talents to do awesome things, and then you have some useless idiots who are good at kissing the right asses, running the show and taking most of the credit. And that's how you end up killing astronauts, because the useless assholes in charge aren't even competent enough to recognize when they should listen to the brains of their operation. All they care about is looking good to their superiors and hitting some arbitrary deadline they've decided to set for no damn reason etc.

  • Then explain the Apollo program, and the actual printed literature that came out of the program that summarized how they were successful.

    • If you're looking for programs where mistakes were not made, Apollo is not the program to choose. I highly recommend visiting Kennedy Space Center some time, where they go in-depth on how close it came to never happening after Apollo 1. https://en.wikipedia.org/wiki/Apollo_1

      That being said, I'm a big proponent of "you can't make ICBMs carrying humans 100% safe", but you sure can try your best.

    • Apollo killed three astronauts. NASA learned some lessons from that and the rest of the program was safer, although still extremely risky.

  • Us humans do have difficulty with safety. Sometimes we are able to overcome that problem to an extent. Here are some of the few examples where humans have done well with safety: FAA-regulated commercial aviation, Soyuz, the ISS, Shinkansen trains, US nuclear power post-Three Mile Island, vaccines, and the Falcon 9.