Comment by mumblemumble
5 years ago
> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
I'm so glad to see that they framed this in terms of a bug in a tool designed to prevent human error, rather than simply blaming it on human error.
> I'm so glad to see that they framed this in terms of a bug in a tool designed to prevent human error, rather than simply blaming it on human error.
Wouldn't human error reflect extremely poorly on the company though? I mean, for human error to be the root cause of this mega-outage, that would imply that the company's infrastructure, operations, and security practices were so ineffective that a single person's mistake could inadvertently bring the whole company down.
A freak accident that requires all the stars to align to even be possible, on the other hand, does not cause a lot of concern.
Organisations have a bad habit of using "human error" to pin systemic problems, whose true root cause is inadequate leadership, on individual low-level employees. So, we're glad to see Facebook didn't try this shitty practice.
For a modern example, look for information on Symantec's "A tough day as leaders" post, in which they tried to blame an incident that was clearly the result of, at minimum, incompetence by senior management on a single person they had just fired. This is part of the sequence of events that led to Symantec no longer being a trusted root CA. You won't find the actual post by Symantec because (of course) once they realised it wasn't doing what they wanted, they deleted it, but you can find copies and references to it.
For much older examples, look at the early history of the railway in most of the world. Train crashes, blame the train driver (often dead in the crash and thus unable to defend themselves), hint that they may have been drunk and were certainly incompetent. Owners carry on profiting from an unsafe railway and needn't spend any money making it safer.
Boeing's initial response to the 737MAX crashes comes to mind as well.
So, not human error, but inadequate leadership, which is also a human error.
In other words, not human error but human error.
I mean, sure? mumblemumble is still right though. If you're looking for a cynical reason for everything FB-related, then, sure, it's true that a human error looks bad.
Human error is a cop-out excuse anyway, since it's not something you can fix going forward. Humans err, and if a mistake was made once it could easily be made again.
To err is human. To really fsk things up you need a computer
To err requires a computer. To really fsk things up requires automation.
The buggy audit tool was probably made by a human too, though.
But reviewed by other humans. At some count, a collective human error becomes a system error.
There actually is a really good tool for auditing such systems.
https://learntla.com/introduction/
I discovered it here on HN just recently, in a comment on a new tool in the same problem space.
System complexity is just a way to avoid blaming individual humans when an error occurs.
- Me, 2021
It's humans all the way down!
something something soylent green
Of course the system was built by humans, but we are discussing the proximate cause of the outage.
"hey let's try this github copilot thingy to write an audit tool"
No they used GitHub Copilot
I wouldn't be surprised if that tool was a shell script with a mistyped conditional somewhere; I really dislike shell scripting.
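Purely as an illustration of that failure mode (the actual tool and its bug were never disclosed; the variable names and thresholds here are invented), a classic mistyped shell conditional is using `-o` (OR) where `-a` (AND) was meant, which makes the check pass unconditionally:

```shell
#!/bin/sh
# Hypothetical audit check: only allow the command if it
# would affect fewer than 5 routers.
routers_affected=20

# Bug: -o means OR, so the test passes whenever the first
# clause holds, i.e. for ANY non-negative count.
# The author meant -a (AND), or two separate checks.
if [ "$routers_affected" -ge 0 -o "$routers_affected" -lt 5 ]; then
  echo "audit passed"    # prints even for 20 routers
else
  echo "audit blocked"
fi
```

A one-character typo like this survives a casual read and every happy-path test, which is exactly why `[` one-liners make for nerve-wracking safety gates.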
As opposed to what? Sixteen pages of boilerplate Java/Python?
I wouldn't conflate Java and Python in the boilerplate camp. Python can be very boilerplate-y, but that tends to happen only in the hands of Java developers.
That said, even clean, idiomatic Python isn't as terse as sh. It also isn't as terse as perl. Many would argue that's a good thing. The optimum point for readability isn't found at either of the extremes. Not entirely unlike how the most readable way of writing English is neither shorthand nor blackletter.