Comment by rincebrain
1 day ago
IMO, it's that any defense, humans or automated, is imperfect, and life is a series of tradeoffs.
You can write as many unit tests as you want, plus integration tests that check the system behaves as expected on sample data, static analysis that screams if you're doing something visibly unsafe, staged rollouts from nightly builds to production, and so on. But eventually, at large enough scale, you're going to find a gap in those layered safety measures, and if you're unlucky, it's going to be a gap in all of them at once.
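(To put toy numbers on the "gap in all of them at once" part - this is my own back-of-the-envelope, not anything from the incident - assume each layer of defense independently catches 99% of bad changes:)

    # Toy model (my own assumptions): 3 independent layers, each catching 99% of bad changes.
    escape_per_layer = 0.01
    layers = 3
    p_escape_all = escape_per_layer ** layers        # chance a bad change slips past every layer
    risky_changes_per_year = 1_000_000               # hypothetical change volume at large scale
    expected_escapes = risky_changes_per_year * p_escape_all
    print(f"P(bad change slips past every layer) = {p_escape_all:.0e}")   # ~1e-06
    print(f"Expected escapes per year at this volume: {expected_escapes:.1f}")  # ~1.0

One in a million sounds fine until you're pushing enough changes that it still happens every year.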
It's the same reasoning from said book as why each additional nine is always going to involve much more work than the previous one - eventually you're doing things like running complete copies of your stack on stable builds from months ago and replaying all the traffic to them, just so you can fail over to them on a moment's notice. That also means you can't roll out new features until the backup copies support them, and that's a cost/benefit tradeoff nobody can justify once the service is large enough.
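(For a sense of why each nine gets so much more expensive, here's the standard downtime-budget arithmetic - nothing specific to this incident:)

    # Downtime budget per year implied by each availability target.
    MINUTES_PER_YEAR = 365.25 * 24 * 60
    for nines in range(2, 6):
        target = 1 - 10 ** -nines
        budget_minutes = MINUTES_PER_YEAR * (1 - target)
        print(f"{target:.5%} availability -> {budget_minutes:.1f} minutes of downtime per year")
    # prints roughly 5259.6, 526.0, 52.6, and 5.3 minutes for two through five nines

At five nines you get about five minutes a year to detect, diagnose, and mitigate everything combined, which is why the defenses start looking like permanently-warm replicas of the whole stack.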
Working on OpenZFS, I've seen a number of bugs come from things like "this code in isolation works as expected, but an edge case we didn't know about in data written 10 years ago came up", or "this range of Red Hat kernels from 3 Red Hat releases ago has buggy behavior, and since we test on the latest kernel of that release, we didn't catch it".
Eventually, if there's enough complexity in the system, you cannot feasibly test even all the variation you know about, so you make tradeoffs based on what gives you enough benefit for the cost.
(I'm an SRE at Google, not on any team related to this incident, all opinions unofficial/my own, etc.)