Comment by esprehn

2 days ago

I work on Cloud, but not this service. In general:

- All the code has unit tests and integration tests

- Binary and config file changes roll out slowly job by job, region by region, typically over several days. Canary analysis verifies these slow rollouts.

- Even panic rollbacks are done relatively slowly to avoid making the situation worse (for example, globally overloading databases with job restarts). A 40-minute outage is better than a 4-hour outage.

I have no insider knowledge of this incident, but my read of the postmortem is: The code was tested, but not this edge case. The quota policy config is not rolled out as a config file, but by updating a database. That database was configured for replication, which meant the change appeared in all the databases globally within seconds instead of applying job by job, region by region, like a binary or config file change would.
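To make the distinction concrete, here is a minimal sketch (hypothetical names, nothing like Google's actual tooling) of why a write to a replicated database skips the protection a staged binary/config rollout gets:

```go
package rollout

import (
    "fmt"
    "time"
)

var regions = []string{"us-east1", "us-west1", "europe-west1", "asia-east1"}

// Binary/config path: apply one region at a time, with a bake period during
// which canary analysis can halt the rollout before it reaches everyone.
func stagedRollout(apply func(region string) error, bake time.Duration) error {
    for _, r := range regions {
        if err := apply(r); err != nil {
            return fmt.Errorf("halting rollout at %s: %w", r, err)
        }
        time.Sleep(bake)
    }
    return nil
}

// Database path: a single write, and replication makes the new policy row
// visible in every region within seconds. There is no per-region staging and
// no bake period for canary analysis to act on.
func insertPolicy(exec func(query string, args ...any) error, blob []byte) error {
    return exec("INSERT INTO quota_policies (name, blob) VALUES (?, ?)", "new-policy", blob)
}
```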

I agree with the frustration about null pointers, though if this was a situation the engineers thought was impossible, it could just as easily have been an assert() in another language making every request fail policy checks.

Rewriting a critical service like this in another language seems far riskier than making sure all policy checks are flag-guarded, that all quota policy checks fail open, and that database changes roll out slowly, region by region.
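For illustration, a minimal sketch of what "flag-guarded and fail-open" could look like (hypothetical names; I obviously don't know what the real service's code does):

```go
package quota

type Policy struct {
    Limit int64
}

type Request struct {
    Usage int64
}

// newQuotaCheckEnabled stands in for a per-region feature flag that ramps up
// slowly and independently of the code push.
var newQuotaCheckEnabled = false

// Allow is flag-guarded and fails open: a missing or malformed policy is
// treated as "allow" (with an alert in a real system) rather than crashing
// or rejecting traffic.
func Allow(p *Policy, req Request) bool {
    if !newQuotaCheckEnabled {
        return true // new check path stays dark until the flag ramps up
    }
    if p == nil || p.Limit <= 0 {
        return true // fail open on a bad policy; keep serving
    }
    return req.Usage < p.Limit
}
```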

Disclaimer: this is all unofficial and my personal opinions.

> it could have just as likely been an assert() in another language

Asserts are much easier to forbid by policy.

  • That's fair, though `if (isInvalidPolicy) reject();` causes the same outage. So the engineering-process fix seems to be failing open plus slow rollouts, to catch that case too.
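    To put that in code (hypothetical types, same caveats as above): whether the "impossible" case panics or is explicitly rejected, callers see every request denied, and only failing open changes the outcome.

    ```go
    package quota

    import "errors"

    type Policy struct{ Limit int64 }
    type Request struct{ Usage int64 }

    // Fails closed implicitly: a nil policy dereference panics and the
    // serving job crash-loops.
    func checkCrash(p *Policy, req Request) bool {
        return req.Usage < p.Limit
    }

    // Fails closed explicitly: a nil policy rejects every request.
    // Different mechanism, same outage from the caller's point of view.
    func checkReject(p *Policy, req Request) (bool, error) {
        if p == nil {
            return false, errors.New("invalid quota policy")
        }
        return req.Usage < p.Limit, nil
    }
    ```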

> The code was tested, but not this edge case.

so... it wasn't tested

  • So you need to write a test for every single possible case to consider your code tested?

    • I don’t think it’s unreasonable to hold GCP & co to a higher standard than run of the mill software. The platform generates around $100mn per day for Google with a 10% operating margin.

      Testing the code paths that will inevitably be exercised, and that can bring the platform down as soon as they’re deployed, doesn’t seem like an unreasonable ask for a corporation operating at that scale.


How is the fact that it was a database change and not a binary or a config supposed to make it OK? A change is a change; global changes that go everywhere at once are a recipe for disaster, no matter what kind of change we're talking about. This is a second CrowdStrike.

  • This is the core point. A canary deployment that isn't preceded by deploying the data that exercises the code path in question proves nothing useful at all, while promoting a false sense of security.

    • The root problem is that a dev team didn't appropriately communicate criteria for testing their new feature.

      Which definitely seems like a shortcut: 'on to the next thing, and ignore QA due diligence.'


> Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded

So like, the requirements are unknown? Or this service isn't critical enough to staff a careful migration?