← Back to context

Comment by dehrmann

1 day ago

I recently started as a GCP SRE. I don't have insider knowledge about this, and my views on it are my own.

The most important thing to look at is how much had to go wrong for this to surface. It had to be a bug without test coverage that wasn't covered by staged rollouts or guarded by a feature flag. That essentially means a config-in-db change. Detection was fast, but rolling out the fix was slow out of fear of making things worse.

The NPE aspect is less interesting. It could have been any number of similar "this can't happen" errors. It could have been mutually exclusive fields are present in a JSON object, and the handling logic does funny things. Validation during mutation makes sense, but the rollout strategy is more important since it can catch and mitigate things you haven't thought of.