Comment by HardCodedBias
6 months ago
This was a shockingly simple coding error. It should have been caught by the two code reviewers.
Turns out that asking an LLM for a code review finds the error, and an LLM also suggests the correct fix.
Rarely do you see a major outage caused by such a glaring error. I suspect that policy changes will be required at GCP.
The SRE team was very fast: the logs were inspected quickly, and the check that was failing was identified within minutes. That's impressive.
But the coding error, that was shocking.
Do you have a link to the code in question?