Comment by throwaway250612
1 day ago
I am an insider, hence the throw away account.
The root cause of this incident was leadership driving velocity by cutting corners. It has been going on for years, eventually over the cliff.
This specific failure mode is known as query of death. A query triggers an existing bug that causes the server to crash. It is inevitable for C++ servers.
Service Control is in C++. It uses a comprehensive set of engineering guidelines to minimize and tolerate query of death and other failure modes. Before this incident, it had no major incident in the previous decade.
This incident is related to a new global quota policy. It was built quickly under leadership pressure, cutting corners. Such features should be built in a secondary service, or at least following the established engineering guidelines.
Regarding the action items mentioned in the report, the established engineering guidelines far exceed them. The team has been keeping up with their standard as much as they can.
You can't pin this entirely on leadership, allowing global blast radius deployments without an extreme level of scrutiny is a failure of engineering culture.
At the very least the global policy should have been deployed prior to the regional service control deployments.
Engineering culture needs leadership and executive support to succeed and thrive. Blaming failures stemming from top down mandates and directives is unfair, because if the grunts don't follow orders they'll be given poor performance reviews or expediently managed out (aka Fired).