Comment by throwaway250612

6 months ago

I am an insider, hence the throwaway account.

The root cause of this incident was leadership driving velocity by cutting corners. It has been going on for years, and it eventually went over the cliff.

This specific failure mode is known as a query of death: a query triggers an existing bug that causes the server to crash. It is inevitable for C++ servers.
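For readers who have not run into the term, here is a minimal sketch of the pattern, written in Python for brevity. Every name in it (`check_quota`, `policy`, `serve_isolated`) is hypothetical and is not taken from Service Control. The Python exception stands in for a crash; in a C++ server the equivalent bug is typically a null-pointer dereference that terminates the whole process.

```python
# Hypothetical illustration only; none of these names come from Service Control.

def check_quota(query: dict) -> bool:
    """Latent bug: assumes every query carries a 'policy' dict."""
    policy = query.get("policy")  # None when the field is missing
    return query["usage"] <= policy["limit"]  # raises TypeError if policy is None


def serve_unprotected(queries):
    """One bad query aborts the whole loop, like a crashing server."""
    return [check_quota(q) for q in queries]


def serve_isolated(queries):
    """Contain the failure to the offending query and keep serving the rest."""
    results = []
    for q in queries:
        try:
            results.append(check_quota(q))
        except Exception:
            results.append(None)  # mark only this query as failed
    return results


queries = [
    {"usage": 1, "policy": {"limit": 10}},
    {"usage": 1},  # the "query of death": no policy field
    {"usage": 99, "policy": {"limit": 10}},
]
```

With these inputs, `serve_unprotected(queries)` stops at the second query, while `serve_isolated(queries)` still answers the first and third. Whether to reject or allow the failed query is a separate policy decision that this sketch leaves open.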

Service Control is written in C++. It uses a comprehensive set of engineering guidelines to minimize and tolerate queries of death and other failure modes. Before this incident, it had gone a decade without a major incident.

This incident is related to a new global quota policy. It was built quickly under leadership pressure, cutting corners. Such features should be built in a secondary service, or should at least follow the established engineering guidelines.

Regarding the action items mentioned in the report, the established engineering guidelines far exceed them. The team has been keeping up with that standard as much as it can.

3 comments

flaminHotSpeedo 6 months ago

You can't pin this entirely on leadership. Allowing deployments with a global blast radius without an extreme level of scrutiny is a failure of engineering culture.

At the very least the global policy should have been deployed prior to the regional service control deployments.

metadat 6 months ago
Engineering culture needs leadership and executive support to succeed and thrive. Blaming engineers for failures stemming from top-down mandates and directives is unfair, because if the grunts don't follow orders they'll be given poor performance reviews or expediently managed out (aka fired).
- flaminHotSpeedo 6 months ago
  
  To some extent, yes, strong leadership can squash engineering culture.
  But if engineers never push back, how is leadership supposed to know they're asking for dangerous things? For most engineers, that means explaining risks and/or engaging a more senior engineer. For L8 and up, it is absolutely their job to say no to leadership.
  And that's all ignoring the things that should have been done here and wouldn't have affected the timeline. Leadership doesn't care what order you deploy things in.