Comment by SoftTalker

2 years ago

Honestly I've never worked anywhere that didn't have some kind of "war story" that was told about how some admin or programmer mistake resulted in the deletion of some vast swathe of data, and then the panic-driven heroics that were needed to recover.

It shouldn't happen, but it does, all the time, because humans aren't perfect, and neither are the things we create.

20after4 2 years ago

Sure, it's the tone and content of their response that is worrying, more than the fact that an incident happened. What's called for is an honest and transparent root cause analysis with technically sound and thorough mitigations, including changes in policy with regard to defaults. Their response seems like only the most superficial, bare-minimum approximation of an appropriate response to deleting a large customer's entire account. If I were on the incident response team I'd be strongly advocating for at least these additional changes:

Make deletes opt-in rather than opt-out. Make all large-scale deletions go through some review process with automated tests and a final human review. And not just by some low-level technical employee; the account managers should have seen this on their dashboard somewhere long before it happened. Finally, undertake a thorough and systematic review of other services to look for similar failure modes, especially with regard to anything which is potentially destructive and can conceivably be default-on in the absence of a supplied configuration parameter.
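To make the "opt-in" point concrete, here's a rough sketch of what I mean. This is not their actual system; `ProvisionConfig`, `plan_deletion`, `delete_after_days`, `human_approved` and the threshold are all names and values I made up for illustration. The idea is just that a missing parameter can never resolve to "delete", and that big deletes stall until a human signs off.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff for what counts as a "large-scale" deletion.
LARGE_DELETE_THRESHOLD = 1000


@dataclass(frozen=True)
class ProvisionConfig:
    # Hypothetical config: deletion has to be requested explicitly.
    # Leaving delete_after_days unset means "never delete".
    delete_after_days: Optional[int] = None
    # Set only after a human reviewer has approved the deletion.
    human_approved: bool = False


def plan_deletion(config: ProvisionConfig, resource_count: int) -> str:
    """Return "keep", "needs_review" or "delete" for a batch of resources."""
    if config.delete_after_days is None:
        # No parameter supplied: the safe default is to do nothing destructive.
        return "keep"
    if resource_count >= LARGE_DELETE_THRESHOLD and not config.human_approved:
        # Large deletions wait for explicit human sign-off.
        return "needs_review"
    return "delete"


if __name__ == "__main__":
    print(plan_deletion(ProvisionConfig(), 5000))
    print(plan_deletion(ProvisionConfig(delete_after_days=30), 5000))
    print(plan_deletion(ProvisionConfig(delete_after_days=30, human_approved=True), 5000))
```

The design choice that matters is that the destructive path needs two affirmative inputs (an explicit retention setting, plus approval when the blast radius is large), so a blank config fails safe.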