Comment by jacquesm

9 days ago

> Permanently losing data at a given store point isn't relevant to losing data overall.

I can't reveal any details but it was a lot more than just a given storage point. The interesting thing is that there were multiple points along the way where the damage would have been recoverable but their absolute incompetence made matters much worse to the point where there were no options left.

> FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.

If you can't do the job you should get out of the kitchen.

5 comments

jacquesm

xoa 9 days ago

>I can't reveal any details but it was a lot more than just a given storage point

Sorry, not brain not really clicking tonight and used lazy imprecise terminology here, been a long one. But what I meant by "store point" was any single data repository that can be interacted with as a unit, regardless of implementation details, that's part of a holistic data storage strategy. So in this case the entirety of IBM would be a "storage point", and then your own self-hosted system would be another, and if you also had data replicated to AWS etc those would be others. IBM (or any other cloud storage provider operating in this role) effectively might as well simply be another hard drive. A very big, complex and pricey magic hard drive that can scale its own storage and performance on demand granted, but still a "hard drive".

And hard drives fail, and that's ok. Regardless of the internal details of how the IBM-HDD ended up failing, the only way it'd affect the overall data is if that failure happened simultaneously with enough other failures at local-HDD and AWD-HDD and rsync.net-HDD and GC-HDD etc etc that it exceeded available parity to rebuild. If these are all mirrors, then only simultaneous failure of every single last one of them would do it. It's fine for every single last one of them to fail... just separately, with enough of a time delta between each one that the data can be rebuilt on another.

>If you can't do the job you should get out of the kitchen.

Isn't that precisely what bringing in external entities as part of your infrastructure strategy is? You're not cooking in their kitchen.

jacquesm 9 days ago
Ah ok, clear. Thank you for the clarification. Some more interesting details: the initial fault was triggered by a test of a fire suppression system, that would have been recoverable. But someone thought they were exceedingly clever and they were going to fix this without any downtime and that's when a small problem became a much larger one, more so when they found out that their backups were incomplete. I still wonder if they ever did RCA/PM on this and what their lessons learned were. It should be a book sized document given how much went wrong. I got the call after their own efforts failed by one of their customers and after hearing them out I figured this is not worth my time because it just isn't going to work.
- xoa 9 days ago
  
  Thanks in turn for the details, always fascinating (and useful for lessons... even if not always for the party in question dohoho) to hear a touch of inside baseball on that kind of incident.
  >But someone thought they were exceedingly clever and they were going to fix this without any downtime and that's when a small problem became a much larger one
  The sentence "and that's when a small problem became a big problem" comes up depressingly frequently in these sorts of post mortems :(. Sometimes sort of feels like, along all the checklists and training and practice and so on, there should also simply be the old Hitchhiker's Guide "Don't Panic!" sprinkled liberally around along with a dabbing of red/orange "...and Don't Be Clever" right after it. We're operating in alternate/direct law here folks, regular assumptions may not hold. Hit the emergency stop button and take a breath.
  But of course management and incentive structures play a role in that too.

Dylan16807 9 days ago

In this context the entirety of IBM cloud is basically a single storage point.

(If IBM was also running the local storage then we're talking about a very different risk profile from "run your own storage, back up to a cloud" and the anecdote is worth noting but not directly relevant.)

hedora 9 days ago

If that’s the case, then they should make it clear they don’t provide data backup.
A quick search reveals IBM does still sell backup solutions, including ones that backup from multiple cloud locations and can restore to multiple distinct cloud locations while maintaining high availability.
So, if the claims are true, then IBM screwed up badly.