Comment by xoa

9 days ago

>I'm aware of a big cloud services provider (I won't name any names but it was IBM) that lost a fairly large amount of data. Permanently. So that too isn't a guarantee.

Permanently losing data at a given store point isn't relevant to losing data overall. Data store failures are assumed, or else there'd be no point in backups. What matters is whether failures at multiple points happen at the same time, which means a major issue is whether "independent" repositories are truly independent or whether (and to what extent) they have some degree of correlation. Using one or more completely separate systems built and run by someone else entirely is a pretty darn good way to bury accidental correlations with your own stuff, including human factors like the same tech people making the same sorts of mistakes or reusing the same components (software, hardware or both). For government that also includes political factors (like any push towards using purely domestic components).
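
To put a rough number on the correlation point, here's a back-of-the-envelope sketch in Python. The probabilities are invented purely for illustration, not real failure rates; the shape of the result is the point:

    # Back-of-the-envelope: chance of losing data across N copies in some window.
    # All numbers are invented for illustration, not real failure rates.
    p_single = 0.01   # chance any one repository fails on its own
    p_shared = 0.001  # chance of a shared cause (same bug, same ops mistake)
    n_copies = 3

    # Truly independent copies: every copy has to fail on its own.
    p_loss_independent = p_single ** n_copies

    # Correlated copies: one shared cause can take out all of them at once.
    p_loss_correlated = p_shared + (1 - p_shared) * p_single ** n_copies

    print(f"independent copies: {p_loss_independent:.2e}")  # ~1e-06
    print(f"with shared cause:  {p_loss_correlated:.2e}")   # ~1e-03, dominated by the shared cause

Even a small shared failure mode swamps the math, which is the whole argument for a repository built and operated by someone else entirely.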

>They simply should have made local and off-line backups

FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.

> Permanently losing data at a given store point isn't relevant to losing data overall.

I can't reveal any details but it was a lot more than just a given storage point. The interesting thing is that there were multiple points along the way where the damage would have been recoverable but their absolute incompetence made matters much worse to the point where there were no options left.

> FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.

If you can't do the job you should get out of the kitchen.

  • >I can't reveal any details but it was a lot more than just a given storage point

    Sorry, brain not really clicking tonight and I used lazy, imprecise terminology here; it's been a long one. But what I meant by "store point" was any single data repository that can be interacted with as a unit, regardless of implementation details, that's part of a holistic data storage strategy. So in this case the entirety of IBM would be one "storage point", your own self-hosted system would be another, and if you also had data replicated to AWS etc those would be others. IBM (or any other cloud storage provider operating in this role) effectively might as well simply be another hard drive. A very big, complex and pricey magic hard drive that can scale its own storage and performance on demand, granted, but still a "hard drive".

    And hard drives fail, and that's ok. Regardless of the internal details of how the IBM-HDD ended up failing, the only way it'd affect the overall data is if that failure happened simultaneously with enough other failures at local-HDD and AWS-HDD and rsync.net-HDD and GC-HDD etc etc that it exceeded available parity to rebuild. If these are all mirrors, then only simultaneous failure of every single last one of them would do it. It's fine for every single last one of them to fail... just separately, with enough of a time delta between each one that the data can be rebuilt on another (rough numbers sketched at the end of this comment).

    >If you can't do the job you should get out of the kitchen.

    Isn't that precisely what bringing in external entities as part of your infrastructure strategy is? You're not cooking in their kitchen.
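
    To sketch the "fail separately, with enough of a time delta" point in rough numbers, here's a naive Python estimate. The failure rate and rebuild window are made up for illustration; a real estimate would need actual failure and rebuild data:

        # Naive sketch: chance that every mirror is down at the same moment.
        # Rates and rebuild time are invented for illustration only.
        failure_rate_per_year = 0.05  # per-copy chance of failing in a given year
        rebuild_days = 7              # how long a failed copy stays unrecovered
        n_mirrors = 3

        # Fraction of the year any single copy is expected to be unavailable.
        downtime_fraction = failure_rate_per_year * (rebuild_days / 365)

        # With full mirrors, data is only gone if every copy is down at once.
        p_all_down_together = downtime_fraction ** n_mirrors

        print(f"per-copy downtime fraction: {downtime_fraction:.4%}")              # ~0.1%
        print(f"all {n_mirrors} mirrors down together: {p_all_down_together:.1e}")  # ~8.8e-10

    Which is why staggered failures are fine and only simultaneous ones kill you, and why hidden correlation between copies matters far more than any single repository's reliability.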

    • Ah ok, clear. Thank you for the clarification. Some more interesting details: the initial fault was triggered by a test of a fire suppression system, which by itself would have been recoverable. But someone thought they were being exceedingly clever and could fix it without any downtime, and that's when a small problem became a much larger one, more so when they found out that their backups were incomplete. I still wonder if they ever did an RCA/PM on this and what their lessons learned were. It should be a book-sized document given how much went wrong. I got the call from one of their customers after their own efforts failed, and after hearing them out I figured it wasn't worth my time because it just wasn't going to work.


  • In this context the entirety of IBM cloud is basically a single storage point.

    (If IBM was also running the local storage, then we're talking about a very different risk profile from "run your own storage, back up to a cloud", and the anecdote is worth noting but not directly relevant.)

    • If that’s the case, then they should make it clear they don’t provide data backup.

      A quick search reveals IBM does still sell backup solutions, including ones that back up from multiple cloud locations and can restore to multiple distinct cloud locations while maintaining high availability.

      So, if the claims are true, then IBM screwed up badly.