Comment by jacquesm

10 days ago

I'm aware of a big cloud services provider (I won't name any names but it was IBM) that lost a fairly large amount of data. Permanently. So that too isn't a guarantee. They simply should have made local and off-line backups (that's the gold standard) and ensured that those backups were complete and could be used to restore from scratch to a complete working service.
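
To make that last part concrete, the kind of check I have in mind looks roughly like this (a toy sketch with made-up paths, using a local tarball as a stand-in for whatever offline medium you actually use; a real drill restores the whole working service, not just the files):

    import hashlib, pathlib, tarfile, tempfile

    SRC = pathlib.Path("/srv/data")                    # hypothetical data directory
    BACKUP = pathlib.Path("/mnt/offline/data.tar.gz")  # hypothetical offline target

    def tree_hashes(root):
        # content hash of every file, keyed by path relative to the root
        return {
            str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()
        }

    # 1. take the backup
    with tarfile.open(BACKUP, "w:gz") as tar:
        tar.add(str(SRC), arcname="data")

    # 2. prove it restores: unpack somewhere else entirely and compare hashes
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(BACKUP) as tar:
            tar.extractall(scratch)
        restored = pathlib.Path(scratch) / "data"
        assert tree_hashes(SRC) == tree_hashes(restored), "restore does not match source"
        print("backup verified: restore matches source")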

>I'm aware of a big cloud services provider (I won't name any names but it was IBM) that lost a fairly large amount of data. Permanently. So that too isn't a guarantee.

Permanently losing data at a given store point isn't relevant to losing data overall. Data store failures are assumed, or else there'd be no point in backups. What matters is whether failures at multiple points happen at the same time, which means a major issue is whether "independent" repositories are actually truly independent or whether (and to what extent) they have some degree of correlation. Using one or more completely separate systems run by someone else entirely is a pretty darn good way to avoid accidental correlations with your own stuff, including human factors like the same tech people making the same sorts of mistakes or reusing the same components (software, hardware or both). For government that also includes political factors (like any push towards using purely domestic components).
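
To put toy numbers on why that correlation question dominates everything else (the probabilities below are invented purely for illustration):

    # Invented figures: annual chance each repository loses your data on its
    # own, versus a shared cause (same bug, same ops mistake, same reused
    # component) that takes out every copy at once.
    p_repo = 0.01      # 1% per-repository loss per year
    p_common = 0.001   # 0.1% correlated, shared-cause loss per year
    n_repos = 3        # e.g. self-hosted + IBM + AWS

    # Truly independent copies: every one of them has to fail on its own.
    p_independent = p_repo ** n_repos

    # With a shared cause, that one correlated event alone loses everything,
    # no matter how many more "independent" mirrors you bolt on.
    p_correlated = p_common + (1 - p_common) * p_repo ** n_repos

    print(f"independent mirrors: {p_independent:.1e}")   # 1.0e-06
    print(f"with a shared cause: {p_correlated:.1e}")    # ~1.0e-03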

>They simply should have made local and off-line backups

FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.

  • > Permanently losing data at a given store point isn't relevant to losing data overall.

    I can't reveal any details but it was a lot more than just a given storage point. The interesting thing is that there were multiple points along the way where the damage would have been recoverable, but their absolute incompetence made matters much worse, to the point where there were no options left.

    > FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.

    If you can't do the job you should get out of the kitchen.

    • >I can't reveal any details but it was a lot more than just a given storage point

      Sorry, brain not really clicking tonight and I used lazy, imprecise terminology here; it's been a long one. But what I meant by "store point" was any single data repository that can be interacted with as a unit, regardless of implementation details, as part of a holistic data storage strategy. So in this case the entirety of IBM would be a "storage point", your own self-hosted system would be another, and if you also had data replicated to AWS etc. those would be others. IBM (or any other cloud storage provider operating in this role) effectively might as well be another hard drive. A very big, complex and pricey magic hard drive that can scale its own storage and performance on demand, granted, but still a "hard drive".

      And hard drives fail, and that's ok. Regardless of the internal details of how the IBM-HDD ended up failing, the only way it'd affect the overall data is if that failure happened simultaneously with enough other failures at local-HDD and AWS-HDD and rsync.net-HDD and GC-HDD etc etc that it exceeded available parity to rebuild. If these are all mirrors, then only simultaneous failure of every single last one of them would do it. It's fine for every single last one of them to fail... just separately, with enough of a time delta between each one that the data can be rebuilt on another.
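
      A toy simulation of that point (every number here is invented; the only thing it's meant to show is that loss requires the downtime windows to overlap, not that any one copy is immortal):

        import random

        DAYS = 365 * 5           # 5-year horizon
        P_FAIL_PER_DAY = 0.005   # made-up daily chance a given mirror dies (deliberately high)
        REBUILD_DAYS = 3         # made-up time to re-replicate onto a replacement
        N_MIRRORS = 3            # e.g. local-HDD, IBM-HDD, AWS-HDD
        TRIALS = 5_000

        def trial():
            down_until = [0] * N_MIRRORS          # day each mirror is healthy again
            for day in range(DAYS):
                for m in range(N_MIRRORS):
                    if day >= down_until[m] and random.random() < P_FAIL_PER_DAY:
                        down_until[m] = day + REBUILD_DAYS
                if all(day < d for d in down_until):
                    return True                   # every copy down at once: data gone
            return False

        losses = sum(trial() for _ in range(TRIALS))
        print(f"estimated chance of total loss over 5 years: {losses / TRIALS:.2%}")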

      >If you can't do the job you should get out of the kitchen.

      Isn't that precisely what bringing in external entities as part of your infrastructure strategy is? You're not cooking in their kitchen.

    • In this context the entirety of IBM cloud is basically a single storage point.

      (If IBM was also running the local storage, then we're talking about a very different risk profile from "run your own storage, back up to a cloud", and the anecdote is worth noting but not directly relevant.)

DigitalOcean lost some of my files in their object storage too: https://status.digitalocean.com/incidents/tmnyhddpkyvf

Using a commercial provider is not a guarantee.

  • DO Spaces, for at least a year after launch, had no durability guarantees whatsoever. Perhaps they do now, but I wouldn’t compare DO in any meaningful way to S3, which has crazy high durability guarantees as well as competent engineering effort expended on designing and validating that durability.