Comment by forinti

9 days ago

What structure could possibly preclude backups? I've never seen anything that couldn't be copied elsewhere.

Maybe it was just convenient to have the possibility of losing everything.

I think they alluded to that earlier in the article:

>However, due to the system’s large-capacity, low-performance storage structure, no external backups were maintained — meaning all data has been permanently lost.

I think they decided that their storage was too slow to allow backups?

It seems hard to believe that they couldn't manage any backups... other sources said they had around 900 TB of storage. An LTO-9 tape holds ~20 TB uncompressed, so they could have backed up the entire system with 45 tapes. At 300 MB/s with a single drive, a full backup would take about a month, so it seems like even a slow storage system should be able to keep up with that rate. They'd have a backup that's always a month out of date, but that seems better than no backup at all.
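A quick sanity check of that arithmetic. The 900 TB, 20 TB/tape, and 300 MB/s figures are the ones quoted in this thread, not from the article, and decimal units are assumed:

```python
# Back-of-the-envelope check of the tape backup figures quoted above.
# Assumptions: 900 TB total, ~20 TB per LTO-9 tape (native capacity),
# ~300 MB/s sustained single-drive throughput, 1 TB = 1,000,000 MB.
import math

TOTAL_TB = 900
TAPE_CAPACITY_TB = 20
DRIVE_THROUGHPUT_MB_S = 300

tapes_needed = math.ceil(TOTAL_TB / TAPE_CAPACITY_TB)
seconds = TOTAL_TB * 1_000_000 / DRIVE_THROUGHPUT_MB_S
days = seconds / 86_400

print(f"tapes needed for one full backup: {tapes_needed}")   # -> 45
print(f"single-drive full backup time: {days:.1f} days")     # -> ~34.7 days
```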

  • Too slow to allow batched backups. Which means you should just make redundant copies at the time of the initial save. Encrypt a copy and send it offsite immediately.

    If your storage performance is low then you don't need fat pipes to your external provider either.

    Either they built this too quickly, or there was too much industry corruption perverting the process and the government bought an off-the-shelf solution that was inadequate for their actual needs.

  • Let's run the numbers:

    LTO-9 runs ~$92/tape in bulk. A 4-drive library with 80-slot capacity costs ~$40k* and can sustain about 1 Gbps. It also needs someone to barcode, inventory, and swap tapes once a week, plus an off-site vaulting provider like Iron Mountain; that's another $100k/year. That tape library will also need to be replaced every 4-7 years, so say 6 years, and the tapes wear out after a limited number of uses and sometimes just go bad. It might also require buying a server and/or backup/DR software. Furthermore, a fire-rated data safe is recommended for about 1-2 weeks' worth of backups and spare media. Budget at least $200k/year for off-site tape backups for a minimal operation (rough tally in the sketch after this subthread). (Let me tell you about the pains of self-destructing Sony SSL2020 AIT-2 drives.)

    If backups for other critical services were combined with this one, it would probably be cheaper to scale that shared service rather than reinventing the wheel for a single use-case in one department. That would also allow for optimizations like network-based backups to nearline storage that is then streamed more directly to tape, using many more tape drives, possibly a tape silo robot (or several), and perhaps splitting across 2-3 backup locations, obviating the need for off-site vaulting.

    Furthermore, it might be simpler, although more expensive, to operate another hot-/warm-site for backups and temporary business-continuity restoration, using a pile of HDDs and a network connection that's probably faster than that tape library. (Use backups, not replication, because replication happily copies errors to the other sites too.)

    Or the easiest option is to use one or more cloud vendors for even more $$$ (build vs. buy tradeoff).

    * Traditionally (~20 years ago), enterprise gear was sold at "retail" prices with around a 100% markup, allowing for discounts of up to around 50% when large orders were negotiated. Enterprise gear also had a lifecycle of around 4.5 years: while it might still technically work after that, there wouldn't be vendor support or replacements, so enterprise customers are locked into perpetual planned-obsolescence consumption cycles.

    • $500K/year to back up a system used by 750,000 people works out to about $0.67 per user per year. Practically free.

      At least now they see the true cost of not having any off-site backups. It's a lot more than $0.67 per user.
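For what it's worth, here is a rough tally of the figures quoted in this subthread. All numbers (tape price, library price, budget range, user count) are the commenters' ballpark estimates, not vendor quotes:

```python
# Quick tally of the cost figures quoted in this subthread.
import math

TOTAL_TB = 900
TAPE_CAPACITY_TB = 20       # LTO-9 native
TAPE_COST_USD = 92          # ~$92/tape in bulk
LIBRARY_COST_USD = 40_000   # 4-drive, 80-slot library
USERS = 750_000

tapes_per_full_set = math.ceil(TOTAL_TB / TAPE_CAPACITY_TB)
media_cost = tapes_per_full_set * TAPE_COST_USD

print(f"tapes per full backup set: {tapes_per_full_set}")    # 45
print(f"media for one full set:   ~${media_cost:,}")         # ~$4,140
print(f"library hardware:         ~${LIBRARY_COST_USD:,}")

# Per-user cost for the annual budget range discussed above.
for annual_budget in (200_000, 500_000):
    print(f"${annual_budget:,}/yr over {USERS:,} users -> "
          f"${annual_budget / USERS:.2f} per user per year")
# -> $0.27 and $0.67 per user per year
```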

Basically it all boils down to budget. The engineers knew this was a problem and wanted to fix it, but that costs money. And you know, the bean counters in the treasury are basically like, "well, it works fine, why do we need that fix?", and the last conservative government was in full spending-cut mode. You know what happened there.

A key metric for recovery is the time it takes to read or write an entire drive (or drive array) in full. This is simply capacity divided by bandwidth, and it has been getting worse and worse: drive capacities have grown exponentially while throughput hasn't kept up at the same pace.

A typical drive from two decades ago (circa 2005) might have been 0.5 TB with a throughput of 70 MB/s, for a full-drive transfer time (FDTT) of about 2 hours. A modern 32 TB drive is 64x bigger but has a throughput of only 270 MB/s, which is less than 4x higher. Hence the FDTT is about 33 hours!

This is the optimal scenario; things get worse in modern high-density disk arrays that may have 50 drives in a single enclosure with as little as 8-32 Gbps (1-4 GB/s) of effective bandwidth. That can push FDTT times out to many days or even weeks.
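A small sketch of that full-drive transfer time math; the drive and enclosure figures are the ones quoted in these comments:

```python
# Full-drive transfer time (FDTT) = capacity / sustained throughput.
# Drive and enclosure numbers are taken from the comments above.

def fdtt_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Hours to read or write the device end to end at the given rate."""
    return capacity_tb * 1_000_000 / throughput_mb_s / 3600

print(f"2005-era 0.5 TB drive @ 70 MB/s : {fdtt_hours(0.5, 70):.1f} h")   # ~2 h
print(f"modern 32 TB drive @ 270 MB/s   : {fdtt_hours(32, 270):.1f} h")   # ~33 h

# 50 x 32 TB drives behind an enclosure uplink of 1-4 GB/s effective bandwidth:
for gb_per_s in (1, 4):
    days = fdtt_hours(50 * 32, gb_per_s * 1000) / 24
    print(f"1.6 PB enclosure @ {gb_per_s} GB/s    : {days:.1f} days")
# -> ~18.5 days at 1 GB/s, ~4.6 days at 4 GB/s
```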

I've seen storage arrays where the drive trays were daisy chained, which meant that while the individual ports were fast, the bandwidth per drive would drop precipitously as capacity was expanded.

It's a very easy mistake to just keep buying more drives, plugging them in, and never going back to the whiteboard to rethink the HA/DR architecture and timings. The team doing this kind of BAU upgrade/maintenance is not the team that designed the thing originally!

It's Korea, so most likely it was fear of annoying the higher-ups when seeking approvals.

Koreans are weird; for example, they would rather eat a contractual penalty than report problems to the boss.