Comment by matja

2 years ago

> which should prevent this problem if the hardware doesn't lie.

Or, one can take the ZFS approach and assume the hardware often lies :)

I do not know how zfs will overcome hardware lying. If it's going to fetch data that is in the drive's cache, how will it overcome the persistence problem?

  • It will at the very least notice that the data read back does not match the stored checksum and will not return the garbage to the application. In redundant (raidz) setups it will then read the data from another disk and repair the faulty copy. In a non-redundant setup (or if enough disks are corrupted) it will signal an I/O error.

    An error is preferred to silently returning garbage data!
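
    As a toy sketch of that read path (in-memory "disks" and SHA-256 standing in for ZFS's block pointers and checksums; none of these names are the real ZFS code):

        import hashlib

        # Toy model: each "disk" holds block -> bytes; checksums are kept
        # separately, the way ZFS stores them in the parent block pointer.
        disks = [{0: b"hello world"}, {0: b"hello world"}]
        checksums = {0: hashlib.sha256(b"hello world").digest()}

        def checked_read(block):
            bad_copies = []
            for i, disk in enumerate(disks):
                data = disk.get(block)
                if data is not None and hashlib.sha256(data).digest() == checksums[block]:
                    for j in bad_copies:          # self-heal the copies that failed
                        disks[j][block] = data
                    return data
                bad_copies.append(i)
            raise IOError("all copies failed the checksum")  # an error, never garbage

        disks[0][0] = b"hellp world"              # simulate silent corruption on disk 0
        assert checked_read(0) == b"hello world"  # served from the good copy on disk 1
        assert disks[0][0] == b"hello world"      # and disk 0 was repaired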

    • The "zeroed-out file" problem is not about firmware lying, though; it is about applications using fsync() wrongly or not at all. Look up the O_PONIES controversy.

      Sure, due to their COW nature zfs and btrfs provide better behavior despite broken applications. But you can't solve persistence in the face of lying firmware.
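
      For reference, the write-then-rename dance those broken applications skip looks roughly like this (a minimal sketch, error handling and permission preservation omitted):

          import os

          def atomic_replace(path, data):
              # Write the new contents to a temporary file, force it to stable
              # storage, then atomically swap it into place and persist the
              # directory entry. A crash leaves either the old or the new file,
              # never a zeroed-out one (assuming the hardware honours fsync).
              tmp = path + ".tmp"
              fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
              try:
                  os.write(fd, data)
                  os.fsync(fd)              # data must be durable before the rename
              finally:
                  os.close(fd)
              os.rename(tmp, path)          # atomic on POSIX filesystems
              dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
              try:
                  os.fsync(dirfd)           # persist the directory entry itself
              finally:
                  os.close(dirfd)

          atomic_replace("config.txt", b"new settings\n")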

      Even though zfs has some enhancements to avoid corrupting itself on such drives, if you run, for example, a database on top, all guarantees around commit durability go out the window.


    • As an aside, can you still get at the contents of a file whose checksum is bad with zfs? E.g. if it's a big database with its own checksums, you might want to run a db-level recovery on it.

  • Actual file data ends up in the same transaction group (txg) as metadata if both are changed within the same txg commit (triggered by an explicit flush, by the recordsize/buffer limit being reached, or by the txg commit timeout - 5 seconds by default). So if a write barrier violation caused by hardware lies is followed by an untimely loss of power, the checksums for those txg updates won't match, and on the next pool import the txgs are rolled back to the last valid one - which does not end up zeroing out extents of a file (as on xfs) or leaving a zero-length file (as on ext3/ext4).
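
    A toy model of that rewind-on-import behaviour (hypothetical layout; the real pool walks its uberblocks and verifies the whole block tree):

        import hashlib

        # Each entry is a txg with the checksum recorded at write time.
        txgs = [
            {"txg": 41, "payload": b"consistent state A"},
            {"txg": 42, "payload": b"consistent state B"},
            {"txg": 43, "payload": b"state C, barrier violated"},
        ]
        for t in txgs:
            t["checksum"] = hashlib.sha256(t["payload"]).digest()

        # Hardware lied about write ordering, then power was lost:
        # txg 43's data never actually made it to the platters intact.
        txgs[2]["payload"] = b"partial garbage"

        def import_pool(txgs):
            for t in sorted(txgs, key=lambda t: t["txg"], reverse=True):
                if hashlib.sha256(t["payload"]).digest() == t["checksum"]:
                    return t["txg"]       # newest txg that still verifies wins
            raise IOError("no valid txg found")

        assert import_pool(txgs) == 42    # txg 43 is rolled back, no zeroed files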