Comment by danarmak

2 years ago

It will at the very least notice that the read data does not match the stored checksum and not return the garbage data to the application. In redundant (raidz) setups it will then read the data from another disk, and update the faulty disk. In a non-redundant setup (or if enough disks are corrupted) it will signal an IO error.

An error is preferred to silently returning garbage data!
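To illustrate what that looks like from the application's side, here is a minimal sketch (the file path is hypothetical): on a pool with no good copy left to repair from, the read comes back as an error instead of wrong bytes.

    /* Sketch: how an application sees ZFS refusing to return bad data.
     * On a non-redundant pool, reading a block whose checksum fails
     * produces EIO rather than garbage bytes. The path is hypothetical. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        int fd = open("/tank/data/records.db", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0 && errno == EIO) {
            /* Checksum mismatch with no redundant copy: the read fails
             * instead of silently handing back corrupted data. */
            fprintf(stderr, "read failed: %s\n", strerror(errno));
        }
        close(fd);
        return 0;
    }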

The "zeroed-out file" problem is not about firmware lying, though; it is about applications using fsync() wrongly or not at all. Look up the O_PONIES controversy.
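For reference, the pattern the O_PONIES debate was about looks roughly like this; a minimal sketch, with hypothetical file names. Write the new contents to a temporary file, fsync() it, and only then rename() it over the old file, so a crash leaves either the old contents or the new contents, never a zero-length file.

    /* Sketch of the write-then-rename pattern; skipping the fsync() is the
     * "wrongly or not at all" usage mentioned above. File names are
     * hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *tmp_path,
                     const char *data, size_t len) {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }

        /* Force the new contents to stable storage BEFORE the rename. */
        if (fsync(fd) < 0) { close(fd); return -1; }
        if (close(fd) < 0) return -1;

        /* Atomically swap the new file into place. */
        return rename(tmp_path, path);
    }

    int main(void) {
        const char *data = "setting=1\n";
        return replace_file("app.conf", "app.conf.tmp", data, strlen(data)) ? 1 : 0;
    }

A fully robust version would also fsync() the containing directory after the rename so the directory entry itself is durable; the point here is just the write-before-rename ordering.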

Sure, due to their COW nature zfs and btrfs provide better behavior despite broken applications. But you can't solve persistence in the face of lying firmware.

Even though zfs has some enhancements to avoid corrupting itself on such drives, if you run, for example, a database on top, all guarantees around commit go out the window.

  • "Renaming a file should always happen-after pending writes to that file" is not a big pony. I think it's a reasonable request even in the absence of fsync.

    • Well, for one, rename() is not always meant to be durable. It can also be used for IPC; for example, some mail servers use it to move mail between queues (see the sketch below this thread). Flushing before every rename is unexpected in that situation.

      Fun fact: rename() is atomic with respect to running applications per POSIX; that the on-disk rename is also atomic is only incidental.

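To make the IPC point above concrete, here is a minimal sketch of a maildir-style queue hand-off, with hypothetical directory names: the message was already made durable when it was first queued, so moving it between queues only relies on rename() being atomic with respect to running processes, and forcing a disk flush on every such rename would just add latency.

    /* Sketch of rename() used purely as IPC, with hypothetical queue
     * directories. The consumer scans "active/"; the producer hands a
     * message over by renaming it out of "incoming/". Atomicity with
     * respect to running processes is the property relied on here, not
     * durability, so no fsync() is involved. */
    #include <stdio.h>

    int hand_off(const char *msg_id) {
        char src[256], dst[256];
        snprintf(src, sizeof src, "incoming/%s", msg_id);
        snprintf(dst, sizeof dst, "active/%s", msg_id);

        /* Readers scanning active/ see either the whole message or
         * nothing; never a partially moved entry. */
        return rename(src, dst);
    }

    int main(void) {
        if (hand_off("msg-0001") != 0) {
            perror("rename");
            return 1;
        }
        return 0;
    }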

As an aside, can you still get the bad-checksum file contents out of zfs? E.g. if it's a big database with its own checksums, you might want to run a db-level recovery on it.