Comment by Dylan16807

2 years ago

The current default is data=ordered, which should prevent this problem if the hardware doesn't lie. The data doesn't go in the journal, but it has to be written before the journal is committed.

There was a point where ext3 defaulted to data=writeback, which can definitely give you files full of null bytes.

And data=journal exists but is overkill for this situation.

It's likely because of delayed allocation (delalloc): https://issuetracker.google.com/issues/172227346#comment6

The only guarantee data=ordered actually provides is the security guarantee that stale data won't be revealed.

Yes, it's bad, it breaks prefix-append consistency, and it doesn't match the documentation...
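
The usual workaround on the application side is the classic write-temp/fsync/rename dance, so a crash leaves either the old or the new contents rather than a zero-filled or empty file. A minimal sketch in C (file names and the single-write assumption are just for illustration, short-write handling elided):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace `path` atomically with `len` bytes from `buf`.
     * The fsync() before rename() closes the delalloc window where the
     * rename (metadata) could be committed before the data blocks hit disk. */
    static int replace_file(const char *path, const char *tmp,
                            const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        if (rename(tmp, path) != 0)
            return -1;
        /* Optional: fsync the parent directory so the rename itself is durable. */
        int dfd = open(".", O_RDONLY | O_DIRECTORY);
        if (dfd >= 0) {
            fsync(dfd);
            close(dfd);
        }
        return 0;
    }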

> which should prevent this problem if the hardware doesn't lie.

Or, one can take the ZFS approach and assume the hardware often lies :)

  • I do not know how ZFS will overcome hardware lying. If it's going to fetch data that is in the drive's cache, how will it overcome the persistence problem?

    • It will at the very least notice that the read data does not match the stored checksum and not return the garbage data to the application. In redundant (raidz) setups it will then read the data from another disk, and update the faulty disk. In a non-redundant setup (or if enough disks are corrupted) it will signal an IO error.

      An error is preferred to silently returning garbage data!

    • Actual file data ends up in the same transaction group (txg) as the metadata if both are changed within the same txg commit (triggered by an explicit flush, by the recordsize/buffer limit being reached, or by the txg commit timeout, 5 seconds by default). So if a write-barrier violation caused by hardware lies is followed by an untimely loss of power, the checksums for the txg updates won't match and the pool is rolled back to the last valid txg on import. That doesn't end up zeroing out extents of a file (as on XFS) or leaving a zero file size (as on ext3/ext4).

The "data" setting of ext filesystems isn't replacement for fsync().

  • It's not a replacement but it can give you some guarantees.

    Also fsync is a terrible API that should be replaced, but that's mostly a different topic.

    • At least on Linux you can use io_uring to make fsync asynchronous. And you can initiate some preparatory flushing with sync_file_range and only do the final commit with fsync to cut down the latency.
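
      A minimal sketch of that second trick (a hypothetical chunked log writer, error handling elided): kick off writeback per chunk with sync_file_range, then pay only the residual wait and cache flush in the final fsync.

          #define _GNU_SOURCE          /* for sync_file_range() */
          #include <fcntl.h>
          #include <unistd.h>

          void write_chunks(int fd, const char *buf, size_t chunk, int nchunks)
          {
              off_t off = 0;
              for (int i = 0; i < nchunks; i++) {
                  write(fd, buf, chunk);
                  /* Start asynchronous writeback of this chunk; does not
                   * wait and does not flush the drive's volatile cache. */
                  sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
                  off += chunk;
              }
              /* The actual durability point: waits for outstanding writeback
               * and issues the cache flush. */
              fsync(fd);
          }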

My only n=1 observation is that null bytes in logs occur on NVMe, SSD, and spinning rust, all ext4 with defaults. I do have the impression it occurs more on NVMe drives, though. But maybe my system's settings are just borked.