Comment by fulafel
2 years ago
I think this is typical behaviour with ext4 on Linux, if the application doesn't do fsync/fdatasync to flush the data to disk.
Depending on mount options, ext4 does metadata journaling, which ensures the filesystem itself isn't corrupted, but not data journaling, which would safeguard the file contents in the event of an unclean shutdown with writes still pending in the caches.
The same phenomenon is at play when people complain that their log files contain NUL bytes after a crash. The file system metadata has been updated for the size of the file to fit the appended write, but the data itself was not written out yet.
The current default is data=ordered, which should prevent this problem if the hardware doesn't lie. The data doesn't go in the journal, but it has to be written before the journal is committed.
There was a point where ext3 defaulted to data=writeback, which can definitely give you files full of null bytes.
And data=journal exists but is overkill for this situation.
It's likely because of delayed allocations (delalloc): https://issuetracker.google.com/issues/172227346#comment6
The only guarantee which data=ordered provides is the security guarantee that stale data won't be revealed.
Yes, it's bad and breaks prefix append consistency, and does not match the documentation...
For more context, that's a comment from one of the main ext4 authors, Ted Ts'o. His subsequent comment spells out the case further, but from skimming I didn't spot an explicit explanation of where the NUL bytes come from.
> which should prevent this problem if the hardware doesn't lie.
Or, one can take the ZFS approach and assume the hardware often lies :)
I don't know how ZFS would overcome hardware lying. If it's going to fetch data that is in the drive's cache, how will it overcome the persistence problem?
The "data" setting of ext filesystems isn't a replacement for fsync().
It's not a replacement but it can give you some guarantees.
Also fsync is a terrible API that should be replaced, but that's mostly a different topic.
My only n=1 observation is that NUL bytes in logs occur on NVMe, SSD, and spinning rust, all ext4 with defaults. I do have the impression it occurs more on NVMe drives, though. But maybe my systems' settings are just borked.
I don't think that's how it works: Flushing metadata before data would be a security concern (consider e.g. the metadata change of increasing a file's length due to an append before the data change itself), so file systems usually only ever do the opposite, which is safe.
Getting back zeroes after a metadata sync (which must follow a data sync) would accordingly be an indication of something weird having happened at the disk level: We'd expect to either see no data at all, or correct data, but not zeroes or any other file's or previously written stale data.
The file isn't stored contiguously on disk, so that would depend on the implementation of the filesystem. Perhaps the size of the file can be changed, without extents necessarily being allocated to cover the new size?
I seem to vaguely recall an issue like that, for ext4 in particular. Of course it's possible in general for any filesystem that supports holes, but I don't think we can necessarily assume that the data is always written, and all the pointers to it also written, before the file-size gets updated.
At least for ext4 and actually written data (i.e. not ftruncate’d files), I believe zeroes should really not occur.
Both extents and the file size are metadata as far as I understand, which would be atomically updated through the journal.
Data can be written before metadata (in data=ordered mode):
> All data are forced directly out to the main file system prior to its metadata being committed to the journal.
I think there could semi-reasonably be a case for the zero bytes appearing if the fs knows something should have been written there, and the block has been allocated, but the write hadn't happened yet. Then it doesn't compromise confidentiality to zero the allocated block when recovering the journal at mount time. But the zero-byte origin doesn't seem to be spelled out anywhere, so this is just off-the-cuff reasoning.
The file's size could have been set by the application before copying data to it. This will result in a file which reads all zeroes.
Or if it were a hardware ordering fault, remember that SSD TRIM is typically used by modern filesystems to reclaim unused space. TRIMmed blocks read as zero.
> The file's size could have been set by the application before copying data to it. This will result in a file which reads all zeroes.
Hm, is that a common approach? I thought applications mostly use fallocate(2) for that if it's for performance reasons, which (with FALLOC_FL_KEEP_SIZE) does not change the nominal file size.
Actually allocating zeroes sounds like it could be quite inefficient and confusing, but then again, fallocate is not portable POSIX.
> Or if it were a hardware ordering fault
That's what I suspect might be going on here.
Ext3 will totally let you expose yourself to those security issues. I'm not sure about ext4.
Only in data=writeback mode, which is not the default in either ext3 or ext4.