Comment by _urga

8 years ago

I agree with using O_DSYNC to surface the error at the write() call, rather than waiting for the fsync() call, whose return value is often not checked by the user.

I did some testing recently [1] with O_DIRECT + O_DSYNC and found some surprising performance results. On Linux, for hard drives, it can perform similarly to O_DIRECT + fsync() after every write. But as soon as you are doing grouped writes, O_DIRECT + a single fsync() at the end of the group is almost always faster.

For SSDs, though, O_DIRECT + O_DSYNC can be faster than O_DIRECT + fsync() at the end of the group, if you are pipelining your IO, e.g. you encrypt and checksum the next batch of sectors while you wait for the previous batch of checksummed and encrypted sectors to be written out. Because SSDs are so much faster, you can actually afford to slow down the write a little by using O_DSYNC, so that the write does not outpace the related CPU work.

[1] https://github.com/ronomon/direct-io

A more advanced (and somewhat easy to get wrong) option is sync_file_range combined with fdatasync, which lets you roughly emulate O_DSYNC overall but without blocking synchronously on IO.

  • sync_file_range is different to fsync, fdatasync and O_DSYNC in that it does not flush the disk write cache (whereas the latter explicitly do on newer kernels):

    https://linux.die.net/man/2/sync_file_range:

      This system call does not flush disk write caches and thus does not
      provide any data integrity on systems with volatile disk write caches.

    • Hence "sync_file_range combined with fdatasync". sync_file_range is only useful for getting the kernel to start the write-out now.