
Comment by shellac

4 years ago

Variants of the fsync story have been going on for decades now. The framing varies, but typically somebody is benchmarking IO (often in the context of database benchmarking) and discovers a curious variance between OSes.

On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?

> On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?

That depends on the drive having power loss protection, which most of the time comes in the form of a capacitor that powers the drive long enough to guarantee that its buffers are flushed to persistent storage.

Consumer SSDs often do not have that, so flushing really matters there, at least if you care about your data or about avoiding filesystem corruption.

Enterprise SSDs almost always have power loss protection, so there flushing isn't required for consistency's sake, although in-flight data that hasn't hit the block device yet is naturally not protected by it; most filesystems handle that fine by default, though.

Note that Linux, for example, writes back dirty data periodically (after 30 s by default) independent of caching/flush settings, so that's normally the upper limit on what you'd lose; depending on the workload, that can still be a relatively long time frame.

https://sysctl-explorer.net/vm/dirty_expire_centisecs/
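
For concreteness, a minimal sketch (assuming a Linux /proc filesystem; on stock kernels dirty_expire_centisecs defaults to 3000, i.e. 30 s, and the flusher wake-up dirty_writeback_centisecs to 500, i.e. 5 s) that just reads those tunables:

  /* Sketch: print the kernel writeback tunables (values are centiseconds). */
  #include <stdio.h>

  static void print_centisecs(const char *path)
  {
      FILE *f = fopen(path, "r");
      long cs;

      if (f && fscanf(f, "%ld", &cs) == 1)
          printf("%s = %ld (%.1f s)\n", path, cs, cs / 100.0);
      if (f)
          fclose(f);
  }

  int main(void)
  {
      print_centisecs("/proc/sys/vm/dirty_expire_centisecs");
      print_centisecs("/proc/sys/vm/dirty_writeback_centisecs");
      return 0;
  }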

  • Those VM tunables are about dirty OS cache, not dirty drive cache. If you fsync() a file on Linux it will be pushed to the drive and (if the drive does not have battery/capacitor-backed cache) flushed from drive cache to stable storage. If you don't fsync() then AIUI all bets are off, but in practice the drive will eventually get around to flushing your data anyway. The OS has one timeout for cache flushes and the drive should have another one, one would hope.
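
    A minimal sketch of that pattern (hypothetical path, error handling kept short): write the data, then fsync() so the kernel writes back the dirty pages and asks the drive to flush its volatile write cache:

      /* Sketch: durable write on Linux. fsync() writes back dirty pages for
       * this file and requests a flush of the drive's volatile write cache.
       * For a newly created file, fsync() on the containing directory is
       * also needed to persist the directory entry. */
      #include <fcntl.h>
      #include <unistd.h>

      int write_durably(const char *path, const char *buf, size_t len)
      {
          int fd = open(path, O_WRONLY | O_CREAT, 0644);
          if (fd < 0)
              return -1;
          if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
              close(fd);
              return -1;
          }
          return close(fd);
      }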

    • As you noted, Apple's fsync() behavior is defensible if PLP is assumed. Committing through the PLP cache isn't how these drives are meant to operate - hence the poor behavior of F_FULLFSYNC.

      But this isn't specific to Macs and iDevices. Some non-PLP drives also struggle with sync writes on FreeBSD [1]. Most enterprises running an RDBMS mandate PLP for both performance and reliability. I understand why this is frustrating for those porting Linux, but Apple is allowed to make strong assumptions about how their hardware interoperates.

      [1] https://www.truenas.com/community/threads/slog-and-power-los...
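
      For reference, a minimal sketch of the macOS side (the fcntl command is F_FULLFSYNC; falling back to fsync() when it is unsupported is an assumption here, not something the thread prescribes): fsync() pushes data to the drive, while fcntl(fd, F_FULLFSYNC) additionally asks the drive to flush its cache to stable media, which is where the cost shows up on non-PLP drives:

        /* Sketch: full durability on macOS. fcntl(F_FULLFSYNC) requests a
         * drive-cache flush on top of what fsync() does; fall back to plain
         * fsync() if the filesystem does not support it. */
        #include <fcntl.h>
        #include <unistd.h>

        int full_sync(int fd)
        {
        #ifdef F_FULLFSYNC
            if (fcntl(fd, F_FULLFSYNC) == 0)
                return 0;
            /* F_FULLFSYNC can fail on some filesystems; fall through. */
        #endif
            return fsync(fd);
        }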

On this NVMe, flushing is slower than on some spinning disks, so it apparently matters.

  • Yes, I would have skipped the fsync thing, which carries a lot of baggage, and concentrated on this.

    Btw, are you sure those spinning disks are actually flushing to rust? Caches all the way down... ;-)

    • I mean, typical seek time on rust is O(10ms) and these controllers are spending 20ms flushing a few sectors. Obviously rust would do worse if you have the cache full of random writes, though. The problem here is the huge base cost.
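
      If you want to see that base cost directly, a rough sketch (hypothetical test file, Linux, timed with clock_gettime): loop small writes, each followed by fsync(), and look at the per-iteration latency:

        /* Sketch: measure per-fsync latency for small writes. On drives with
         * a large flush base cost this sits in the tens of milliseconds
         * regardless of how little data was written. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("fsync-test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return 1;

            char buf[512] = {0};
            struct timespec t0, t1;
            const int iters = 100;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < iters; i++) {
                if (write(fd, buf, sizeof buf) < 0 || fsync(fd) != 0)
                    return 1;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                      + (t1.tv_nsec - t0.tv_nsec) / 1e6;
            printf("%.2f ms per write+fsync\n", ms / iters);
            close(fd);
            return 0;
        }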