Comment by marcan_42

4 years ago

Yes. fsync() on Linux pushes down to stable storage, not just drive cache.

OpenBSD, though, apparently behaves like macOS. I'm not sure I like that.

Last time I checked (which is a while ago at this point, pre-SSD), nearly all consumer drives and even most enterprise drives would lie in response to commands to flush the drive cache. Working on a storage appliance at the time, the specifics of a major drive manufacturer's secret SCSI vendor-page knock to actually flush their cache were one of the things under their deepest NDAs. Apparently ignoring cache flushes was so ubiquitous that any drive manufacturer looking to have correct semantics would take a beating in benchmarks and lose market share. : \

So, as of about 2014, any difference here not backed by per-manufacturer secret knocks or NDA'd, one-off drive firmware was just a magic show, with perhaps Linux at least being able to say "hey, at least the kernel tried, and it's not our fault". The cynic in me thinks that the BSDs continuing to define fsync() as only hitting the drive cache is to keep a semantically clean pathway for "actually flush" that storage appliance vendors can stick on the side of their kernels but can't upstream because of the NDAs. A sort of dotted line around missing functionality that is obvious if you know to look for it.

It wouldn't surprise me at all if Apple's NVMe controller is the only drive you can easily put your hands on that actually does the correct thing on flush, since they're pretty much the only ones without the perverse market pressure to intentionally not implement it correctly.

Since this is getting updoots: sort of in defense of the drive manufacturers (or at least stating one of the defenses I heard), they try to spec out the capacitance on the drive so that when the controller gets a power-loss NMI, it generally has enough time to flush then. That always seemed like a stretch for spinning rust (the drive motor itself is quite a chonker in the watt/ms range being talked about, particularly considering seeks are in the 100 ms range to start with; but they also have pretty big electrolytic caps on spinning rust, so maybe they can go longer?), but this might be less of a white lie for SSDs. If they can stay up for 200 ms after power loss, I can maybe see them being able to flush the cache. Gods help those HMB drives, though; I don't know how you'd guarantee access to the host memory used for cache on power loss without a full-system approach to what power loss looks like.

  • Flush with other vendors at least does something, as they block for some time too, just not as long as Apple.

    The Apple implementation is weird because the actual amount of data written doesn't seem to affect flush time.

    • On at least one drive I saw, the flush command was instead interpreted as a barrier on commands being committed to the log in controller DRAM, which could cut into parallelization, and therefore throughput: it looked like a latency spike, not a flush of the cache.


  • In my benchmarking of some consumer HDDs, back in 2013 or so, the flush time was always what you'd expect based on the drive's RPM. I saw no evidence the drive was lying to me. These were all 2.5" drives.

    My understanding was that the capacitor thing on HDDs is to ensure the drive completely writes out a whole sector, so it passes the checksum. I only heard the flush-cache thing with respect to enterprise SSDs. But I haven't been staying on top of things.

So it's basically implementation-specific, and macOS has its own way of handling it.

That doesn't make it worse - in fact, it permits the flexibility you are now struggling with.

edit: downvotes for truth? Nice. Go read the POSIX spec, then come back and remove your downvotes...

  • Probably downvoted more because it misses the point.

    Sure, fsync allows that behavior, but it's also so widely misunderstood that a lot of programs which should do a "full" flush only do an fsync, including benchmarks. In which case the results are not comparable, and presenting them as if they were is cheating.

    But that's not the point!

    The point is that with the M1 Macs' SSDs, the performance when fully flushing to disk is abysmally bad.

    And as such, any application that cares about data integrity and does a full flush can expect noticeable performance degradation.

    The fact that Apple neither forces frequent full syncs nor at least a full sync when an application is closed doesn't make it better.

    Though it is also not surprising, as it's not the first time Apple has set things up under the assumption that their hardware is infallible.

    And maybe for a desktop-focused, high-end design where most devices sold are battery-powered, that is a reasonable design choice.

    • "And maybe for a desktop-focused, high-end design where most devices sold are battery-powered, that is a reasonable design choice"

      Does the battery last forever? Do they never shut down from overheating, never shut down from being too cold, never freeze up? Are they water- and coffee-proof?

      Talk to anyone who repairs Macs about how high-end and reliable their designs truly are - they are better than bottom-of-the-barrel craptops, sure, but not particularly amazing, and they have some astounding design flaws.


    • I mean, the Apple hardware in question is usually a laptop, which has its own very well-instrumented battery backup. In most cases the hardware knows well in advance if the battery is gonna run dry.

      And yes, the hardware is fallible. But the kind of failure that would cause the device to completely lose power is extremely rare. The OS has many chances to take the hint and flush the cache before powering down.

      Note: this is pure conjecture.

    • > The point is that with the M1 Macs' SSDs, the performance when fully flushing to disk is abysmally bad.

      How sure are we the drives that flush caches more quickly are actually flushing the caches?


Something that is not quite clear to me yet (I did read the discussion below, thank you Hector for indulging us, very informative): isn't the end behaviour up to the drive controller? That is, how can we be sure that Linux actually pushes to stable storage, and isn't it possible that the controller cheats? For example, you mention the USB drive test on a Mac - how can we know that the USB stick controller actually does the full flush?

Regardless, I certainly agree that the performance hit seems excessive. Hopefully it's just an algorithmic issue and Apple can fix it with a software update.

*BSDs mostly followed these semantics, as I recall. Probably inherited from a common ancestor.

  • MacOS was really just FreeBSD with a fancier UI. I'm not sure what the behavior is now, but I'm pretty sure FreeBSD behaved almost exactly the same, since a power loss rendered my system unbootable over 10 years ago.

    • >MacOS was really just FreeBSD with a fancier UI.

      I'm sorry, but this is incorrect. NeXTSTEP was the primary foundation for Mac OS X, and the XNU kernel was derived from Mach and, IIRC, 4.4BSD. FreeBSD source was certainly an important jumping-off point for a number of Unix components of the kernel and the CLI userland, and there was some code sharing going on for a while (still?), but large components of the kernel and core frameworks were unique (for better or worse).


Linux does that now. It didn't in the past (circa 2008), and I recall many people arguing about the performance implications at the time :D

I like that. fsync() was designed with the block cache in mind. IMO, how the underlying hardware handles durability is its own business. I think a hack to issue a "full fsync" when the battery is below some threshold is a good compromise.