Comment by supermatt

4 years ago

It may be interpreting it differently. You aren't comparing apples to apples, quite literally.

Why not compare macOS and Linux on approved x86 Mac hardware, i.e. a Fusion Drive or whatever?

Also, as suggested, try F_BARRIERFSYNC, which flushes anything written before the barrier (used for WALs, IIRC).
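
For illustration, a minimal sketch of that suggestion, assuming a macOS build where the F_BARRIERFSYNC and F_FULLFSYNC fcntl commands from <fcntl.h> are available (the helper name is made up, not from this thread):

    /* The suggestion above: use F_BARRIERFSYNC rather than a full flush.
     * It writes out dirty data and inserts an I/O barrier so later writes
     * can't be reordered ahead of it (the WAL use case), but it does not
     * force the drive to empty its volatile cache the way F_FULLFSYNC does. */
    #include <fcntl.h>
    #include <unistd.h>

    static int wal_sync(int fd) {
    #ifdef F_BARRIERFSYNC
        return fcntl(fd, F_BARRIERFSYNC);   /* barrier semantics, cheaper */
    #else
        return fcntl(fd, F_FULLFSYNC);      /* no barrier op available: full flush */
    #endif
    }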

This affects T2 Macs too, which use the same NVMe controller design as M1 Macs.

We've looked at NVMe command traces from running macOS under a transparent hypervisor. We've issued NVMe commands outside of Linux from a bare-metal environment. The 20ms flush penalty is there for Apple's NVMe implementation; it's not some OS thing, and other drives don't have it. I also checked, and Apple's NVMe controller is doing 10MB/s of DRAM memory traffic when issued flushes, for some reason (yes, we can get those stats).

We also know macOS does not properly flush with just fsync(), because it actively loses data on hard shutdowns. We've been fighting this issue for a while now; it's just that it only hit us yesterday/today that there is no magic in macOS - it just doesn't flush, and doesn't guarantee data persistence, on fsync().
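
One quick way to see both claims on a Mac is to time a single fsync() against a single F_FULLFSYNC on the same file. A sketch under those assumptions (the file name and 4 KiB write size are arbitrary; this is not the thread's actual test setup):

    /* Write one block, then time fsync() vs fcntl(F_FULLFSYNC).
     * If each real drive-cache FLUSH costs ~20ms, the F_FULLFSYNC timing
     * should land around that, while plain fsync() returns much faster
     * because it never asks the drive to flush its cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static double ms_between(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        char buf[4096];
        memset(buf, 'x', sizeof buf);

        int fd = open("flush_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        write(fd, buf, sizeof buf);

        struct timespec t0, t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fsync(fd);                      /* no drive-cache flush on macOS */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fcntl(fd, F_FULLFSYNC);         /* forces the drive to flush its cache */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("fsync: %.3f ms   F_FULLFSYNC: %.3f ms\n",
               ms_between(t0, t1), ms_between(t1, t2));
        close(fd);
        return 0;
    }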

  • I've just been scanning through Linux kernel code (including ext4). Are you sure that it's not issuing a PREFLUSH? What are your barrier options on the mount? I think you will find these behave more like F_BARRIERFSYNC.

    I couldn't find much info about it, but the official docs are here: https://kernel.org/doc/html/v5.17-rc3/block/writeback_cache_...

    • Those are Linux concepts. What you're looking for is the actual NVMe commands. There are two things: FLUSH (which flushes the whole cache), and a WRITE with the FUA bit set (which basically turns that write into a write-through, but does not guarantee anything about other commands). The latter isn't very useful for most cases, since you usually want at least barrier semantics, if not a full flush, for previously completed writes. That leaves you with FLUSH, which is the one that takes 20ms on these drives (a sketch of issuing a raw FLUSH from Linux follows after this thread).

      4 replies →
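
For concreteness, that FLUSH can be issued from Linux with no filesystem involved, via the kernel's NVMe passthrough ioctl. A minimal sketch, assuming an example device path of /dev/nvme0n1 and root privileges (roughly what `nvme flush /dev/nvme0n1` from nvme-cli does):

    /* Send a raw NVMe Flush (opcode 0x00) to a namespace block device,
     * bypassing the page cache and filesystem entirely. */
    #include <fcntl.h>
    #include <linux/nvme_ioctl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/nvme0n1", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int nsid = ioctl(fd, NVME_IOCTL_ID);      /* namespace ID of this device */
        if (nsid < 0) { perror("NVME_IOCTL_ID"); return 1; }

        struct nvme_passthru_cmd cmd;
        memset(&cmd, 0, sizeof cmd);
        cmd.opcode = 0x00;                        /* NVMe Flush command */
        cmd.nsid   = (unsigned)nsid;

        if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0)
            perror("NVME_IOCTL_IO_CMD");
        else
            puts("flush completed");

        close(fd);
        return 0;
    }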

It seems to be pretty apples to apples: they're running the same benchmark using equivalent data storage APIs on both systems. What are you thinking might be different? That the Linux+WD drive isn't making the data durable? Or that macOS does something stupid which could be the cause of the slowdown, rather than the drive? Both seem implausible.
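
The comparison being defended here boils down to a loop of small writes, each made durable with the platform's strongest primitive (fsync() on Linux ext4, fcntl(F_FULLFSYNC) on macOS). A minimal sketch of that kind of test, with a made-up file name and iteration count rather than the article's actual benchmark:

    /* N small writes, each followed by the platform's full-durability call,
     * reporting durable writes per second and average latency per flush. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static int durable_sync(int fd) {
    #ifdef __APPLE__
        return fcntl(fd, F_FULLFSYNC);   /* macOS: flush the drive cache */
    #else
        return fsync(fd);                /* Linux: ext4 issues PREFLUSH/FUA here */
    #endif
    }

    int main(void) {
        enum { N = 200 };
        char buf[4096];
        memset(buf, 'x', sizeof buf);

        int fd = open("bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) { perror("write"); return 1; }
            if (durable_sync(fd) < 0) { perror("sync"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d durable writes in %.2fs (%.1f/s, %.2f ms each)\n",
               N, s, N / s, 1e3 * s / N);
        close(fd);
        return 0;
    }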