This F_FULLFSYNC behaviour has been like this on OSX for as long as I can remember. It is a hint that ensures the data in the write buffer has been flushed to stable storage; historically this is a limitation of fsync that is being accounted for. Are you 1000% sure it does what you expect on other OSes?
Maybe it's an unrealistic expectation for all OSes to behave like Linux.
Maybe Linux's fsync is more like F_BARRIERFSYNC than F_FULLFSYNC. You could retry your benchmarks with those.
Also note that 3rd-party drives are known to ignore F_FULLFSYNC, which is why there is an approved list of drives for Mac Pros. This could explain why you are seeing different figures if you are issuing F_FULLFSYNC in your benchmarks on those 3rd-party drives.
Last time I checked (which is a while ago at this point, pre-SSD), nearly all consumer drives and even most enterprise drives would lie in response to commands to flush the drive cache. When I was working on a storage appliance at the time, the specifics of a major drive manufacturer's secret SCSI vendor-page knock to actually flush their cache was one of the things under their deepest NDAs. Apparently ignoring cache flushes was so ubiquitous that any drive manufacturer looking to have correct semantics would take a beating in benchmarks and lose market share. : \
So, as of about 2014, any flush behaviour here not backed by per-manufacturer secret knocks or NDA'd, one-off drive firmware was just a magic show, with perhaps Linux at least being able to say "hey, at least the kernel tried and it's not our fault". The cynic in me thinks that the BSDs continuing to define fsync() as only hitting the drive cache is to keep a semantically clean pathway for "actually flush" that storage appliance vendors can stick on the side of their kernels but can't upstream because of the NDAs. A sort of dotted line around missing functionality that is obvious if you know to look for it.
It wouldn't surprise me at all if Apple's NVME controller is the only drive you can easily put your hands on that actually does the correct things on flush, since they're pretty much the only ones without the perverse market pressure to intentionally not implement it correctly.
Since this is getting updoots: Sort of in defense of the drive manufacturers (or at least stating one of the defenses I heard), they try to spec out the capacitance on the drive so that when the controller gets a power loss NMI, they generally have enough time to flush then. That always seemed like a stretch for spinning rust (the drive motor itself was quite a chonker in the watt/ms range being talked about particularly considering seeks are in the 100ms range to start with, but also they have pretty big electrolytic caps on spinning rust so maybe they can go longer?), but this might be less of a white lie for SSDs. If they can stay up for 200ms after power loss, I can maybe see them being able to flush cache. Gods help those HMB drives though, I don't know how you'd guarantee access to the host memory used for cache on power loss without a full system approach to what power loss looks like.
Something that is not quite clear to me yet (I did read the discussion below, thank you Hector for indulging us, very informative): isn't the end behaviour up to the drive controller? That is, how can we be sure that Linux actually does push to the storage or is it possible that the controller cheats? For example, you mention the USB drive test on a Mac — how can we know that the USB stick controller actually does the full flush?
Regardless, I certainly agree that the performance hit seems excessive. Hopefully it's just an algorithmic issue and Apple can fix it with a software update.
I like that. Fsync() was designed with the block cache in mind. IMO how the underlying hardware handles durability is its own business. I think a hack to issue a “full fsync” when battery is below some threshold is a good compromise.
It's important to read the entire document including the notes, which informs the reader of a pretty clear intent (emphasis mine):
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.
This seems consistent with user expectations - fsync() completion should mean data is fully recorded and therefore power-cycle- or crash-safe.
At least it is also implemented by Windows, which makes apt-get slower in a Hyper-V VM.
It's also unbearably slow for loopback-device-backed Docker containers in the VM, due to the double layer of caching. I just add eatmydata happily, because you can't save a half-finished Docker image anyway.
OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it? I don't have one at hand but IIRC it did.
At least the OSX man page admits to the detail.
The rationale in the POSIX document for a null implementation seems reasonable (or at least plausible), but it does not really seem to apply to general OSX systems at all. So even if they didn't define _POSIX_SYNCHRONIZED_IO it would be against the spirit of the specification.
I'm actually curious why they made fsync do anything at all though.
The implication (in fact no, it's explicitly stated) is that this fsync() behaviour on OSX will be a surprise for developers working on cross-platform code or coming from other OSes, and will catch them out.
However, if in fact it's quite common for other OSes to exhibit the same or similar behaviour (BSD for example does this too, which makes sense as OSX has a lot of BSD lineage), that argument of least surprise falls a bit flat.
That's not to say this is good behaviour, I think Linux does this right, the real issue is the appalling performance for flushing writes.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system.
> fsync() might or might not actually cause data to be written where it is safe from a power failure.
> The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.
then:
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.
The only reason to doubt the clarity of the above is that POSIX does not consider crashes and power failures to be in scope. It says so right in the quoted text.
Crashes and power failures are just not part of the POSIX worldview, so in POSIX there can be no need for sync(2) or fsync(2), or fcntl(2) w/ F_FULLFSYNC! Why even bother having those system calls? Why even bother having the spec refer to the concept at all?
Well, the reality is that some allowance must be made for crashes and power failures, and that includes some mechanism for flushing caches all the way to persistent storage. POSIX is a standard that some real-life operating systems aim to meet, but those operating systems have to deal with crashes and power failures because those things happen in real life, and because their users want the operating systems to handle those events as gracefully as possible. Some data loss is always inescapable, but data corruption would be very bad, which is why filesystems and applications try to do things like write-ahead logging and so on.
That is why sync(2), fsync(2), fdatasync(2), and F_FULLFSYNC exist. It's why they [well, some of them] existed in Unix, it's why they still exist in Unix derivatives, it's why they exist in Unix-alike systems, it's why they exist in Windows and other not-remotely-POSIX operating systems, and it's why they exist in POSIX.
If they must exist in POSIX, then we should read the quoted and linked page, and it is pretty clear: "transferred to the storage device" and "intended to force a physical write" can only mean... what that says.
It would be fairly outrageous for an operating system to say that since crashes and power failures are outside the scope of POSIX, the operating system will not provide any way to save data persistently other than to shut down!
> the fsync() function is intended to force a physical write of data from the buffer cache
If they define _POSIX_SYNCHRONIZED_IO, which they don't.
fsync wasn't defined as requiring a flush until version 5 of the spec. It was implemented in BSDs loooong before then. Apple introduced F_FULLFSYNC prior to fsync having that new definition.
I don't disagree with you, but it is what it is. History is a thing. Legacy support is a thing. Apple likely didn't want to change people's expectations of the behaviour on OSX - they have their own implementation after all (which is well documented, lots of portable software and libs actively use it, and it's built in to the higher-level APIs that Mac devs consume).
If you need to run software/servers with any kind of data consistency/reliability on OS X this is definitely something you should be aware of and will be a footgun if you're used to Linux.
Macs in datacentres are becoming increasingly common for CI, MDM, etc.
The history is also interesting. It's not that "macOS cheats", but that it sincerely inherited the status quo of many years, then tried to go further by adding F_FULLFSYNC. However, Linux since got better, leaving macOS stuck in the past and everybody surprised. It's a big problem.
The docs [1] suggest that even F_FULLFSYNC might not be enough. Quote:
> Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.
When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
Having a separate syscall is annoying, but workable. Having a scenario where we call flush and cannot be sure the data actually became durable is BAD.
Note that handling flush failures is expected, but all databases require that flushing successfully will make the data durable.
Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
The best the OS can do is to trust the device that the data was, indeed, written to durable storage. Unfortunately, many devices lie about that. If you do an `F_FULLFSYNC`, you can say you did your best, but the data is out of your hands now.
> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
No, not without that. Even with that, you can't have durable writes; not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).
As soon as you have two database servers, you're in much better shape, and many databases like to try and use fsync() as a barrier for that replication, but this is a waste of time because your chances of a single hardware failure remain the same -- the only thing that really matters is that 1/2 is smaller than 1/1.
So okay, maybe you're not trying to protect against all hardware failure, or even just the flash failure (it will fail when it fails! better to have two nvme boards than one!) but maybe just some failure -- like a power failure, but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability because that's not the failure you're trying to protect against.
What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows. The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days maybe it made some sense to add a system call instead of adding more hardware. But modern systems aren't like that: it is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one. Maybe, if there's a case where fsync() helps today, it's a situation where that isn't true -- but even that is a long way from needing fsync() to have durable writes and avoid data loss or corruption.
> The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
Yeah you can definitely write a transactional database without having to rely on knowing you've flushed data to disk. Not only can you, but you surely have to otherwise you risk data corruption e.g. when there's a power-cut mid-write.
That's not a defence. It fails the principle of least surprise. If everyone's experience is that fsync flushes, then why would somebody think to look up the docs for Mac in case they do it differently?
These machines are actually low-power enough that you could implement a last-gasp flush mechanism. The Mac Mini already survives 1-2 seconds without AC power (at least if idle). You could plausibly detect AC power being yanked and immediately power down all downstream USB/TB3 devices and the display (on iMacs), freeze all CPUs into idle, and have plenty enough reservoir cap to let NVMe issue a flush.
But they aren't doing that. I tested it on the Mac Mini. It loses several seconds of fsync()ed data on hard shutdown.
This does require a last-gasp indication from the PSU to the rest of the system, so if they don't have that, it's not something they could add in a firmware update.
Hmm, as slow as that is, does the controller support VERIFY? Because there is FUA in VERIFY, which forces the range to flush as well, and it could be used as a range flush. Depending on how they implement the disk cache, it's possible that is faster than a full cache walk (which is likely what they are doing).
This is one of those things that SCSI was much better at: SYNC CACHE had a range option which could be used to flush, say, particular files/database tables/objects/whatever to nonvolatile storage. Of course, out of the box Linux (and most other OSes) doesn't track its page/buffer caches closely enough to pull this off, so fsync(fileno) is closer to sync(). So few storage systems implemented it properly anyway.
The choice of ignoring flushes vaguely makes sense if you assume the Mac's SSD is in a laptop with a battery. In theory the disk cache is then non-volatile (and this assumption is made on various enterprise storage arrays with battery backup as well, although frequently it's a controller setting). But I'm guessing someone just ignored the case of the Mac Mini without a battery.
I assumed the barrier was doing something like that, but marcan was able to inspect the actual NVMe commands issued and has confirmed that's not the case.
But that would be awesome, especially with these ever growing cache capacities.
APFS at least has metadata checksums to prevent that. However it does not do data checksums (weird decision...), despite being a CoW fs with snapshotting, similar to ZFS and btrfs.
What confuses me about this is: why are they so slow with F_FULLFSYNC? Since that's the equivalent of what non-Apple NVMe drives do under, say, Linux, and they manage to be much faster.
The OS does not matter; it's strictly about the drive. macOS on a non-Apple SSD should be equally fast with F_FULLFSYNC.
Indeed, I would very much like to know what on earth the ANS firmware is doing on flushes to make them so hideously slow. We do have the firmware blobs (for both the NVMe/ANS side and the downstream S5C NAND device controllers), so if someone is bored enough they could try to reverse engineer it... it also seems there's a bunch of debug mode options, so maybe we can even get some logs at some point.
Variants of the FSYNC story have been going on for decades now. The framing varies, but typically somebody is benchmarking IO (often in the context of database benchmarking) and discovers a curious variance by OS.
On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?
> On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?
That depends on the drive having power loss protection, which comes most of the time in the form of a capacitor that powers the drive long enough to guarantee that its buffers are flushed to persistent storage.
Consumer SSDs often do not have that, so flushing is really important there, at least if your data, or avoiding FS corruption, matters to you.
Enterprise SSDs almost always have power loss protection, so there it isn't required for consistency's sake, albeit in-flight data that didn't hit the block device yet is naturally not protected by that; most filesystems handle that fine by default though.
Note that Linux, for example, does by default a periodic flush every 30s independent of caching/flush settings, so that's normally the upper limit you'd lose, depending on the workload it can be still a relatively long time frame.
Those VM tunables are about dirty OS cache, not dirty drive cache. If you fsync() a file on Linux it will be pushed to the drive and (if the drive does not have battery/capacitor-backed cache) flushed from drive cache to stable storage. If you don't fsync() then AIUI all bets are off, but in practice the drive will eventually get around to flushing your data anyway. The OS has one timeout for cache flushes and the drive should have another one, one would hope.
Think about what's going on in the controller of any page-addressed SSD.
You have wear leveling trying to keep things from blowing holes in certain physical pages. In certain cell architectures you can only write to pages that have previously been erased. Once you do write the data to the silicon... it's not really written anyway, because the tables and data structures that map that to the virtual table the host sees on boot also have to be written.
It is entirely reasonable that a system that does 100k honest sustained write I/O per second would come to its knees if you're insistent enough to actually want a full, real, power cycle proof, sync.
To do an actual full sync, where it could come back from power off... requires flushing all of those layers. Nothing is optimized to do that. I'm amazed that it can happen 40 times per second.
It's possible that you could speed this up a bit, but somewhere there's an actual non-wear leveled single page of data that tells the drive how to remap things to be useful... I strongly suspect writing that page frequently would eat the drive life up in somewhere between 0.1 and 20 million cycles. After that point, the drive would be toast.
I agree with the other thread that actually flushing is likely to be a very, very well guarded bit of info.
Good question. I just started up a loop doing USB-PD hard reboots on my MBA every 18 seconds (that's about one second into the desktop with autologin on, where it should still be doing stuff in the background). Let's see if it eats itself.
Laptops are fine unless your battery has issues and you get occasional power losses, which seems to be not too uncommon for third-party batteries (which themselves are not too uncommon since Apple will charge you an arm and a leg to replace half your laptop if you have a defective battery).
Bad batteries generally allow for last-gasp handling, and I've definitely seen the SMC throw a fit on some properties a few seconds before shutdown due to the battery being really dead. Not sure if macOS handles this properly, but I'd hope it does, and if it doesn't they could certainly add the feature. It would be quite an extreme case to have a battery failure be so sudden the voltage doesn't drop slowly enough to invoke this.
Does anyone here run a desktop Mac without a battery backup device?
All of my Macs are either laptops or have a hardware backup device, so unlikely a write would be lost due to power failure (unless backup device failed which could happen).
Laptops have batteries, so an AC power failure doesn't mean they immediately crash: they just keep running on battery until the battery gets low, at which point the system cleanly hibernates.
As a laptop user, I would probably make the same choice as Apple here. I like the idea mentioned of a tunable parameter so that you only ever lose at most 1 second of data.
Although, I also have the seemingly rare opinion here that ECC ram doesn't really matter on a laptop or desktop.
You presumably don't reboot your laptop by connecting a USB-PD gadget that issues a hard reset. A normal OS reboot is fine, that will flush the cache.
The most common situation where this would affect laptops, in my experience so far, would be a broken driver causing a kernel lockup (not a panic) which triggers a watchdog reboot. That situation wouldn't allow for an NVMe flush.
In my use. Yes. I didn’t realize this was the reason until I saw this thread, and now I’ve tested it. Luckily, I don’t do massive data transfers nor do I do any large data work. When I got my M1 Mac Mini, however, I did and had immediate buyer’s remorse. I thought that I/O must be terrible on this thing, and I felt cheated. After the initial stand-up, I wasn’t so angry. For most tasks, it’s faster than my old TR4 1950X.
Sure, but I do not think it's due to their feeling that their own software is inferior. I think much more of that is cost. They needn't pay to develop yet another OS variant, and instead benefit from the open source community and their past contributions to said community.
They wouldn't really need to develop a variant. Plenty of people used to run servers on macOS just configured to be headless. It just doesn't meet the standard anymore.
The funny thing here is that battery-backed enterprise systems are worse off in that manner, because you're much more likely to notice a dying battery that your entire device relies on than the little battery pack hooked up to your RAID array.
Sure, you could write a program that periodically checks the battery status (you'd have to poll, since there's no ACPI notification like with a "device battery") and sends an email to the admin or something. However, that's a tool that doesn't "exist" (as in, there isn't a notable program that does so), which possibly hints that this isn't something system admins often do.
The above also requires there to be an interface available from userland, not only in the management firmware or BIOS/UEFI. That exists for HP, but I'm not sure all other OEMs do so.
To emulate a flushing SSD, the signal really needs to go directly to the SSD firmware so it can decide which is the last OS write it can accept while still having enough power to persist all write and flush requests it has already accepted.
Getting all that right sounds so hard that it is probably better to just have enterprise SSDs include a built-in supercap to give 5 seconds or so of power to do all the necessary flushing, and have laptop/desktop-grade SSDs only offer barriers for data consistency. Laptop and desktop users don't care if they lose the last 1 second of data before a crash, as long as what is on the drive is self-consistent.
I should've been a little clearer; by "enterprise systems" I was referring to RAID controllers and the like. Though yes, I believe enterprise SSDs/NVMes likely have a capacitor or, as one friend put it, an "overkill battery" to use for flushing data.
To be fair though, I sidetracked from the discussion at hand. The issue Marcan described was regarding the OS -> Disk rather than a "power loss situation". The latter does play in with the former, but solving the latter doesn't necessarily solve the former.
Enterprise systems have monitoring through the BIOS, which will send an email, expose the status via SNMP, and offer other methods of monitoring (same as having a faulty fan).
Correct me if I'm wrong, but I wouldn't call the management engine (e.g. HP iLO) the BIOS. Whilst those may support such warnings:
1) Not everyone wants to use iLO or whatever equivalent another OEM provides.
2) Whilst such systems do support sending warnings about system components via email, dashboards, etc. that doesn't mean they'll necessarily warn about a RAID controller's battery being depleted. If I remember correctly, iLO4 doesn't.
3) What about RAID cards like the P420 (*not* the P420i) that either aren't hooked up to a management engine or are from an entirely separate OEM?
That's the first time I've heard of batteries [for RAID controllers] having an entirely separate port than that which hooks them up to the controller. Is this a "there are some of X" or have I just been out of the loop?
The dirty secret about today's high density NAND is that tPROG is not fast. It's an order of magnitude slower than the heyday of SLC. Now that doesn't really matter for enterprise drives, they complete writes into very fast storage that is made durable one way or another (e.g., flush on power fail), and this small store gets streamed out to the NAND log asynchronously. This is why random single queue depth durable writes can actually be faster than reads on enterprise drives, because random reads have to come from NAND (tREAD is still very fast, just not as fast as writing to DRAM).
Apple may not implement such a durable cache, that's fine it's not an enterprise device and it's a cost tradeoff. So they might have to flush to NAND on any FUA, and that's slow as we've said, but not 25ms slow. Modern QLC NAND tPROG latency is more like 2.5ms-5ms, which could just about explain the EVO results when you include the OS and SATA stack and drive controller.
There's pretty close to 0% chance Apple would have messed this up accidentally though, in my opinion. It would have been a deliberate design choice for some reason. One possible reason that comes to mind is that some drives gang a bunch of chips in parallel and you end up with pretty big "logical" pages. Flushing a big logical page on a 4kB write is going to cause a lot of write amp and drive wear, so you might delay for a short period (20ms) to try to pick up other writes and reduce your inefficiency.
Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them while creating the same write rate. The firmware is doing something dumb when issued a flush command, it's not just sitting around and waiting.
> There's pretty close to 0% chance Apple would have messed this up accidentally though, in my opinion
There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.
And then they put this hardware into desktops. Oops :-)
Lots of things about the M1 were rushed and have been fixed along the way. I wouldn't be in the least bit surprised if this were one more of them that gets fixed a couple macOS versions down the line, now that I've made some noise about it.
> Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them while creating the same write rate.
How are you measuring that and how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.
> The firmware is doing something dumb when issued a flush command, it's not just sitting around and waiting.
> There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.
Yes. It is clear the hardware was never optimized for it. Because it is so slow. I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice. It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.
NAND and the controller and FTL just isn't rocket science that you'd have hardware that can sustain the rates that Apple's can and then through some crazy unforeseen problem this would suddenly go slow. Flushing data out of your cache into the log is the FTL's bread and butter. It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush, it's the same hardware data and control paths, the same data structures in the FTL firmware and would use most of the same code paths even.
Pull blocks from the buffer in order and build pages, allocate pages in NAND to send them, update forward map, repeat.
The details get very complicated and proprietary. NAND wears out as you use it. But it also has a retention time: it gradually loses charge and won't read back if you leave it unpowered for long enough. This is actually where enterprise drives can be spec'd worse than consumer. So durability / lifetime is specified as meeting specified uncorrected error rates at the given retention period. The physics of NAND are pretty interesting too, as is how it translates into how a controller optimizes these parameters: temperature at various stages of operation and retention changes properties, and time between erase and program does too. You can adjust voltages on read, program, and erase, and those can help you read data out or change the profile of the data. Reading can disturb parts of other pages (similar to rowhammer). Multilevel cells are actually interesting; some of them you program in passes, so that's a whole other spanner in the works.
I don't know of a good place that covers all that, but much beyond "read/program/erase + wear + retention" is probably beyond "what every programmer should know".
The way you turn a bunch of NAND chips that have a "read/program/erase" programming model into something that has a read/write model (the flash translation layer or FTL) is a whole other thing again though. And all the endurance management and optimization, error correction... Pretty fascinating details really. The basic details though is that they use the same concepts as the "log structured filesystem", turns out a log structure with garbage collection is about a perfect it for turning the program/erase model into a random write model. That's probably what every programmer should know about that (assuming you know something about LSFs -- garbage collection, write amplification, forward and reverse mapping schemes, etc).
> Apple may not implement such a durable cache, that's fine it's not an enterprise device and it's a cost tradeoff.
I disagree with this - my Apple is an enterprise device. It's a Macbook Pro, issued by my employer, to do real work. I wouldn't give Apple a pass on this dimension. I get that the "Pro" label doesn't mean what it used to, but these aren't toys either.
Slightly related: if a drive runs with a properly journaled, fully checksummed filesystem, for example zfs or btrfs - does the write-through mode guarantee that you can only lose new data and not corrupt the old?
ZFS is not journaled. CoW eliminates the need for anything like a journal with the exception of synchronous IO, where an intent log is used that can be replayed after a power loss event.
In any case, ZFS should be fine as long as REQ_PREFLUSH is working properly. You can read a little about that here:
No, you won't see corruption on ZFS. Cutting power to the drive is always safe, you can slice a SATA cable with a guillotine if you want, you'll always see a consistent state of the filesystem. ZFS transactions are entirely atomic.
ZFS (and btrfs) is not "journaled", it's copy-on-write.
> Of course, in normal usage, this is basically never an issue on laptops; given the right software hooks, they should never run out of power before the OS has a chance to issue a disk flush command
I guess a UPS power backup would be useful. Laptops basically have a built-in UPS, which is perhaps why Apple has gone in that direction. I wonder if their high-end desktops with Apple Silicon will do something different there.
I dug a bit further and the NVMe controller is doing about 6.2MB/s of DRAM reads and 10MB/s of DRAM writes while doing a flush loop like this (which it isn't doing with the same traffic sans the flushes). I wonder if it's doing something dumb like linear scanning a cache hash table to find things to flush... or maybe something with bad cache locality?
I'm pretty sure, whatever it is, Apple could fix it in a firmware update.
> fsync() will both flush writes to the drive, and ask it to flush its write cache to stable storage.
Can someone explain what "flushing write cache to stable storage" means? Isn't that the same as "writes to the drive"? I am obviously not well versed in this area. Also, what is stable storage? Never heard that term before.
SSDs and other storage drives have two layers (or more). The last layer is stable storage (= when you disconnect power, no data is lost or corrupted). When you write to such a device, your writes are first made to an earlier layer that is more like your computer's main memory than actual storage (when you lose power, your data is gone or corrupted). Only after some time, or when the cache is full, is an actual persistent write made.
It's still not clear why the Apple SSD is so slow. Surely there's more to it. Maybe other SSDs are cheating in firmware? Or maybe it's just a bug in Apple's firmware? I'm really interested to see if there will be follow-ups on Apple's side.
Since this design is inherited from iDevices, my guess is they never bothered to optimize this command since software on a battery-powered device would almost never need to issue it. It should be something they can improve in firmware.
From my understanding, the thing that’s slow is writing data to “permanent storage” (aka the layer under all the caching).
Some storage tech is just slow at that, and manufacturers muddy the water by rating some (SSDs|Micro SDs|whatever) in GB/s overall when much of those big numbers are a combination of caches and trickery.
I would not be surprised if Apple is using a tech that just has slow write speeds in trade for fast read speeds since most Apple users will be happy with faster read speeds.
I'm not here to defend Apple, but if you have a desktop and you don't want to lose data, then get a UPS. Proper write handling on the disk won't help if you haven't saved your doc in ten minutes.
They do use system RAM as cache, but that has no effect on performance. If anything it should be way faster than the puny RAM cache chips on typical SSDs. It doesn't explain the slow flush perf.
Honestly I don’t know. The order-of-magnitude performance difference in deferring the flush feels worth it to me if the risk is mitigated to sudden power loss.
I would think when the last of Apple's hardware moves to ARM they'll make sure there's enough onboard battery for the flushes to happen reliably across form factors even if there's a power cut.
If anything, now that the reason for the performance difference has been identified, I’d hope to see numbers for Linux and Windows storage access come up to par with these numbers as they go down this road too (e.g. via the NVME flush toggle mentioned in the article).
Yeah. If the same thing happened on a brand-less garbage SSD you purchased from AliExpress, it would clearly be cheating, plain malice and incompetence; but the Apple tag certainly makes us believe there must be a second reason.
Trading correctness for performance without shouting "YOUR DATA IS NOT SAFE WHEN YOU DO THIS" at the users multiple times a day is benchmark snake oil. Period.
They just direct people to use Time Machine or iCloud, then look quizzically at you when you have an issue with writing off lost hours of work as a cost of doing business.
“You can lose some of your file changes in case of hard-reboot” is more correct.
It was a given truth for me all the time and I can tolerate some data losses if power was accidentally turned off for my desktop, or if OS panicked (it happens ~ once per year to me).
If this is a price for a 1000x speed increase - I’m more than happy they have implemented it this way.
You can lose some file changes even after asking the OS to make sure they don't get lost, the normal way.
That's a problem. It means e.g. transactional databases (which cannot afford to lose data like that) have a huge performance hit on these machines, since they have to use F_FULLFSYNC. And since that "no really, save my data" feature is not the standard fsync(), it means any portable software compiled for Linux will be safe, but will be unsafe on macOS, by default. That is a significant gotcha.
The question is why do other NVMe manufacturers not have such a performance penalty? 10x is fine; 1000x is not. This is something Apple should fix. It's a firmware problem.
I guess I am old but the assumption I live by is that if power is suddenly cut from a computer - no matter desktop or laptop - it can damage the FS and/or cause data loss.
For any mission critical stuff, I have it behind a UPS.
At least your thinking is old. Modern filesystems and databases are designed to prevent data loss in that scenario.
The last time I saw a modern filesystem eat itself on sudden power loss was when I was evaluating btrfs in a datacenter setting, and that absolutely told me it was not a reliable FS and we went with something else. I've never seen it happen with ext4 or XFS (configured properly) in over a decade, assuming the underlying storage is well-behaved.
OTOH, I've seen cases of e.g. data in files being replaced by zeroes and applications crashing due to that (it's pretty common that zsh complains about .zsh_history being corrupted after a crash due to a trailing block of zeroes). This happens when filesystems are mounted with metadata journaling but no data journaling. If you use data journaling (or a filesystem designed to inherently avoid this, e.g. COW cases), that situation can't happen either. Most databases would be designed to gracefully handle this kind of situation without requiring systemwide data journaling though. That's a tradeoff that is available to the user depending on their specific use case and whether the applications are designed with that in mind or not.
I've been using Macs (both desktop and laptops) since I have memory. I've had the M1 since launch day, and I use it all day, both for work and personal use.
Why has this never happened to me? Why don't I know anyone who has had this problem? Why is nobody complaining, as happened with the previous-gen keyboards?
I think we might be missing something in this analysis. I don't think Apple engineers are idiots.
Most people don't unplug their Mac Mini in the middle of working, and most users who do lose data after that happens would just think it's normal and not realize there is an underlying problem and modern OSes aren't supposed to do that.
I've seen APFS filesystems eat themselves in production (and had to do data recovery), twice. Apple don't have a perfect data integrity track record.
On a laptop, you would get data loss / corruption on sudden power loss. This is rare. With "flush to the storage device's RAM", even a kernel panic would not lose data, as long as you let the storage device flush to flash without losing power.
This F_FULLFSYNC behaviour has been like this on OSX for as long as I can remember. It is a hint that ensures the data in the write buffer has been flushed to stable storage; this is historically a limitation of fsync that is being accounted for. Are you 1000% sure it does as you expect on other OSes?
POSIX spec says no: https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...
Maybe it's an unrealistic expectation for all OSes to behave like Linux.
Maybe linux fsync is more like F_BARRIERFSYNC than F_FULLFSYNC. You can retry with those for your benchmarks.
Also note that 3rd-party drives are known to ignore F_FULLFSYNC, which is why there is an approved list of drives for Mac Pros. This could explain why you are seeing different figures if you are supplying F_FULLFSYNC in your benchmarks using those 3rd-party drives.
Yes. fsync() on Linux pushes down to stable storage, not just drive cache.
OpenBSD, though, apparently behaves like macOS. I'm not sure I like that.
Last time I checked (which is a while at this point, pre SSD) nearly all consumer drives and even most enterprise drives would lie in response to commands to flush the drive cache. Working on a storage appliance at the time, the specifics of a major drive manufacturer's secret SCSI vendor page knock to actually flush their cache was one of the things on their deepest NDAs. Apparently ignoring cache flushing was so ubiquitous that any drive manufacturer looking to have correct semantics would take a beating in benchmarks and lose marketshare. : \
So, as of about 2014, any difference here not being backed by per manufacturer secret knocks or NDAed, one-off drive firmware was just a magic show, with perhaps Linux at least being able to say "hey, at least the kernel tried and it's not our fault". The cynic in me thinks that the BSDs continuing to define fsync() as only hitting the drive cache is to keep a semantically clean pathway for "actually flush" for storage appliance vendors to stick on the side of their kernels that they can't upstream because of the NDAs. A sort of dotted line around missing functionality that is obvious 'if you know to look for it'.
It wouldn't surprise me at all if Apple's NVME controller is the only drive you can easily put your hands on that actually does the correct things on flush, since they're pretty much the only ones without the perverse market pressure to intentionally not implement it correctly.
Since this is getting updoots: Sort of in defense of the drive manufacturers (or at least stating one of the defenses I heard), they try to spec out the capacitance on the drive so that when the controller gets a power loss NMI, they generally have enough time to flush then. That always seemed like a stretch for spinning rust (the drive motor itself was quite a chonker in the watt/ms range being talked about particularly considering seeks are in the 100ms range to start with, but also they have pretty big electrolytic caps on spinning rust so maybe they can go longer?), but this might be less of a white lie for SSDs. If they can stay up for 200ms after power loss, I can maybe see them being able to flush cache. Gods help those HMB drives though, I don't know how you'd guarantee access to the host memory used for cache on power loss without a full system approach to what power loss looks like.
13 replies →
So it's basically implementation-specific, and macOS has its own way of handling it.
That doesn't make it worse; in fact it permits the flexibility you are now struggling with.
edit: downvotes for truth? nice. go read the posix spec then come back and remove your downvotes...
35 replies →
Something that is not quite clear to me yet (I did read the discussion below, thank you Hector for indulging us, very informative): isn't the end behaviour up to the drive controller? That is, how can we be sure that Linux actually does push to the storage or is it possible that the controller cheats? For example, you mention the USB drive test on a Mac — how can we know that the USB stick controller actually does the full flush?
Regardless, I certainly agree that the performance hit seems excessive. Hopefully it's just an algorithmic issue and Apple can fix this with a software update.
*BSDs mostly followed this semantic, as I recall. Probably inherited from a common ancestor.
8 replies →
Linux does that now. It didn't in the past (something like 2008), and I recall many people arguing about performance or similar at that time :D
I like that. Fsync() was designed with the block cache in mind. IMO how the underlying hardware handles durability is its own business. I think a hack to issue a “full fsync” when battery is below some threshold is a good compromise.
It's important to read the entire document including the notes, which informs the reader of a pretty clear intent (emphasis mine):
> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.
This seems consistent with user expectations - fsync() completion should mean data is fully recorded and therefore power-cycle- or crash-safe.
You are quoting the non-normative informative part. If _POSIX_SYNCHRONIZED_IO is not defined, your fsync can literally be this and still be compliant:
Quick Google search (maybe someone with a MBP can confirm) says that macOS doesn't purport to implement SIO.
7 replies →
At least it is also implemented by Windows, which causes apt-get in a Hyper-V VM to be slower.
And it's also unbearably slow for loopback-device-backed Docker containers in the VM, due to the double layer of cache. I just add eatmydata happily, because you can't save a half-finished Docker image anyway.
> Also note that 3rd party drives are known to ignore F_FULLFSYNC
SQLite, MySQL et al. [1] fall back to `fsync()` if F_FULLFSYNC fails, in order to cover this case of 3rd party or external drives.
[1] https://twitter.com/TigerBeetleDB/status/1422855270716293123
OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it? I don't have one at hand but IIRC it did.
At least the OSX man page admits to the detail.
The rationale in the POSIX document for a null implementation seems reasonable (or at least plausible), but it does not really seem to apply to general OSX systems at all. So even if they didn't define _POSIX_SYNCHRONIZED_IO it would be against the spirit of the specification.
I'm actually curious why they made fsync do anything at all though.
> OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it?
Nope: https://opensource.apple.com/source/Libc/Libc-1439.40.11/inc...
6 replies →
OP appears to be giving useful information about OSX, regardless of what other OSes do.
The implication (in fact no, it's explicitly stated) is that this fsync() behaviour on OSX will be a surprise for developers working on cross-platform code or coming from other OSes, and will catch them out.
However, if in fact it's quite common for other OSes to exhibit the same or similar behaviour (BSD for example does this too, which makes sense as OSX has a lot of BSD lineage), that argument of least surprise falls a bit flat.
That's not to say this is good behaviour, I think Linux does this right, the real issue is the appalling performance for flushing writes.
The POSIX specification requires data to be on stable storage following fsync. Anything less is broken behavior.
An fsync that does not require the completion of an IO barrier before returning is inherently broken. This would be REQ_PREFLUSH inside Linux.
> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system.
> fsync() might or might not actually cause data to be written where it is safe from a power failure.
How are you reading POSIX as "saying no"??
From that page:
then:
The only reason to doubt the clarity of the above is that POSIX does not consider crashes and power failures to be in scope. It says so right in the quoted text.
Crashes and power failures are just not part of the POSIX worldview, so in POSIX there can be no need for sync(2) or fsync(2), or fcntl(2) w/ F_FULLFSYNC! Why even bother having those system calls? Why even bother having the spec refer to the concept at all?
Well, the reality is that some allowance must be made for crashes and power failures, and that includes some mechanism for flushing caches all the way to persistent storage. POSIX is a standard that some real-life operating systems aim to meet, but those operating systems have to deal with crashes and power failures because those things happen in real life, and because their users want the operating systems to handle those events as gracefully as possible. Some data loss is always inescapable, but data corruption would be very bad, which is why filesystems and applications try to do things like write-ahead logging and so on.
That is why sync(2), fsync(2), fdatasync(2), and F_FULLFSYNC exist. It's why they [well, some of them] existed in Unix, it's why they still exist in Unix derivatives, it's why they exist in Unix-alike systems, it's why they exist in Windows and other not-remotely-POSIX operating systems, and it's why they exist in POSIX.
If they must exist in POSIX, then we should read the quoted and linked page, and it is pretty clear: "transferred to the storage device" and "intended to force a physical write" can only mean... what that says.
It would be fairly outrageous for an operating system to say that since crashes and power failures are outside the scope of POSIX, the operating system will not provide any way to save data persistently other than to shut down!
> transferred to the storage device
MacOS does that.
> the fsync() function is intended to force a physical write of data from the buffer cache
If they define _POSIX_SYNCHRONIZED_IO, which they don't.
fsync wasn't defined as requiring a flush until version 5 of the spec. It was implemented in BSDs loooong before then. Apple introduced F_FULLFSYNC prior to fsync having that new definition.
I don't disagree with you, but it is what it is. History is a thing. Legacy support is a thing. Apple likely didn't want to change people's expectations of the behaviour on OSX; they have their own implementation after all (which is well documented, lots of portable software and libs actively use it, and it's built in to the higher-level APIs that Mac devs consume).
1 reply →
Do you mean on Linux that calling fsync might not actually flush to the drive?
How many hundreds of millions of people have used OSX over the years and never encountered any problems whatsoever?
This article is a non-issue, people just like to upvote Apple bashing.
If you need to run software/servers with any kind of data consistency/reliability on OS X this is definitely something you should be aware of and will be a footgun if you're used to Linux.
Macs in datacentres are becoming increasingly common for CI, MDM, etc.
12 replies →
"never encountered any problems whatsoever?"
And how do you know they didn't, did you do a poll?
How many people had random files disappear or get corrupted, or settings get reset, and probably thought they must have done something wrong?
Fantastic thread.
The history is also interesting. It's not that "macOS cheats", but that it sincerely inherited the status quo of many years, then tried to go further by adding F_FULLFSYNC. However, Linux since got better, leaving macOS stuck in the past and everybody surprised. It's a big problem.
Here's Dominic Giampaolo from Apple discussing this back in 2005, before Linux fixed fsync() to flush past the disk cache: https://lists.apple.com/archives/darwin-dev/2005/Feb/msg0008...
And here's TigerBeetle's Twitter thread with more of the history and how projects like LevelDB, SQLite and various language std libs were also affected: https://twitter.com/TigerBeetleDB/status/1422854779009654785
The docs [1] suggest that even F_FULLFSYNC might not be enough. Quote:
> Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.
[1] https://developer.apple.com/documentation/xcode/reducing-dis...
When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
Note that the man page for `F_FULLFSYNC` itself doesn't mention that it is not reliable: https://developer.apple.com/library/archive/documentation/Sy...
Having a separate syscall is annoying, but workable. Having a scenario where we call flush and cannot ensure that this is the case is BAD. Note that handling flush failures is expected, but all databases require that flushing successfully will make the data durable.
Without that, there is no way to ensure durable writes, and you might get data loss or data corruption.
I checked a few and they seem to use F_FULLFSYNC, except MySQL; they deleted it to make it run faster:
https://github.com/mysql/mysql-server/commit/3cb16e9c3879d17...
4 replies →
> Without that, there is no way to ensure durable writes, and you might get data loss or data corruption.
The best the OS can do is to trust the device that the data was, indeed, written to durable storage. Unfortunately, many devices lie about that. If you do an `F_FULLFSYNC`, you can say you did your best, but the data is out of your hands now.
3 replies →
> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
> Without that, there is no way to ensure durable writes, and you might get data loss or data corruption.
No, not without that. Even with that, you can't have durable writes; not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure: the only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).
As soon as you have two database servers, you're in much better shape, and many databases like to try and use fsync() as a barrier to that replication, but this is a waste of time because your chances of a single hardware failure remain the same -- the only thing that really matters is that 1/2 is smaller than 1/1.
So okay, maybe you're not trying to protect against all hardware failure, or even just the flash failure (it will fail when it fails! better to have two nvme boards than one!) but maybe just some failure -- like a power failure, but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability because that's not the failure you're trying to protect against.
What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows: The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days, maybe it made some sense to add a system call instead of adding more hardware, but modern systems aren't like that: It is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one, but maybe, if there's a case where fsync() helps today, it's a situation where that isn't true -- but even that is a long way from you need fsync() to have durable writes and avoid data loss or corruption.
13 replies →
"Silly wabbit, database trix are for servers!"
> The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on a Mac, which is also a surprise to me.
Yeah you can definitely write a transactional database without having to rely on knowing you've flushed data to disk. Not only can you, but you surely have to otherwise you risk data corruption e.g. when there's a power-cut mid-write.
1 reply →
Lol, but hey, macs are not servers, so "hahah who cares!".
In Apple defense, sloppy fsync behaviour is clearly documented: https://developer.apple.com/library/archive/documentation/Sy...
That's not defence. It fails the principle of least-surprise. If everyone's experience is that fsync is flushing then why would somebody think to look up the docs for Mac in case they do it differently?
>That's not defence. It fails the principle of least-surprise.
Only if the standard where anything else is a "surprise" is 2022 Linux.
Many (all?) other unices and macOS itself since forever work like that. Including Linux itself in the past [1]
[1] https://lwn.net/Articles/270891/
6 replies →
They do it according to POSIX spec. Linux is the oddball here.
11 replies →
> That's not defence. It fails the principle of least-surprise.
Welcome to C APIs in general, and POSIX in particular.
> why would somebody think to look up the docs
It seems reckless to me to not do this when you're interacting with the filesystem using low-level APIs (i.e not via Swift/Obj-C).
Linux only stopped doing the clearly wrong thing in 2008 or so iirc.
It is still dumb that there's a definition of fsync() that does not sync :-/
I’d argue maybe .5% of people are working on something where this is even close to being a concern. Those people probably know what they need to use.
Apple doesn’t need to defend anything.
1 reply →
As the article mentions, on laptops, this is pretty clever. On desktops though...
Perhaps real macs should be equipped with internal batteries to flush to disk in the case of power loss?
I think I heard some enterprise motherboards/controllers/computers did just that, given the upside in normal operation.
These machines are actually low-power enough that you could implement a last-gasp flush mechanism. The Mac Mini already survives 1-2 seconds without AC power (at least if idle). You could plausibly detect AC power being yanked and immediately power down all downstream USB/TB3 devices and the display (on iMacs), freeze all CPUs into idle, and have plenty enough reservoir cap to let NVMe issue a flush.
But they aren't doing that. I tested it on the Mac Mini. It loses several seconds of fsync()ed data on hard shutdown.
This does require a last-gasp indication from the PSU to the rest of the system, so if they don't have that, it's not something they could add in a firmware update.
I mean the ATX standard has this signal built in, so Apple could just copy it:
https://en.wikipedia.org/wiki/Power_good_signal
5 replies →
>But they aren't doing that. I tested it on the Mac Mini. It loses several seconds of fsync()ed data on hard shutdown.
That's unfortunate. My Mac Mini crashes every other night during sleep. I guess I'm going to have to shut it down to avoid any data corruption.
7 replies →
Even on laptops I feel uncomfortable. My macOS freezes or kernel panics on me from time to time.
I believe the NVMe driver has a kernel panic hook; I would hope it is used to issue a flush.
OTOH, if you have watchdog timeouts (I've seen this from bad drivers), those would certainly not give the kernel a chance to do that.
7 replies →
>Perhaps real macs should be equipped with internal batteries to flush to disk in the case of power loss?
Or just add a UPS?
Does disk gets flushed in case of kernel panic?
How exactly is this clever? Maybe on some toy, not on a workstation!
Hmm, as slow as that is, does the controller support VERIFY? Because there is FUA in VERIFY, which forces the range to flush as well, so it could be used as a range flush. Depending on how they implement the disk cache, it's possible that's faster than a full cache walk (which is likely what they are doing).
This is one of those things that SCSI was much better at, SYNC CACHE had a range option which could be used to flush say particular files/database tables/objects/whatever to nonvolatile storage. Of course out of the box Linux (and most other OSs) don't track their page/buffer caches closely enough to pull this off, so that fsync(fileno) is closer to sync(). So, few storage systems implemented it properly anyway.
The choice of ignoring flushes vaguely makes sense if you assume the Mac's SSD is in a laptop with a battery. In theory the disk cache is then non-volatile (this assumption is made on various enterprise storage arrays with battery backup as well, although frequently it's a controller setting). But I'm guessing someone just ignored the case of the Mac Mini without a battery.
I assumed the barrier was doing something like that, but marcan was able to inspect the actual NVMe commands issued and has confirmed that's not the case.
But that would be awesome, especially with these ever growing cache capacities.
Deferring flushes on the NVMe level could also corrupt a journaling FS itself, not just the contents of files written with proper fsync incantations.
Indeed, though that is somewhat rare. For our distro, I would opt to enable it by default on laptops (which is quite safe) and disable it on desktops.
APFS at least has metadata checksums to prevent that. However it does not do data checksums (weird decision...), despite being a CoW fs with snapshotting, similar to ZFS and btrfs.
They rely on the hardware storing checksums and on the protocols using checksums to prevent data corruption at all levels.
1 reply →
What confuses me about this is why they are so slow with F_FULLFSYNC, since that's the equivalent of what non-Apple NVMe drives do under, say, Linux, and they manage to be much faster.
The OS does not matter; it's strictly about the drive. macOS on a non-Apple SSD should be equally fast with F_FULLFSYNC.
Indeed, I would very much like to know what on earth the ANS firmware is doing on flushes to make them so hideously slow. We do have the firmware blobs (for both the NVMe/ANS side and the downstream S5C NAND device controllers), so if someone is bored enough they could try to reverse engineer it... it also seems there's a bunch of debug mode options, so maybe we can even get some logs at some point.
Drives are known to ignore that hint... That's why you should use vendor-approved hardware if such things matter to you.
Variants of the FSYNC story have been going on for decades now. The framing varies, but typically somebody is benchmarking IO (often in the context of database benchmarking) and discovers a curious variance by OS.
On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?
> On NVMes I wonder whether this really matters, but it's a serious issue on spinning disks: do you really need to flush everything to the disk (and interrupt more efficient access patterns)?
That depends on the drive having power loss protection, which comes most of the time in the form of a capacitor that powers the drive long enough to guarantee that its buffers are flushed to persistent storage.
Consumer SSDs often do not have that, so flushing is really important there, at least if your data, or avoiding FS corruption, matters to you.
Enterprise SSDs almost always have power loss protection, so there it isn't required for consistency’s sake, albeit in-flight data that didn't hit the block device yet is naturally not protected by that, most FS handle that fine by default though.
Note that Linux, for example, does by default a periodic flush every 30s independent of caching/flush settings, so that's normally the upper limit you'd lose, depending on the workload it can be still a relatively long time frame.
https://sysctl-explorer.net/vm/dirty_expire_centisecs/
Those VM tunables are about dirty OS cache, not dirty drive cache. If you fsync() a file on Linux it will be pushed to the drive and (if the drive does not have battery/capacitor-backed cache) flushed from drive cache to stable storage. If you don't fsync() then AIUI all bets are off, but in practice the drive will eventually get around to flushing your data anyway. The OS has one timeout for cache flushes and the drive should have another one, one would hope.
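For reference, the 30 s figure above comes from the `vm.dirty_expire_centisecs` tunable. A small sketch of reading it (falling back to the documented kernel default when not on Linux), keeping in mind this only governs dirty OS page cache, not the drive's own write cache:

```python
def dirty_expire_seconds(default_cs: int = 3000) -> float:
    """Return how long dirty page-cache data may sit before writeback.

    Reads /proc/sys/vm/dirty_expire_centisecs on Linux; falls back to
    the kernel default (3000 centiseconds = 30 s) elsewhere. This says
    nothing about data sitting in the drive's volatile cache.
    """
    try:
        with open("/proc/sys/vm/dirty_expire_centisecs") as f:
            cs = int(f.read().strip())
    except OSError:  # not Linux, or /proc unavailable
        cs = default_cs
    return cs / 100.0
```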
On this NVMe, flushing is slower than on some spinning disks, so it apparently matters.
Yes, I would have skipped the fsync thing, which carries a lot of baggage, and concentrate on this.
Btw, are you sure those spinning disks are actually flushing to rust? Caches all the way down... ;-)
Think about what's going on in the controller of any page-based SSD.
You have wear leveling trying to keep things from blowing holes in certain physical pages. In certain cell architectures you can only write to pages that have previously been erased. Once you do write the data to the silicon... it's not really written anyway, because the tables and data structures that map that to the virtual table the host sees on boot also have to be written.
It is entirely reasonable that a system that does 100k honest sustained write I/Os per second would be brought to its knees if you're insistent enough to actually want a full, real, power-cycle-proof sync.
To do an actual full sync, where it could come back from power off... requires flushing all of those layers. Nothing is optimized to do that. I'm amazed that it can happen 40 times per second.
It's possible that you could speed this up a bit, but somewhere there's an actual non-wear leveled single page of data that tells the drive how to remap things to be useful... I strongly suspect writing that page frequently would eat the drive life up in somewhere between 0.1 and 20 million cycles. After that point, the drive would be toast.
I agree with the other thread that actually flushing is likely to be a very, very well guarded bit of info.
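A crude way to probe the "flushes per second" figure yourself, sketched with plain fsync (on macOS you'd substitute fcntl's F_FULLFSYNC to measure a true cache flush). The path and loop count are arbitrary, and the number you get depends entirely on whether the drive honors flush commands:

```python
import os
import time

def flushes_per_second(path: str, n: int = 50) -> float:
    """Measure how many durable write+flush cycles a drive sustains.

    Rewrites one 4 KiB block and asks for durability after each write.
    On a drive that honestly flushes its cache, expect this to be
    orders of magnitude below the headline IOPS figure.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(n):
            os.pwrite(fd, b"x" * 4096, 0)  # rewrite the same block
            os.fsync(fd)                   # request durability each time
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n / elapsed
```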
This sounds like laptops are fine, but iMacs and Minis are effed.
Curious, what's the real world risk of full OS level corruption and not just data loss?
Good question. I just started up a loop doing USB-PD hard reboots on my MBA every 18 seconds (that's about one second into the desktop with autologin on, where it should still be doing stuff in the background). Let's see if it eats itself.
Famous last words
How can we get notified about your results?
Laptops are fine unless your battery has issues and you get occasional power losses, which seems to be not too uncommon for third-party batteries (which themselves are not too uncommon since Apple will charge you an arm and a leg to replace half your laptop if you have a defective battery).
Bad batteries generally allow for last-gasp handling, and I've definitely seen the SMC throw a fit on some properties a few seconds before shutdown due to the battery being really dead. Not sure if macOS handles this properly, but I'd hope it does, and if it doesn't they could certainly add the feature. It would be quite an extreme case to have a battery failure be so sudden the voltage doesn't drop slowly enough to invoke this.
Does anyone here run a desktop Mac without a battery backup device?
All of my Macs are either laptops or have a hardware backup device, so it's unlikely a write would be lost due to power failure (unless the backup device failed, which could happen).
Sure... the last power failure was about 4 years ago, and the one before that was also measured in multiple years.
Back when I still used a UPS down here, it was usually the UPS that died and triggered the power failure. So I stopped investing in a UPS.
Wait, why are iMacs and Minis affected more? (I read the twitter thread; I'm not seeing why.)
Laptops have batteries, so an AC power failure doesn't mean they immediately crash: they just keep running on battery until the battery gets low, at which point the system cleanly hibernates.
They're dependent on external power, which can acutely fail.
not battery powered
As a laptop user I would probably opt to make the same choice as Apple here. I like the idea mentioned to allow a tunable parameter to only allow ever losing 1 second of data.
Although, I also have the seemingly rare opinion here that ECC ram doesn't really matter on a laptop or desktop.
It's not only losing a couple seconds of data. Write ordering does not work, meaning journals don't. You get a possibility of silent data corruption.
But apple could quite easily fix write ordering
> only allow ever losing 1 second of data
For a database this means that every durable transaction commit may have to wait up to 1 second for the next flush; otherwise you can't guarantee durability.
You think it's okay that restarting your PC leads to data loss or corruption? That's basically a product killer for me. I reboot my laptop every day.
You presumably don't reboot your laptop by connecting a USB-PD gadget that issues a hard reset. A normal OS reboot is fine, that will flush the cache.
The most common situation where this would affect laptops, in my experience so far, would be a broken driver causing a kernel lockup (not a panic) which triggers a watchdog reboot. That situation wouldn't allow for an NVMe flush.
Shouldn't Mac OS issue flush on restart, as it does on sleep?
1) A normal restart doesn't have this issue, at all.
2) Why are you rebooting a laptop daily? My uptime on my MacBook Pro averages 30-60 days. There's zero reason to reboot any modern OS daily.
I wonder if you hit the drive hard enough, so that the cache gets filled, does the performance degrade by that same magnitude?
In my use. Yes. I didn’t realize this was the reason until I saw this thread, and now I’ve tested it. Luckily, I don’t do massive data transfers nor do I do any large data work. When I got my M1 Mac Mini, however, I did and had immediate buyer’s remorse. I thought that I/O must be terrible on this thing, and I felt cheated. After the initial stand-up, I wasn’t so angry. For most tasks, it’s faster than my old TR4 1950X.
There’s a reason why Apple uses Linux for its server infrastructure.
Sure, but I do not think it’s due to their feeling that their own software is inferior. I think much more of that is cost. They needn’t pay to develop yet another OS variant, and instead benefit off of the open source community and their past contributions to said community.
They wouldn't really need to develop a variant. Plenty of people used to run servers on macOS just configured to be headless. It just doesn't meet the standard anymore.
The funny thing here is that battery-backed enterprise systems are worse off in that manner, because you're much more likely to notice a dying battery that your entire device relies on than the little battery pack hooked up to your RAID array.
Sure, you could write a program that periodically checks the battery state (you'd have to poll, since there's no ACPI notification like with a "device battery") and sends an email to the admin or something. However, that's a tool that doesn't "exist" (as in, there isn't a notable program that does so), which possibly hints that this isn't something system admins often do.
The above also requires there to be an interface available from userland, not only in the management firmware or BIOS/UEFI. That exists for HP, but I'm not sure whether all other OEMs provide one.
To emulate a flushing SSD, the signal really needs to go directly to the SSD firmware so it can decide which is the last OS write it can accept while still having enough power to persist all write and flush requests it has already accepted.
Getting all that right sounds so hard that it is probably better to just have enterprise SSDs include a built-in supercap to give 5 seconds or so of power to do all the necessary flushing, and for laptop/desktop-grade SSDs to only offer barriers for data consistency. Laptop and desktop users don't care if they lose the last 1 second of data before a crash, as long as what is on the drive is self-consistent.
I should've been a little clearer; by "enterprise systems" I was referring to RAID controllers and the like. Though yes, I believe enterprise SSDs/NVMes likely have a capacitor or, as one friend put it, an "overkill battery" to use for flushing data.
To be fair though, I sidetracked from the discussion at hand. The issue Marcan described was regarding the OS -> Disk rather than a "power loss situation". The latter does play in with the former, but solving the latter doesn't necessarily solve the former.
Enterprise systems have monitoring through the BIOS, which will send an email, expose the status via SNMP, and offer other methods of monitoring (same as having a faulty fan).
Correct me if I'm wrong, but I wouldn't call the management engine (eg. HP iLO) the BIOS. Whilst those may support such warnings:
1) Not everyone wants to use iLO or whatever equivalent another OEM provides.
2) Whilst such systems do support sending warnings about system components via email, dashboards, etc. that doesn't mean they'll necessarily warn about a RAID controller's battery being depleted. If I remember correctly, iLO4 doesn't.
3) What about RAID cards like the P420 (*not* the P420i) that either aren't hooked up to a management engine or are from an entirely separate OEM?
External batteries can often be connected to via serial (most common), via USB, or via IP, so that is definitely an option.
That's the first time I've heard of batteries [for RAID controllers] having an entirely separate port than that which hooks them up to the controller. Is this a "there are some of X" or have I just been out of the loop?
The dirty secret about today's high density NAND is that tPROG is not fast. It's an order of magnitude slower than the heyday of SLC. Now that doesn't really matter for enterprise drives, they complete writes into very fast storage that is made durable one way or another (e.g., flush on power fail), and this small store gets streamed out to the NAND log asynchronously. This is why random single queue depth durable writes can actually be faster than reads on enterprise drives, because random reads have to come from NAND (tREAD is still very fast, just not as fast as writing to DRAM).
Apple may not implement such a durable cache, that's fine it's not an enterprise device and it's a cost tradeoff. So they might have to flush to NAND on any FUA, and that's slow as we've said, but not 25ms slow. Modern QLC NAND tPROG latency is more like 2.5ms-5ms, which could just about explain the EVO results when you include the OS and SATA stack and drive controller.
There's pretty close to 0% chance Apple would have messed this up accidentally though, in my opinion. It would have been a deliberate design choice for some reason. One possible reason that comes to mind is that some drives gang a bunch of chips in parallel and you end up with pretty big "logical" pages. Flushing a big logical page on a 4kB write is going to cause a lot of write amp and drive wear, so you might delay for a short period (20ms) to try to pick up other writes and reduce your inefficiency.
Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them while creating the same write rate. The firmware is doing something dumb when issued a flush command, it's not just sitting around and waiting.
> There's pretty close to 0% chance Apple would have messed this up accidentally though, in my opinion
There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.
And then they put this hardware into desktops. Oops :-)
Lots of things about the M1 were rushed and have been fixed along the way. I wouldn't be in the least bit surprised if this were one more of them that gets fixed a couple macOS versions down the line, now that I've made some noise about it.
> Nope, it's not a deliberate optimization / delay. Doing the flushes creates an extra ~10MB/s of DRAM memory traffic from the NVMe controller vs. not doing them while creating the same write rate.
How are you measuring that and how do you figure it means the NAND writes are not being held off? Clearly they are by one means or another.
> The firmware is doing something dumb when issued a flush command, it's not just sitting around and waiting.
> There's pretty close to 100% chance Apple would not have cared/optimized for this when designing this SSD controller, because it was designed for iOS devices which always have a battery, and where next to no software would be issuing flushes.
Yes. It is clear the hardware was never optimized for it. Because it is so slow. I'm almost certain that is a deliberate choice, and delaying the update is a possible reason for that choice. It's pretty clear the hardware can run this much faster, because it does when it's streaming data out.
The NAND, the controller, and the FTL just aren't rocket science such that you'd have hardware that can sustain the rates Apple's can and then, through some crazy unforeseen problem, this would suddenly go slow. Flushing data out of your cache into the log is the FTL's bread and butter. It doesn't suddenly become much more complicated when it's a synchronous flush rather than a capacity flush; it's the same hardware data and control paths, the same data structures in the FTL firmware, and it would use most of the same code paths even.
Pull blocks from the buffer in order and build pages, allocate pages in NAND to send them, update forward map, repeat.
I don't know what tPROG is (or anything else), is there a "What every programmer should know about storage" a la Drepper's work on memory?
tPROG is the time it takes to program a NAND page, from when you put the "program page" command on the pins to when you read off a successful status.
Some of the basic NAND guides Micron puts out are simple enough to pick up the basics of operation:
https://www.micron.com/-/media/client/global/documents/produ...
The details get very complicated and proprietary. NAND wears out as you use it. But it also has a retention time. It gradually loses charge and won't read back if you leave it unpowered for long enough. This is actually where enterprise drives can be speced worse than consumer. So durability / lifetime is specified as meeting specified uncorrected error rates at the given retention period. The physics of NAND are pretty interesting too and how it translates into how a controller optimizes these parameters. Temperature at various stages of operation and retention changes properties, time between erase and program does too. You can adjust voltages on read, program, erase, and those can help you read data out or change the profile of the data. Reading can disturb parts of other pages (similar to rowhammer). Multilevel cells are actually interesting some of them you program in passes so that's a whole other spanner in the works.
I don't know of a good place that covers all that, but much beyond "read/program/erase + wear + retention" is probably beyond "what every programmer should know".
The way you turn a bunch of NAND chips that have a "read/program/erase" programming model into something that has a read/write model (the flash translation layer or FTL) is a whole other thing again though. And all the endurance management and optimization, error correction... Pretty fascinating details really. The basic details though is that they use the same concepts as the "log structured filesystem", turns out a log structure with garbage collection is about a perfect it for turning the program/erase model into a random write model. That's probably what every programmer should know about that (assuming you know something about LSFs -- garbage collection, write amplification, forward and reverse mapping schemes, etc).
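The log-structured core described above, append writes to a log and keep a forward map from logical block to log position, can be sketched as a toy FTL. Everything here (class name, structures) is illustrative only, with wear leveling, garbage collection, and the reverse map omitted:

```python
class ToyFTL:
    """Toy flash translation layer: read/write model over an append-only log.

    Real FTLs add garbage collection of stale log entries, wear
    leveling, error correction, and a reverse map for GC; this only
    shows the "append to log, update forward map" core.
    """

    def __init__(self):
        self.log = []          # append-only "NAND log" of (lba, data) pages
        self.forward_map = {}  # logical block address -> index into the log

    def write(self, lba: int, data: bytes) -> None:
        self.log.append((lba, data))               # program a fresh page
        self.forward_map[lba] = len(self.log) - 1  # remap; old page is now garbage

    def read(self, lba: int) -> bytes:
        return self.log[self.forward_map[lba]][1]

ftl = ToyFTL()
ftl.write(7, b"old")
ftl.write(7, b"new")   # an overwrite is append + remap, never in-place
assert ftl.read(7) == b"new"
```

Note how an overwrite leaves the old page in the log as garbage; that's where the write amplification and GC pressure discussed above come from.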
> Apple may not implement such a durable cache, that's fine it's not an enterprise device and it's a cost tradeoff.
I disagree with this - my Apple is an enterprise device. It's a Macbook Pro, issued by my employer, to do real work. I wouldn't give Apple a pass on this dimension. I get that the "Pro" label doesn't mean what it used to, but these aren't toys either.
Slightly related: if a drive runs with a properly journaled, fully checksummed filesystem, for example zfs or btrfs - does the write-through mode guarantee that you can only lose new data and not corrupt the old?
ZFS is not journaled. CoW eliminates the need for anything like a journal with the exception of synchronous IO, where an intent log is used that can be replayed after a power loss event.
In any case, ZFS should be fine as long as REQ_PREFLUSH is working properly. You can read a little about that here:
https://github.com/openzfs/zfs/blob/453c63e9b74cea42d45e0bd3...
https://elixir.bootlin.com/linux/v4.18/source/include/linux/...
Found it kind of answered in the side thread: https://mobile.twitter.com/marcan42/status/14942278033275985...
In short - no, you'll still see corruption.
No, you won't see corruption on ZFS. Cutting power to the drive is always safe, you can slice a SATA cable with a guillotine if you want, you'll always see a consistent state of the filesystem. ZFS transactions are entirely atomic.
ZFS (and btrfs) is not "journaled", it's copy-on-write.
Filesystem corruption on ZFS would indicate that REQ_PREFLUSH is not being implemented correctly by either the hardware or the device driver.
For some here comparing and contrasting both documentation ambiguity and fsync behaviour between OSX and Linux, these two are probably useful:
"Linux Fsync Issue for Buffered IO and Its Preliminary Fix for PostgreSQL"
https://news.ycombinator.com/item?id=30131165
Wonder whether the AWS Mac EC2 instance types are affected too, anyone know?
You get an OS drive backed by EBS on those, through the AWS Nitro System.
As such, they share the same storage infrastructure as other EC2 instances.
Do you really think those machines are just plugged into the mains socket in Amazon's data centers?
Yes. And the cheap instances use shorter leads that technicians might trip over at any given moment.
> Of course, in normal usage, this is basically never an issue on laptops; given the right software hooks, they should never run out of power before the OS has a chance to issue a disk flush command
Unless of course your kernel panics. (Although there may still be a best-effort flush here, so it probably depends on how exactly it dies)
But since I only use Mac desktops.....
I guess a UPS powerbackup would be useful. Laptops basically have built-in UPS which is perhaps why Apple has gone in that direction. I wonder if their high-end desktops with Apple Silicon will do something different there.
Amusingly XNU adopted this behavior because it's what Linux did in the early 2000s and people complained that fsync was too slow without it.
That's interesting! I'd be fascinated to know what the underlying cause is – these full-sync numbers are amazingly low.
I dug a bit further and the NVMe controller is doing about 6.2MB/s of DRAM reads and 10MB/s of DRAM writes while doing a flush loop like this (which it isn't doing with the same traffic sans the flushes). I wonder if it's doing something dumb like linear scanning a cache hash table to find things to flush... or maybe something with bad cache locality?
I'm pretty sure, whatever it is, Apple could fix it in a firmware update.
> fsync() will both flush writes to the drive, and ask it to flush its write cache to stable storage.
Can someone explain what "flushing write cache to stable storage" means? Isn't that the same as "writes to the drive". I am obviously not well versed in this area. Also what is stable storage? Never heard that term before.
SSDs and other storage drives have two (or more) layers. The last layer is stable storage (= when you disconnect power, no data is lost or corrupted). When you write to such a device, your writes first land in an earlier layer that is more like your computer's main memory than actual storage (when you lose power, the data there is gone or corrupted). Only after some time, or when the cache is full, is an actual persistent write made.
Interesting post, but oh my god twitter is a garbage platform for what should be a blog post.
It's still not clear why the Apple SSD is so slow. Surely there's more to it. Maybe other SSDs are cheating in firmware? Or maybe it's just a bug in Apple's firmware? I'm really interested to see if there will be follow-ups on Apple's side.
Since this design is inherited from iDevices, my guess is they never bothered to optimize this command since software on a battery-powered device would almost never need to issue it. It should be something they can improve in firmware.
From my understanding, the thing that’s slow is writing data to “permanent storage” (aka the layer under all the caching).
Some storage tech is just slow at that, and manufacturers muddy the water by rating some (SSDs|Micro SDs|whatever) in GB/s overall when much of those big numbers are a combination of caches and trickery.
I would not be surprised if Apple is using a tech that just has slow write speeds in trade for fast read speeds since most Apple users will be happy with faster read speeds.
Maybe it allows designing a device with lower power consumption.
I'm not here to defend Apple, but if you have a desktop and you don't want to lose data, then get a UPS. Proper write handling on the disk won't help if you haven't saved your doc in ten minutes.
I think most SSDs have a DRAM cache on board. Could the design issue here be that Apple doesn't have that and instead uses system RAM as the SSD's DRAM cache?
They do use system RAM as cache, but that has no effect on performance. If anything it should be way faster than the puny RAM cache chips on typical SSDs. It doesn't explain the slow flush perf.
AFAIK, if you fsync an SSD with a DRAM cache, the write won't hit the NAND cells. Those SSDs do have some way to flush before they lose juice, though.
That said, this was for enterprise SSDs a few years back.
Is it possible to quantify how likely you are to hit a data integrity issue because of this?
Maybe Apple NVMe drives have some sort of short-term battery/capacitor that gives them time to finish the write once power is lost?
This is such an ugly hack...
Honestly I don’t know. The order-of-magnitude performance difference in deferring the flush feels worth it to me if the risk is mitigated to sudden power loss.
I would think that when the last of Apple's hardware moves to ARM, they'll include enough onboard battery across form factors to ensure the flushes happen reliably even if there's a power cut.
If anything, now that the reason for the performance difference has been identified, I’d hope to see numbers for Linux and Windows storage access come up to par with these numbers as they go down this road too (e.g. via the NVME flush toggle mentioned in the article).
Yeah. If the same thing happened on a brandless garbage SSD you purchased from AliExpress, it would clearly be a cheat, plain malice and incompetence, but the Apple tag certainly makes us believe there must be a second reason.
Trading correctness for performance in storage, without shouting "YOUR DATA IS NOT SAFE WHEN YOU DO THIS" at the users multiple times a day, is benchmark snake oil. Period.
>If anything, now that the reason for the performance difference has been identified
That's not the reason for the M1 performance differences (as a CPU).
Just for the disk writing (which isn't the fastest around to begin with anyway).
tl;dr OSX handles fsync() the way Linux used to, by not flushing to hardware
They can easily solve their problem by replacing the SSD; then fsync will run at normal speed.
Apple doesn’t care about on-disk data integrity.
They just direct people to use Time Machine or iCloud, then look quizzically at you when you have an issue with writing off lost hours of work as a cost of doing business.
Click-bait title again.
“You can lose some of your file changes in case of hard-reboot” is more correct.
I've always taken this as a given, and I can tolerate some data loss if power is accidentally cut to my desktop, or if the OS panics (which happens about once a year for me).
If this is a price for a 1000x speed increase - I’m more than happy they have implemented it this way.
You can lose some file changes even after asking the OS to make sure they don't get lost, the normal way.
That's a problem. It means e.g. transactional databases (which cannot afford to lose data like that) have a huge performance hit on these machines, since they have to use F_FULLFSYNC. And since that "no really, save my data" feature is not the standard fsync(), it means any portable software compiled for Linux will be safe, but will be unsafe on macOS, by default. That is a significant gotcha.
The question is why do other NVMe manufacturers not have such a performance penalty? 10x is fine; 1000x is not. This is something Apple should fix. It's a firmware problem.
No, it’s not a problem, it is expected. If you are running a transactional database on your desktop - at least add a UPS to your system.
The problem isn't that the default case is unsafe -- the problem is that the safe case is so extremely slow.
I guess I am old but the assumption I live by is that if power is suddenly cut from a computer - no matter desktop or laptop - it can damage the FS and/or cause data loss.
For any mission critical stuff, I have it behind a UPS.
At least your thinking is old. Modern filesystems and databases are designed to prevent data loss in that scenario.
The last time I saw a modern filesystem eat itself on sudden power loss was when I was evaluating btrfs in a datacenter setting, and that absolutely told me it was not a reliable FS and we went with something else. I've never seen it happen with ext4 or XFS (configured properly) in over a decade, assuming the underlying storage is well-behaved.
OTOH, I've seen cases of e.g. data in files being replaced by zeroes and applications crashing due to that (it's pretty common that zsh complains about .zsh_history being corrupted after a crash due to a trailing block of zeroes). This happens when filesystems are mounted with metadata journaling but no data journaling. If you use data journaling (or a filesystem designed to inherently avoid this, e.g. COW cases), that situation can't happen either. Most databases would be designed to gracefully handle this kind of situation without requiring systemwide data journaling though. That's a tradeoff that is available to the user depending on their specific use case and whether the applications are designed with that in mind or not.
Modern filesystems are designed so that you can unplug the hard drive and it will not be in a corrupted state.
UPSes can and do fail.
No, modern filesystems aren't expected to be corrupted by sudden power loss, and "put it behind a ups" assumes that it's impossible for a UPS to fail.
I've been using Macs (both desktop and laptops) since I have memory. I've had the M1 since launch day, and I use it all day, both for work and personal use.
Why has this never happened to me? Why don't I know anyone who has had this problem? Why is nobody complaining, as happened with the previous-gen keyboards?
I think we might be missing something in this analysis. I don't think Apple engineers are idiots.
Most people don't unplug their Mac Mini in the middle of working, and most users who do lose data after that happens would just think it's normal, not realizing there is an underlying problem and that modern OSes aren't supposed to do that.
I've seen APFS filesystems eat themselves in production (and had to do data recovery), twice. Apple don't have a perfect data integrity track record.
On a laptop, you would only get data loss / corruption on sudden power loss, which is rare. With "flush to the storage device's RAM", even a kernel panic would not lose data, as long as the storage device can flush to flash without losing power.