Comment by supermatt

4 years ago

This F_FULLFSYNC behaviour has been like this on OSX for as long as I can remember. It is a hint to ensures that the data in the write buffer has been flushed to stable storage - this is historically a limitation of fsync that is being accounted for - are you 1000% sure it does as you expect on other OSes?

POSIX spec says no: https://pubs.opengroup.org/onlinepubs/9699919799/functions/f...

Maybe unrealistic expectation for all OSes to behave like linux.

Maybe linux fsync is more like F_BARRIERFSYNC than F_FULLFSYNC. You can retry with those for your benchmarks.

Also note that 3rd party drives are known to ignore F_FULLFSYNC, which is why there is an approved list of drives for mac pros. This could explain why you are seeing different figures if you are supplying F_FULLFSYNC in your benchmarks using those 3rd party drives.

106 comments

supermatt

marcan_42 4 years ago

Yes. fsync() on Linux pushes down to stable storage, not just drive cache.

OpenBSD, though, apparently behaves like macOS. I'm not sure I like that.

monocasa 4 years ago
Last time I checked (which is a while at this point, pre SSD) nearly all consumer drives and even most enterprise drives would lie in response to commands to flush the drive cache. Working on a storage appliance at the time, the specifics of a major drive manufacturer's secret SCSI vendor page knock to actually flush their cache was one of the things on their deepest NDAs. Apparently ignoring cache flushing was so ubiquitous that any drive manufacturer looking to have correct semantics would take a beating in benchmarks and lose marketshare. : \
So, as of about 2014, any difference here not being backed by per manufacturer secret knocks or NDAed, one-off drive firmware was just a magic show, with perhaps Linux at least being able to say "hey, at least the kernel tried and it's not our fault". The cynic in me thinks that the BSDs continuing to define fsync() as only hitting the drive cache is to keep a semantically clean pathway for "actually flush" for storage appliance vendors to stick on the side of their kernels that they can't upstream because of the NDAs. A sort of dotted line around missing functionality that is obvious 'if you know to look for it'.
It wouldn't surprise me at all if Apple's NVME controller is the only drive you can easily put your hands on that actually does the correct things on flush, since they're pretty much the only ones without the perverse market pressure to intentionally not implement it correctly.
Since this is getting updoots: Sort of in defense of the drive manufacturers (or at least stating one of the defenses I heard), they try to spec out the capacitance on the drive so that when the controller gets a power loss NMI, they generally have enough time to flush then. That always seemed like a stretch for spinning rust (the drive motor itself was quite a chonker in the watt/ms range being talked about particularly considering seeks are in the 100ms range to start with, but also they have pretty big electrolytic caps on spinning rust so maybe they can go longer?), but this might be less of a white lie for SSDs. If they can stay up for 200ms after power loss, I can maybe see them being able to flush cache. Gods help those HMB drives though, I don't know how you'd guarantee access to the host memory used for cache on power loss without a full system approach to what power loss looks like.
- garaetjjte 4 years ago
  
  Flush with other vendors at least does something as they block for some time too, just not as long as Apple.
  Apple implementation is weird because actual amount of data written doesn't seem to affect flush time.
  
  4 replies →
- SamReidHughes 4 years ago
  
  On my benchmarking of some consumer HDD's, back in 2013 or so, the flush time was always what you'd expect based on the drive's RPM. I got no evidence the drive was lying to me. These were all 2.5" drives.
  My understanding was, the capacitor thing on HDD's is to ensure it completely writes out a whole sector, so it passes the checksum. I only heard the flush cache thing with respect to enterprise SSD's. But I haven't been staying on top of things.
  
  7 replies →
supermatt 4 years ago
So its basically implementation specific, and macOS has its own way of handling it.
That doesnt make it worse - in fact it permits the flexibility you are now struggling with.
edit: downvotes for truth? nice. go read the posix spec then come back and remove your downvotes...
- dathinab 4 years ago
  
  Probably more like downvoted because missing the point.
  Sure fsync allows that behavior, but also it's so widely misunderstood that a lot of programs which should do a "full" flush only do a fsync, including Benchmarks. In which case they are not comparable and doing so is cheating.
  But that's not the point!
  The point is that with the M1 Macs SSDs the performance with fully flushing to disk is abysmal bad.
  And as such any application with cares for data integrity and does a full flush can expect noticable performance degradation.
  The fact that Apple neither forces frequent full syncs or at least full syncs when a Application is closed doesn't make it better.
  Though it is also not surprising as it's not the first time Apple set things up under the assumption their hardware is unfailable.
  And maybe for a desktop focused high end designs where most devices sold are battery powered that is a reasonable design choice.
  
  20 replies →
- marcan_42 4 years ago
  
  What is worse is their NVMe controller having 50x worse flush performance than the competition.
  
  11 replies →
- rsync 4 years ago
  
  I downvoted you because you complained about your downvotes.
  
  1 reply →
ribit 4 years ago

Something that is not quite clear to me yet (I did read the discussion below, thank you Hector for indulging us, very informative): isn't the end behaviour up to the drive controller? That is, how can we be sure that Linux actually does push to the storage or is it possible that the controller cheats? For example, you mention the USB drive test on a Mac — how can we know that the USB stick controller actually does the full flush?
Regardless, I certainly agree that the performance hit seems excessive. Hopefully it's just an algorithm, issue and Apple can fix this with a software update.
shellac 4 years ago
*BSDs mostly followed this semantic, as I recall. Probably inherited from a common ancestor.
- mrjin 4 years ago
  
  MacOS was really just FreeBSD with a fancier UI. Not sure what is the behavior now, but I'm pretty sure FreeBSD behaved almost exactly the same as a power loss rendered my system unbootable over 10 years ago.
  
  7 replies →
olliej 4 years ago

Linux does that now. It didn't in the past (something like 2008), and I recall many people arguing about performance or similar at that time :D
CyberRabbi 4 years ago

I like that. Fsync() was designed with the block cache in mind. IMO how the underlying hardware handles durability is its own business. I think a hack to issue a “full fsync” when battery is below some threshold is a good compromise.

otterley 4 years ago

It's important to read the entire document including the notes, which informs the reader of a pretty clear intent (emphasis mine):

> The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.

This seems consistent with user expectations - fsync() completion should mean data is fully recorded and therefore power-cycle- or crash-safe.

formerly_proven 4 years ago
You are quoting the non-normative informative part. If _POSIX_SYNCHRONIZED_IO is not defined, your fsync can literally be this and still be compliant:
int fsync(int) {}
Quick Google search (maybe someone with a MBP can confirm) says that macOS doesn't purport to implement SIO.
- otterley 4 years ago
  
  That particular implementation seems inconsistent with the following requirement:
  > The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes.
  If I wrote that requirement in a classroom programming assignment and you presented me with that code, you'd get a failing grade. Similarly, if I were a product manager and put that in the spec and you submitted the above code, it wouldn't be merged.
  > You are quoting the non-normative informative part
  Indeed, I am! It is important. Context matters, both in law and in programming. As a legal analogy, if you study Supreme Court rulings, you will find that in addition to examining the text of legislation or regulatory rules, the court frequently looks to legislative history, including Congressional findings and statements by regulators and legislators in order to figure out how to best interpret the law - especially when the text is ambiguous.
  
  5 replies →
- cryptonector 4 years ago
  
  Since crashes and power failures are out of scope for POSIX, even F_FULLSYNC's behavior description would of necessity be informative rather than normative.
  But, the reality is that all operating systems provide some way to make writes to persistent storage complete, and to wait for them. All of them. It doesn't matter what POSIX says, or that it leaves crashes and power failure out of scope.
  POSIX's model is not a get-out-of-jail-free card for actual operating systems.

mmis1000 4 years ago

At least it is also implemented by windows, which cause apt-get in hyperv vm slower

And also unbearable slow for loopback device backed docker container in the vm due to double layer of cache. I just add eat-my-data happily because you can't save a half finished docker image anyway.

_vvhw 4 years ago

> Also note that 3rd party drives are known to ignore F_FULLFSYNC

SQLite, MySQL et al. [1] fall back to `fsync()` if F_FULLFSYNC fails, in order to cover this case of 3rd party or external drives.

[1] https://twitter.com/TigerBeetleDB/status/1422855270716293123

throwawaylinux 4 years ago

OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it? I don't have one at hand but IIRC it did.

At least the OSX man page admits to the detail.

The rationale in the POSIX document for a null implementation seems reasonable (or at least plausible), but it does not really seem to apply to general OSX systems at all. So even if they didn't define _POSIX_SYNCHRONIZED_IO it would be against the spirit of the specification.

I'm actually curious why they made fsync do anything at all though.

supermatt 4 years ago
> OSX defines _POSIX_SYNCHRONIZED_IO though, doesn't it?
Nope: https://opensource.apple.com/source/Libc/Libc-1439.40.11/inc...
- throwawaylinux 4 years ago
  
  > #define _POSIX_SYNCHRONIZED_IO (-1) /* [SIO] */
  
  5 replies →

mannykannot 4 years ago

OP appears to be giving useful information about OSX, regardless of what other OSes do.

simonh 4 years ago

The implication (in fact no, it's explicitly stated) is that this fsync() behaviour on OSX will be a surprise for developers working on cross platform code or coming from other OS's and will catch them out.
However if in fact it's quite common for other OS's to exhibit the same or similar behaviour (BSD for example does this too, which makes sense as OSX has a lot of BSD lineage), that argument of least surprise falls a bit flat.
That's not to say this is good behaviour, I think Linux does this right, the real issue is the appalling performance for flushing writes.

ryao 4 years ago

The POSIX specification requires data to be on stable storage following fsync. Anything less is broken behavior.

An fsync that does not require the completion of an IO barrier before returning is inherently broken. This would be REQ_PREFLUSH inside Linux.

clysm 4 years ago

> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system.
> fsync() might or might not actually cause data to be written where it is safe from a power failure.

cryptonector 4 years ago

How are you reading POSIX as "saying no"??

From that page:

  The fsync() function shall request that all data for
  the open file descriptor named by fildes is to be
  transferred to the storage device associated with the
  file described by fildes. The nature of the transfer
  is implementation-defined. The fsync() function shall
  not return until the system has completed that action
  or until an error is detected.

then:

  The fsync() function is intended to force a physical
  write of data from the buffer cache, and to assure
  that after a system crash or other failure that all
  data up to the time of the fsync() call is recorded
  on the disk. Since the concepts of "buffer cache",
  "system crash", "physical write", and "non-volatile
  storage" are not defined here, the wording has to be
  more abstract.

The only reason to doubt the clarity of the above is that POSIX does not consider crashes and power failures to be in scope. It says so right in the quoted text.

Crashes and power failures are just not part of the POSIX worldview, so in POSIX there can be no need for sync(2) or fsync(2), or fcntl(2) w/ F_FULLFSYNC! Why even bother having those system calls? Why even bother having the spec refer to the concept at all?

Well, the reality is that some allowance must be made for crashes and power failures, and that includes some mechanism for flushing caches all the way to persistent storage. POSIX is a standard that some real-life operating systems aim to meet, but those operating systems have to deal with crashes and power failures because those things happen in real life, and because their users want the operating systems to handle those events as gracefully as possible. Some data loss is always inescapable, but data corruption would be very bad, which is why filesystems and applications try to do things like write-ahead logging and so on.

That is why sync(2), fsync(2), fdatasync(2), and F_FULLFSYNC exist. It's why they [well, some of them] existed in Unix, it's why they still exist in Unix derivatives, it's why they exist in Unix-alike systems, it's why they exist in Windows and other not-remotely-POSIX operating systems, and it's why they exist in POSIX.

If they must exist in POSIX, then we should read the quoted and linked page, and it is pretty clear: "transferred to the storage device" and "intended to force a physical write" can only mean... what that says.

It would be fairly outrageous for an operating system to say that since crashes and power failures are outside the scope of POSIX, the operating system will not provide any way to save data persistently other than to shut down!

supermatt 4 years ago
> transferred to the storage device
MacOS does that.
> the fsync() function is intended to force a physical write of data from the buffer cache
If they define _POSIX_SYNCHRONIZED_IO, which they dont.
fsync wasnt defined as requiring a flush until version 5 of the spec. It was implemented in BSDs loooong before then. Apple introduced F_FULLFSYNC prior to fsync having that new definition.
I dont disagree with you, but it is what it is. History is a thing. Legacy support is a thing. Apple likely didnt want to change peoples expectations of the behaviour on OSX - they have their own implementation after all (which is well documented, lots of portable software and libs actively uses it, and its built in to the higher level APIs that Mac devs consume).
- cryptonector 4 years ago
  
  > > transferred to the storage device
  > MacOS does that.
  Depends on the definition of "storage device", I guess. If it's physical media, then OS X doesn't. If it's the controller, then OS X does. But since the intent is to have the data reach persistent storage, it has to be the physical media.
  My guess is that since people know all of this, they'll just keep working around it as they already do. Newbies to OS X development will get bitten unless they know what to look for.

acchow 4 years ago

Do you mean on Linux that calling fsync might not actually flush to the drive?

ecf 4 years ago

How many hundreds of millions of people use OSX over the years and never encountered any problems whatsoever?

This article is a non-issue, people just like to upvote Apple bashing.

meibo 4 years ago
If you need to run software/servers with any kind of data consistency/reliability on OS X this is definitely something you should be aware of and will be a footgun if you're used to Linux.
Macs in datacentres are becoming increasingly common for CI, MDM, etc.
- parkingrift 4 years ago
  
  I’d rather solve for redundant power than worry about this. It’s really only critical if you’re running a database. Who runs a database on macOS?
  
  9 replies →
- native_samples 4 years ago
  
  I've used Macs for years and never been aware of it.
  Note: the tweeter couldn't provoke actual problems under any sort of normal usage. To make data loss show up he had to use weird USB hacks. If you know you have a battery and can forcibly shut down the machine 'cleanly' it's not really clear what the need for a hard fsync is.
  "Macs in datacentres are becoming increasingly common for CI, MDM, etc."
  CI machines are the definition of disposable data. Nobody is running Oracle on macOS and Apple don't care about that market.
- shadowgovt 4 years ago
  
  These days, best practice for data consistency / reliability in that environment, IIUC, is to write to multiple redundant shards and checksum, not to assume any particular field-pattern spat at the hard drive will make for a reliability guarantee.
ClumsyPilot 4 years ago

"never encountered any problems whatsoever?"
And how do you know they didn't, did you do a poll?
How many people had random files dissapear or get corrupted or settings get reset and probzbly thought they must have done something wrong?