Comment by ayende

4 years ago

When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.

Note that the man page for `F_FULLFSYNC` itself doesn't mention that it is not reliable: https://developer.apple.com/library/archive/documentation/Sy...

Having a separate syscall is annoying, but workable. Having a scenario where we call flush and cannot be sure that the data actually reached durable storage is BAD. Note that handling flush failures is expected, but every database relies on a successful flush making the data durable.

Without that, there is no way to ensure durable writes and you might get data loss or data corruption.

I checked a few and they seem to do F_FULLFSYNC (sic), except MySQL, which removed it to make it run faster:

https://github.com/mysql/mysql-server/commit/3cb16e9c3879d17...
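For readers who haven't met the distinction, here is a minimal sketch in C of what "do F_FULLFSYNC" usually looks like in portable code. The helper name `durable_flush` is hypothetical, and falling back to `fsync()` when `F_FULLFSYNC` fails is an assumption about reasonable behavior, not a description of what any particular database does.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask for the strongest flush the platform offers.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
static int durable_flush(int fd)
{
#ifdef F_FULLFSYNC
    /* macOS: ask the drive to flush its own cache as well. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Assumption: if F_FULLFSYNC is rejected, fall back to plain fsync(). */
#endif
    return fsync(fd);
}
```

The caller still has to check the return value. A flush that reports failure is something a database can handle; a flush that reports success without making the data durable is the case being complained about here.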

  • Wow. Could this explain why we have a lot of problems with MySQL running on Mac OS with the databases randomly getting totally corrupted and basically needing to be restored from backup each time?

    At first glance, it seems to make sense - if the machine shuts down while data is still unflushed because MySQL has only tried an fsync(), it could leave the files on disk in a weird state when the power is cut. Am I missing something?

  • "the possible durability gain is slim to none. This also makes OS X behave similar to other platforms."

    You didn't report the full reasoning.

    • Maybe that's right, maybe it's not - impossible to tell from the snippet. I'm deeply suspicious when they start citing performance numbers on what is ultimately an ordering change though.

> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.

The best the OS can do is to trust the device that the data was, indeed, written to durable storage. Unfortunately, many devices lie about that. If you do an `F_FULLFSYNC`, you can say you did your best, but the data is out of your hands now.

  • You can always reset the device and read back the data to confirm.

    Sure, that will be slow, but there is a way!

    • Not sure. They can still cheat. You'd need to power them down, then back up again. If it's a soft reset, they can just read it from RAM.

> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.

> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.

No, not without that. Even with that, you can't have durable writes, not on a Mac, or Linux, or anywhere else. If durability is what you are worried about, fsync() and fcntl() with F_FULLFSYNC do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).

As soon as you have two database servers, you're in much better shape. Many databases like to try and use fsync() as a barrier to that replication, but this is a waste of time because your chances of a single hardware failure remain the same -- the only thing that really matters is that 1/2 is smaller than 1/1.

So okay, maybe you're not trying to protect against all hardware failure, or even just the flash failure (it will fail when it fails! better to have two nvme boards than one!) but maybe just some failure -- like a power failure, but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability because that's not the failure you're trying to protect against.

What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows. The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days maybe it made some sense to add a system call instead of adding more hardware. Modern systems aren't like that: it is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one. Maybe, if there's a case where fsync() helps today, it's a situation where that isn't true -- but even that is a long way from "you need fsync() to have durable writes and avoid data loss or corruption".

  • > No, not without that. Even with that, you can't have durable writes; Not on a Mac, or Linux or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC because they do nothing to protect against hardware failure: The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).

    "The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.

    Of course fsync ensures durable writes on systems like Linux with drives that honor FUA. The reliability of the device and stack in question is implied in this and anybody who talks about data integrity understands that. This is how you can calculate and manage error rates of your system.

    • > "The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.

      I think most people understand that there is a huge difference between the sun exploding and a single hardware failure.

      If you really don't understand that, I have no idea what to say.

      > Of course fsync ensures durable writes on systems like Linux with drives that honor FUA

      No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.

  • "but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability"

    I believe drives that do have capacitors are aware of it and return immediately from fsync() without writing to flash. That's the point of this API.

    Since neither Macs nor any other laptops have SSDs with capacitors, this point is kind of moot.

> The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.

Yeah, you can definitely write a transactional database without having to rely on knowing you've flushed data to disk. Not only can you, but you surely have to, otherwise you risk data corruption, e.g. when there's a power cut mid-write.

  • The whole point of transactional flush to disk is that you get confirmation that data is now safe from power loss. You don't get any guarantee because you _called_ flush. The guarantee comes from flush returning.
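The two positions in this sub-thread can be put side by side in a rough C sketch: a record carries a checksum so recovery can detect a write torn by a power cut, and the commit is only reported as successful once the flush call has returned success. Everything here is hypothetical and for illustration only: the names `rec_header`, `checksum`, `commit_record` and `record_is_valid`, the record layout, and the FNV-1a hash standing in for a real CRC. Short writes are simply treated as failures.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical on-disk layout: header (length + checksum), then payload. */
struct rec_header {
    uint32_t len;
    uint32_t sum;
};

/* FNV-1a, used here only as a stand-in for a real CRC. */
static uint32_t checksum(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 0 only if the record was written AND the flush returned success.
 * The caller must not acknowledge the transaction on any other result. */
static int commit_record(int fd, const void *payload, uint32_t len)
{
    struct rec_header h = { len, checksum(payload, len) };
    if (write(fd, &h, sizeof h) != (ssize_t)sizeof h)
        return -1;
    if (write(fd, payload, len) != (ssize_t)len)
        return -1;
#ifdef F_FULLFSYNC
    /* macOS: ask the drive to flush its cache too. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
#endif
    return fsync(fd);
}

/* Recovery check: returns 1 if the record at the current file offset is
 * complete and its checksum matches, 0 if it is truncated, torn or corrupt. */
static int record_is_valid(int fd, void *buf, uint32_t cap)
{
    struct rec_header h;
    if (read(fd, &h, sizeof h) != (ssize_t)sizeof h)
        return 0;
    if (h.len > cap)
        return 0;
    if (read(fd, buf, h.len) != (ssize_t)h.len)
        return 0;
    return checksum(buf, h.len) == h.sum;
}
```

The checksum handles the "power cut mid-write" case by letting recovery discard a half-written record. It does nothing for a record that was acknowledged but never reached the medium, which is why the return value of the flush is the part the guarantee hangs on.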