Comment by rossmohax
4 years ago
Docs [1] suggests that even F_FULLFSYNC might not be enough. Quote:
> Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.
[1] https://developer.apple.com/documentation/xcode/reducing-dis...
When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.
Note that the man page for `F_FULLFSYNC` itself doesn't mention that it is not reliable: https://developer.apple.com/library/archive/documentation/Sy...
Having a separate syscall is annoying, but workable. Having a scenario where we call flush and cannot be sure the data actually reached the disk is BAD. Note that handling flush failures is expected, but all databases require that a successful flush makes the data durable.
Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
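For reference, a minimal sketch of what "call flush, then ask for the full sync" can look like from Python. The helper name `durable_flush` is made up for illustration; `fcntl.F_FULLFSYNC` is only defined on macOS, so the sketch assumes a plain `fsync()` is the strongest request available elsewhere. Per the Apple note quoted above, even the macOS path is best-effort.

```python
import fcntl
import os


def durable_flush(f):
    """Push f's buffered data toward stable storage; raises OSError on failure."""
    f.flush()  # user-space buffer -> kernel page cache
    fd = f.fileno()
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)  # present on macOS only
    if full_fsync is not None:
        try:
            # Ask the drive to flush its own write cache as well.
            fcntl.fcntl(fd, full_fsync)
            return
        except OSError:
            # Assumption: some filesystems reject F_FULLFSYNC, so fall back.
            pass
    os.fsync(fd)  # strongest portable request on other platforms
```

Usage would be something like `f.write(data); durable_flush(f)`, treating any `OSError` as "this write is not known to be durable".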
I checked a few and they seem to use F_FULLFSYNC, except MySQL, which removed it to make it run faster:
https://github.com/mysql/mysql-server/commit/3cb16e9c3879d17...
Oh MySQL, in a world turned upside down, you are my North Star.
Wow. Could this explain why we have a lot of problems with MySQL running on Mac OS with the databases randomly getting totally corrupted and basically needing to be restored from backup each time?
At first glance, it seems to make sense: if someone shuts down while data that MySQL has only fsync()'d is still sitting in a volatile cache, it could leave the files on disk in a weird state when the power is cut. Am I missing something?
"the possible durability gain is slim to none. This also makes OS X behave similar to other platforms."
You didn't report the full reasoning.
> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
The best the OS can do is to trust the device that the data was, indeed, written to durable storage. Unfortunately, many devices lie about that. If you issue an `F_FULLFSYNC`, you can say you did your best, but the data is out of your hands now.
You can always reset the device and read back the data to confirm.
Sure, that will be slow, but there is a way!
> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.
> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.
No, not without that. Even with that, you can't have durable writes, not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).
As soon as you have two database servers, you're in much better shape, and many databases like to try and use fsync() as a barrier to that replication, but this is a waste of time because your chances of a single hardware failure remain the same -- the only thing that really matters is that 1/2 is smaller than 1/1.
So okay, maybe you're not trying to protect against all hardware failure, or even just the flash failure (it will fail when it fails! better to have two NVMe boards than one!) but maybe just some failure -- like a power failure, but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability because that's not the failure you're trying to protect against.
What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows. The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days maybe it made some sense to add a system call instead of adding more hardware. Modern systems aren't like that: it is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one. Maybe, if there's a case where fsync() helps today, it's a situation where that isn't true, but even that is a long way from "you need fsync() to have durable writes and avoid data loss or corruption".
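To put rough numbers on the "1/2 is smaller than 1/1" remark, here is a back-of-the-envelope sketch. It assumes each machine independently loses its copy with probability `p` over some window; the independence, the value of `p`, and the function name `all_copies_lost` are illustration-only assumptions, not anything stated in the thread.

```python
def all_copies_lost(p: float, replicas: int) -> float:
    """Probability that every replica loses its copy, assuming independence."""
    if not 0.0 <= p <= 1.0 or replicas < 1:
        raise ValueError("need 0 <= p <= 1 and replicas >= 1")
    return p ** replicas


# Compare one, two, and three copies for a hypothetical 1% loss probability.
for n in (1, 2, 3):
    print(n, all_copies_lost(0.01, n))
```

Correlated failures (same rack, same power feed, same firmware bug) break the independence assumption, which is presumably why the data may need to go "quite far".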
> No, not without that. Even with that, you can't have durable writes, not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).
"The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.
Of course fsync ensures durable writes on systems like Linux with drives that honor FUA. The reliability of the device and stack in question is implied in this and anybody who talks about data integrity understands that. This is how you can calculate and manage error rates of your system.
"but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability"
I believe drives that do have capacitors are aware of it and return immediately from fsync() without writing to flash. That's the point of this API.
Since neither Macs nor any other laptops have SSDs with capacitors, this point is kind of moot.
"Silly wabbit, database trix are for servers!"
> The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.
Yeah, you can definitely write a transactional database without having to rely on knowing you've flushed data to disk. Not only can you, but you surely have to, otherwise you risk data corruption, e.g. when there's a power cut mid-write.
The whole point of transactional flush to disk is that you get confirmation that data is now safe from power loss. You don't get any guarantee because you _called_ flush. The guarantee comes from flush returning.
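Both points can live in one toy write-ahead log. This is a rough sketch, not how any particular database does it: the names `HEADER`, `append_record`, and `recover` are hypothetical, and `os.fsync()` stands in for whatever the platform's strongest flush is (`F_FULLFSYNC` on macOS). Each record carries a length and CRC32 so recovery can discard a tail torn by a power cut mid-write, and `append_record` only returns once the flush call has returned.

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(f, payload: bytes) -> None:
    """Append one record; returns only after the flush call has returned."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)
    f.flush()
    os.fsync(f.fileno())  # raises OSError on failure, so no false "committed"


def recover(path: str) -> list:
    """Return the records that were completely written, in order."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # clean end of log, or a torn header
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # torn or corrupt tail: treat as never committed
            records.append(payload)
    return records
```

The caller would acknowledge a transaction to the client only after `append_record` returns without raising. If the device lies about the flush, the checksum still catches a torn tail, but an acknowledged record can be missing after a crash, which is the case the thread is worried about.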
Lol, but hey, Macs are not servers, so "hahah who cares!".