Comment by geocar

4 years ago

> When building databases, we care about durability, so database authors are usually well aware that you _have_ to use `F_FULLFSYNC` for safety. The fact that `F_FULLFSYNC` isn't safe means that you cannot write a transactional database on Mac, which is also a surprise to me.

> Without that, there is no way to ensure durable writes and you might get data loss or data corruption.

No, not without that. Even with that, you can't have durable writes, not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).
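For readers who have not used the two calls being argued about, here is a minimal sketch of what "fsync()/fcntl+F_FULLFSYNC" looks like from Python. The helper names `flush_to_device` and `write_and_flush` are made up for illustration, and the sketch assumes a Unix-like system where the `fcntl` module is available. `F_FULLFSYNC` is only defined on platforms that provide it (macOS, for example), so it is looked up defensively.

```python
import fcntl
import os


def flush_to_device(fd: int) -> str:
    """Ask the OS to push fd's data toward stable storage.

    Returns the name of the mechanism that was used. Neither call protects
    against the device itself failing afterwards.
    """
    # Assumption: F_FULLFSYNC may be missing on this platform, so do not
    # reference fcntl.F_FULLFSYNC directly.
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems reject F_FULLFSYNC; fall back to fsync().
            pass
    os.fsync(fd)
    return "fsync"


def write_and_flush(path: str, data: bytes) -> str:
    """Write data to path, then flush it (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        return flush_to_device(fd)
    finally:
        os.close(fd)
```

Either path only asks the OS and the drive to persist what was written; whether that counts as "durable" is exactly what the rest of this thread disputes.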

As soon as you have two database servers, you're in much better shape. Many databases like to try and use fsync() as a barrier to that replication, but this is a waste of time because your chances of a single hardware failure remain the same; the only thing that really matters is that 1/2 is smaller than 1/1.
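One way to read the "two is better than one" argument, as a rough sketch: if each copy is lost with some probability and the copies fail independently (a strong assumption that shared power, racks, or firmware bugs can break), the chance of losing every copy shrinks with each extra copy. The function name and the numbers below are made up for illustration.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent failures.

    p_single is the (assumed) probability that one copy is lost over the
    period of interest; copies is how many separate copies exist.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


# Illustrative numbers only: a 2% chance of losing one server.
one_server = loss_probability(0.02, 1)
two_servers = loss_probability(0.02, 2)
print(one_server, two_servers)
```

Under these assumptions the second copy cuts the chance of total loss by a factor of 50, and no amount of fsync() on a single machine changes the first number.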

So okay, maybe you're not trying to protect against all hardware failure, or even just the flash failure (it will fail when it fails! Better to have two NVMe boards than one!) but maybe just some failure -- like a power failure, but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability because that's not the failure you're trying to protect against.

What does fsync() actually protect against? Well, sometimes that battery fails, or that capacitor blows. The hardware needed to write data to a spinning platter of metal and rust used to have a lot more failure points than today's solid state, and in those days, maybe it made some sense to add a system call instead of adding more hardware. Modern systems aren't like that: it is almost always cheaper in the long run to just buy two than to try and squeeze a little more edge out of one. Maybe, if there's a case where fsync() helps today, it's a situation where that isn't true, but even that is a long way from "you need fsync() to have durable writes and avoid data loss or corruption".

> No, not without that. Even with that, you can't have durable writes, not on a Mac, or Linux, or anywhere else, if you are worried about fsync()/fcntl+F_FULLFSYNC, because they do nothing to protect against hardware failure. The only thing that does is shipping the data someplace else (and depending on the criticality of the data, possibly quite far).

"The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.

Of course fsync ensures durable writes on systems like Linux with drives that honor FUA. The reliability of the device and stack in question is implied in this and anybody who talks about data integrity understands that. This is how you can calculate and manage error rates of your system.

  • > "The sun might explode so nothing guarantees integrity", come on, get real. This is pointless nitpicking.

    I think most people understand that there is a huge difference between the sun exploding and a single hardware failure.

    If you really don't understand that, I have no idea what to say.

    > Of course fsync ensures durable writes on systems like Linux with drives that honor FUA

    No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.

    • What is the difference in the context of your comment? The likelihood of the risk, and nothing else. So what is the exact magic amount of risk that makes one thing durable and another not, and who made you the arbiter of this?

      > No it does not. The drive can still fail after you write() and nobody will care how often you called fsync(). The only thing that can help is writing it more than once.

      It does to anybody who actually understands these definitions. It is durable according to the design (i.e., UBER rates) of your system. That's what it means, that's always what it meant. If you really don't understand that, I have no idea what to say.

      > The only thing that can help is writing it more than once.

      This just shows a fundamental misunderstanding. You achieve a desired uncorrected error rate by looking at the risks and designing parts and redundancy and error correction appropriately. The reliability of one drive/system might be greater than two less reliable ones, so "writing it more than once" is not only not the only thing that can help, it doesn't necessarily achieve the required durability.

    • Say you have mirrored devices. Or RAID-5, whatever. Say the devices don't lie about flushing caches. And you fsync(), and then power fails, and on the way back up you find data loss or worse, data corruption. The devices didn't fail. The OS did.

      One need not even assume no device failure, since that's the point of RAID: to make up for some not-insignificant device failure rate. We need only assume that not too many devices fail at the same time. A pretty reasonable assumption. One relied upon all over the world, across many data centers.
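The disagreement in this subthread (one reliable device versus two less reliable ones, and arrays that survive as long as "not too many devices fail at the same time") can be put into numbers with a small sketch. It assumes every device fails independently with the same probability, which real arrays only approximate, and the function name and figures are invented for illustration.

```python
from math import comb


def prob_more_than(tolerated: int, devices: int, p_fail: float) -> float:
    """Probability that more than `tolerated` of `devices` fail.

    Assumes independent failures with identical probability p_fail, i.e. a
    binomial model. Data is lost once failures exceed what the layout
    tolerates (0 for a single drive, 1 for a two-way mirror or RAID-5).
    """
    return sum(
        comb(devices, k) * p_fail**k * (1 - p_fail) ** (devices - k)
        for k in range(tolerated + 1, devices + 1)
    )


# Illustrative numbers only.
single_good_drive = prob_more_than(0, 1, 0.001)   # one reliable drive
mirror_of_poor_drives = prob_more_than(1, 2, 0.05)  # two unreliable drives
print(single_good_drive, mirror_of_poor_drives)
```

With these made-up figures the mirror of poor drives loses data more often than the single good drive, which is the point about designing to a target error rate rather than simply counting copies. With equally reliable drives the mirror wins, which is the point about writing it more than once.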

"but guess what: We just need to put a big beefy capacitor on the board, or a battery someplace to protect against that. We don't need to write the flash blocks and read them back before returning from fsync() to get reliability"

I believe drives that do have capacitors are aware of it and return immediately from fsync() without writing to flash. That's the point of this API.

Since neither Macs nor any other laptops have SSDs with capacitors, this point is kind of moot.

  • Erm. They absolutely do. Most laptops have batteries as well, including all of the ones that Apple makes.

    • I have at various points replaced or upgraded 15 NVMe SSDs in desktops and laptops, and I have not seen a single one - could you please let me know where I can find a non-server SSD with capacitors that are large enough for it to flush data in case of a sudden power loss?

      Laptop batteries are irrelevant - battery failure, freezing, or cutting power to the circuit board by holding the off button are the failure modes you have to protect against.