
Comment by xenadu02

4 years ago

My script repeatedly writes a counter value "lines=$counter" to a file, then calls fcntl() with F_FULLFSYNC against that file descriptor which on macOS ends up doing an NVMe FLUSH to the drive (after sending in-memory buffers and filesystem metadata to the drive).

Once those calls succeed it increments the counter and tries again.

As soon as the write() or fcntl() fails, it prints the last successfully written counter value, which can be checked against the contents of the file. Remember: the semantics of the API and the NVMe spec mean that a successful return from fcntl(fd, F_FULLFSYNC) on macOS requires the data to be durable at that point, no matter what filesystem metadata OR drive-internal metadata is needed to make that happen.
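For illustration, a minimal C sketch of that kind of loop might look like the following (the file path is a placeholder, not the OP's actual script, and error handling is trimmed to the essentials):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Placeholder path on the external drive under test. */
    int fd = open("/Volumes/testdrive/counter.txt",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (unsigned long counter = 0;; counter++) {
        char line[64];
        int len = snprintf(line, sizeof line, "lines=%lu\n", counter);

        if (write(fd, line, (size_t)len) != len) {
            perror("write");
            break;
        }
        /* F_FULLFSYNC asks macOS to push buffered data, filesystem
           metadata, and the drive's volatile write cache to stable
           storage before returning success. */
        if (fcntl(fd, F_FULLFSYNC) == -1) {
            perror("fcntl(F_FULLFSYNC)");
            break;
        }
        /* Only counters that reach this line are "known durable". */
        printf("last durable counter: %lu\n", counter);
    }
    return 0;
}
```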

In my test while the script is looping doing that as fast as possible I yank the TB cable. The enclosure is bus powered so it is an unceremonious disconnect and power off.

Two of the tested drives always matched up: whatever the counter was when write()+fcntl() succeeded is what I read back from the file.

Two of the drives sometimes failed: the file contained a counter value < the most recent successful value, meaning write()+fcntl() reported success but upon remount the data was gone.

Anytime a drive reported a counter value +1 from what was expected, I still counted that as a success... after all, there's a race window where the drive has completed the flush but the kernel hasn't gotten the ACK yet. If the disconnect happens at that moment, fcntl() will report failure even though the flush succeeded. No data is lost, so that's not a "real" error.

On very recent Linux kernels you can open the raw NVMe device and use the NVMe passthrough ioctl to send NVMe commands directly (or you can use SPDK on essentially any Linux kernel) and bypass whatever the fsync implementation is doing. That gives a much more direct test of the hardware (and some vendors have automated tests that do this with SPDK and IP power switches!). There's a bunch of complexity around atomicity of operations during power failure, beyond just flush, that has to get verified.
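For reference, a bare flush through that passthrough interface looks roughly like this (a sketch only; the device path and namespace ID are assumptions and will differ per system):

```c
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    /* Assumed namespace block device; adjust for the drive under test. */
    int fd = open("/dev/nvme0n1", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct nvme_passthru_cmd cmd;
    memset(&cmd, 0, sizeof cmd);
    cmd.opcode = 0x00;  /* NVM command set: Flush */
    cmd.nsid   = 1;     /* assumed namespace ID */

    /* Hands the command to the controller directly, bypassing the
       filesystem's fsync path entirely. */
    if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0) {
        perror("NVME_IOCTL_IO_CMD");
        close(fd);
        return 1;
    }
    puts("flush acknowledged by the controller");
    close(fd);
    return 0;
}
```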

But the way you tested is almost certainly valid.

Is it possible the next write was incomplete when the power cut out? Wouldn't this depend on how updates to file data are managed by the filesystem? The size and alignment of disk and filesystem data & metadata blocks?

  • Yes, kinda. If the drive completes the flush but gets disconnected before the kernel can read the ack then I can get an error from fcntl(). In theory it's possible I could get an error from write() even though it succeeded but I don't know if that is possible in practice.

    In any case the file's last line will have a counter value +1 compared to what I expected. That is counted as a success.

    Failure is only when a line was written to the file with counter==N, fcntl(fd, F_FULLFSYNC) reports success all the way back to userspace, yet the file has a value < N. This gives the drive a fairly big window to claim it finished flushing as the ack races back to userspace, but even so two of the drives still failed. The SK Hynix Gold P31 sometimes lost multiple writes (N-2), meaning two flush cycles were not enough.
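Spelled out as code, that pass/fail rule (including the tolerated +1 race) is roughly the following sketch; parsing the counter out of the file's last line is assumed to happen elsewhere:

```c
#include <stdio.h>

/* reported: last counter for which write()+fcntl(F_FULLFSYNC) returned
   success before the disconnect.
   on_disk:  counter parsed from the file's last line after remount. */
static const char *classify(unsigned long reported, unsigned long on_disk) {
    if (on_disk < reported)
        return "FAILURE: drive acknowledged a flush it had not made durable";
    if (on_disk == reported + 1)
        return "success (+1: flush finished but the ack never reached userspace)";
    return "success";
}

int main(void) {
    /* Example: fcntl() last reported success at counter 41. */
    puts(classify(41, 41));  /* success */
    puts(classify(41, 42));  /* success (+1 race) */
    puts(classify(41, 39));  /* FAILURE */
    return 0;
}
```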

This seems like it would only work with an external enclosure setup. I wonder if a test could be performed in the usual NVMe slot.

Of course, it seems it would be much harder to pull main power for the entire PC. I'm not sure how you'd do that - maybe a high-speed camera and a high-refresh monitor to capture the last output counter? Still no guarantee, I'm afraid.

  • If you have a host system that has reasonable PCIe hotplug support and won't panic at a device dropping off the bus, then you can just use a riser card that can control power provided over a PCIe slot.

    Quarch makes power injection fixtures for basically all drive connectors, to be paired with their programmable power supply for power loss testing or voltage margin testing (quite important when M.2 drives pull 2.5+A over the 3.3V rail and still want <5% voltage drop).

  • There are plenty of network-controlled power outlets: either enterprise/rackmount PDUs, or consumer wifi outlets, or rig something up with a serial/parallel port and a relay. You'd use an always-on test-runner computer to control the power state.

    The computer under test would boot from PXE, read from the drive on boot to determine the last write, send that to the test runner for analysis, then begin the write sequence and report to the test runner ASAP at each flush (a minimal sketch of that client loop follows this list). The test runner turns the power off at random, waits a minute (or 10 seconds, whatever), turns it back on, and starts again.

    In a well-functioning system, you should often get back the last reported successful write, and sometimes get back a write beyond the last reported write (two generals and all), but never a write before the last reported write. You can't use this testing to prove correct flushing, but if you run for a week and it doesn't fail once, the drive probably isn't lying.

    I haven't evaluated the code, but here's a post from 2005 with a link to code that probably works for this. (Note: this doesn't include the PXE booting or the power control... it just covers what to write to the disk, how to report it to another machine, and how to check the results after a power cycle.)

    https://brad.livejournal.com/2116715.html?

  • Put the usual NVMe drive in an external enclosure, which is what the OP did.
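As referenced in the PXE bullet above, here is a minimal sketch of the client side of that write-and-report loop (Linux this time, so plain fsync(); the runner's address, port, and mount path are all assumptions):

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Assumed mount point of the drive under test. */
    int fd = open("/mnt/testdisk/counter.txt",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* One UDP datagram per durable counter keeps the report path simple. */
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in runner = { 0 };
    runner.sin_family = AF_INET;
    runner.sin_port   = htons(9000);                    /* assumed runner port */
    inet_pton(AF_INET, "192.0.2.10", &runner.sin_addr); /* assumed runner IP */

    for (unsigned long counter = 0;; counter++) {
        char line[64];
        int len = snprintf(line, sizeof line, "lines=%lu\n", counter);

        if (write(fd, line, (size_t)len) != len) break;
        /* On Linux, fsync() is expected to flush the drive's write cache,
           playing the role F_FULLFSYNC plays on macOS. */
        if (fsync(fd) != 0) break;

        /* Report the newly durable counter to the always-on test runner. */
        sendto(sock, line, (size_t)len, 0,
               (struct sockaddr *)&runner, sizeof runner);
    }
    return 0;
}
```

UDP is enough here: a dropped report only lowers the runner's notion of the last confirmed counter, so it can mask a marginal failure but can never produce a spurious one.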