Comment by benlwalker

4 years ago

A flush command only guarantees, upon completion, that all writes COMPLETED prior to submission of the flush are non-volatile. Not all previously sent writes. NVMe base specification 2.0b section 7.1.

That's a very important distinction. You can't assume a write is durable just because it completed before the flush completed. It's only covered if it completed before you sent the flush.

I'm not very confident that software is actually getting this right all that often, although it probably is in this fsync test.

Is there a separate barrier command so you don't have to track all the writes individually in software?

  • Nope. The reason this is so complex is that these devices are actually highly parallel machines with multiple queues accepting commands. It's quite difficult to even define "before" in terms of command sequence. For example, if you have a device with two hardware queues for submitting commands and a software thread for each, and you submit a flush on one queue, which commands on the other queue does it affect?

    Or what if the device issues a PCI write to the completion entry, and that write passes a flush command that is being submitted on the wire at the same time?

    I think the only interpretation that makes sense is from the perspective of a single software thread. If that particular thread has seen the completion via any mechanism and then that thread issues the flush, then you know the write is durable. Other than that, the device makes no promises.
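
To make the single-thread rule concrete, here is a minimal sketch of the bookkeeping one submitting thread would need. It is a pure simulation and does not talk to any device; the `FlushTracker` class and its method names are hypothetical, invented for illustration. The key step is the snapshot in `submit_flush`: only writes whose completions the thread has already seen at that moment are treated as covered.

```python
class FlushTracker:
    """Hypothetical per-thread bookkeeping; not part of any real NVMe API."""

    def __init__(self):
        self.in_flight = set()     # writes submitted, completion not yet seen
        self.completed = set()     # completion seen, durability not yet guaranteed
        self.durable = set()       # covered by a flush that has completed
        self.pending_flushes = {}  # flush id -> writes that flush covers
        self._next_flush_id = 0

    def submit_write(self, write_id):
        self.in_flight.add(write_id)

    def complete_write(self, write_id):
        self.in_flight.remove(write_id)
        self.completed.add(write_id)

    def submit_flush(self):
        # Snapshot at submission time: writes that complete later are NOT
        # covered, even if they complete before this flush does.
        flush_id = self._next_flush_id
        self._next_flush_id += 1
        self.pending_flushes[flush_id] = frozenset(self.completed)
        return flush_id

    def complete_flush(self, flush_id):
        covered = self.pending_flushes.pop(flush_id)
        self.completed -= covered
        self.durable |= covered
        return covered


tracker = FlushTracker()
tracker.submit_write("A")
tracker.submit_write("B")
tracker.complete_write("A")
flush = tracker.submit_flush()
tracker.complete_write("B")    # completes after the flush was submitted
tracker.complete_flush(flush)
print(sorted(tracker.durable))    # ['A']
print(sorted(tracker.completed))  # ['B'] still needs a later flush
```

Write B completed before the flush completed, but it still isn't counted as durable, because its completion was observed after the flush was submitted. It needs another flush.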