Comment by hardwaresofton
4 years ago
I've actually run into some data loss running simple stuff like pgbench on Hetzner due to this -- I ended up just turning off write-back caching at the device level for all the machines in my cluster:
https://vadosware.io/post/everything-ive-seen-on-optimizing-...
Granted, I was doing something highly questionable (running postgres with fsync off on ZFS). It was very painful to get to the actual issue, but I'm glad I found out.
I've always wondered if it would be worth starting a simple data product that runs tests like these against various cloud providers, so people know where these corners are and what you're really getting for the money (or lack thereof).
[EDIT] To save people some time (that post is long), the command to set the feature is the following:
nvme set-feature -f 6 -v 0 /dev/nvme0n1
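To double-check that it took, the matching get-feature call should report the cache as disabled afterwards (bit 0 of the returned feature value is the enable bit):
nvme get-feature -f 6 -H /dev/nvme0n1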
The docs for `nvme` (the nvme-cli package, if you're Ubuntu-based) can be pieced together across some man pages:
https://man.archlinux.org/man/nvme.1
https://man.archlinux.org/man/nvme-set-feature.1.en
It's a bit hard to find all the NVMe features, but 6 is the one for controlling write-back caching.
https://unix.stackexchange.com/questions/472211/list-feature...
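One related shortcut: whether the drive has a volatile write cache at all is reported in the identify-controller data (the vwc field), so something like this tells you whether feature 6 even applies to your device:
nvme id-ctrl -H /dev/nvme0n1 | grep -i vwc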
I don't have an IDE on this machine, but I found this in the source code [1], probably pointing to [2]. Thanks for the tip!
[1]https://github.com/linux-nvme/nvme-cli/blob/master/nvme-prin...
[2]https://github.com/linux-nvme/libnvme/blob/master/src/nvme/t...
Ah thanks, this is perfect, saved those links!
Also: https://nvmexpress.org/developers/nvme-specification/
Unlike e.g. ATA and SCSI, the NVMe specs are freely available to the public. They're a little more complicated to read now that the spec has been split into a few modules, but finding the descriptions of all the optional features isn't too hard.
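If you'd rather poke at a feature programmatically than through nvme-cli, the Linux admin passthrough ioctl is enough. A rough sketch (untested; opcode 0x0a is Get Features per the admin command set, and bit 0 of the Volatile Write Cache feature value is the enable bit):

  /* get_wce.c -- query NVMe feature 6 (Volatile Write Cache) via the
     Linux admin passthrough ioctl. Sketch only; minimal error handling. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/nvme_ioctl.h>

  int main(void)
  {
      int fd = open("/dev/nvme0", O_RDONLY);  /* controller char device */
      if (fd < 0) { perror("open"); return 1; }

      struct nvme_admin_cmd cmd;
      memset(&cmd, 0, sizeof cmd);
      cmd.opcode = 0x0a;  /* Get Features */
      cmd.cdw10  = 0x06;  /* FID 6: Volatile Write Cache */

      int err = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
      if (err) { fprintf(stderr, "ioctl failed (%d)\n", err); return 1; }

      /* Completion dword 0 comes back in cmd.result; bit 0 is WCE. */
      printf("volatile write cache: %s\n", (cmd.result & 1) ? "on" : "off");
      return 0;
  }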
From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.
If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.
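For reference, the contract in question looks like this in C (a minimal sketch; POSIX lets fdatasync() skip metadata such as mtime that isn't needed to read the data back, but both calls must make the written data itself durable before returning):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("wal.bin", O_WRONLY | O_CREAT, 0644);
      if (fd < 0) { perror("open"); return 1; }

      const char rec[] = "record";
      if (write(fd, rec, sizeof rec) < 0) { perror("write"); return 1; }

      if (fdatasync(fd) < 0) perror("fdatasync");  /* data + essential metadata */
      if (fsync(fd) < 0) perror("fsync");          /* data + all metadata */

      close(fd);
      return 0;
  }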
That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.
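For anyone unfamiliar with it, EFAULT means a syscall was handed a pointer outside the caller's mapped address space, e.g.:

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      /* (void *)1 is not a mapped address, so the kernel's copy from
         user space fails and write() returns -1 with errno == EFAULT. */
      if (write(STDOUT_FILENO, (void *)1, 16) < 0 && errno == EFAULT)
          printf("write: %s\n", strerror(errno));  /* "Bad address" */
      return 0;
  }

That's an application-side bug (or a memory mapping that went away underneath it), not a report coming back from the device.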
> From reading your vadosware.io notes, I'm intrigued that replacing fdatasync with fsync is supposed to make a difference to durability at the device level. Both functions are supposed to issue a FLUSH to the underlying device, after writing enough metadata that the file contents can be read back later.
Yeah, I thought the same initially, which is why I was super confused --
> If fsync works and fdatasync does not, that strongly suggests a kernel or filesystem bug in the implementation of fdatasync that should be fixed.
Gulp.
> That said, I looked at the logs you showed, and those "Bad Address" errors are the EFAULT error, which only occurs in buggy software, or some issue with memory-mapping. I don't think you can conclude that NVMe writes are going missing when the pg software is having EFAULTs, even if turning off the NVMe write cache makes those errors go away. It seems likely that that's just changing the timing of whatever is triggering the EFAULTs in pgbench.
It looks like I'm going to have to do some more experimentation on this -- maybe I'll get a fresh machine and try to reproduce this issue again.
What led me to suspect the NVMe drive was dropping writes was the complete lack of errors on the pg and OS side (dmesg, etc.).
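One more place worth checking besides dmesg, in case the drive itself logged anything: nvme-cli can pull the controller's own error and SMART logs:
nvme error-log /dev/nvme0n1
nvme smart-log /dev/nvme0n1
If the device were failing writes it would usually show up there; silently dropped writes, of course, wouldn't.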