Disks Lie: Building a WAL that actually survives

4 months ago (blog.canoozie.net)

I’ve seen disks do off-track writes, dropped writes due to write-channel failures, and dropped writes because the media had literally been scrubbed off the platter. You need LBA-seeded CRCs to catch these failures, along with a number of other checks. I get excited when people write about this in the industry. These are extremely interesting failure modes that I’ve been lucky enough to be exposed to, at volume, for a large fraction of my career.

People consistently underestimate the many ways in which storage can and will fail in the wild.

The most vexing storage failure is the phantom write: a disk read returns a "valid" page, just not the last-written/fsynced version of that page. Reliably detecting this case is very expensive, particularly on large storage volumes, so it is rarely done for storage where performance is paramount.

  • This is not that uncommon a failure mode for some SSDs; an unclean shutdown is a dice roll for some of them: maybe you get what you wrote five seconds ago, maybe you get a snapshot from a couple of hours ago.

    • Early SSDs were particularly prone to phantom writes due to firmware bugs. Still have scars from the many creative ways in which early SSDs would routinely fail.

This looks AI-generated, including the linked code. That would explain why the .zig-cache directory and the binary are checked into Git, why there's redundant commenting, and why the README has that bold, bullet-pointed, header-heavy style that is typical of AI.

If you can't be bothered to write it, I can't be bothered to read it.

  • The front page this weekend has been full of this stuff. If there’s a hint of clickbait about the title, it’s almost a foregone conclusion you’ll see all the other LLM tics, too.

    These do not make the writing better! They obscure whatever the insight is behind LinkedIn-engagement tricks and turns of phrase that obfuscate rather than clarify.

    I’ll keep flagging and see if the community ends up agreeing with me, but this is making more and more of my HN experience disappointing instead of delightful.

I worked with a greybeard who instilled in me that whenever we were about to do RAID maintenance, we would always run sync twice, the second to make sure it returns immediately. And I added a third for my own anxiety.

> Submit the write to the primary file

> Link fsync to that write (IOSQE_IO_LINK)

> The fsync's completion queue entry only arrives after the write completes

> Repeat for secondary file

Wait, so the OS can re-order the fsync() to happen before the write request it is supposed to be syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.

> O_DSYNC: Synchronous writes. Don't return from write() until the data is actually stable on the disk.

If you call fsync(), this isn't needed, correct? And if you use this, then fsync() isn't needed, right?

  • > Wait, so the OS can re-order the fsync() to happen before the write request it is supposed to be syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.

    This is an io_uring-specific thing. It doesn't guarantee any ordering between operations submitted at the same time, unless you explicitly ask it to with the `IOSQE_IO_LINK` they mentioned.

    Otherwise it's as if you called write() from one thread and fsync() from another, before waiting for the write() call to return. That obviously defeats the point of using fsync() so you wouldn't do that.

    > If you call fsync(), [O_DSYNC] isn't needed correct? And if you use [O_DSYNC], then fsync() isn't needed right?

    I believe you're right.

This article is pretty low quality. It's an important and interesting topic, and the article is mostly right, but it's not clear enough to rely on.

The OS page cache is not a "problem"; it's a basic feature with well-documented properties that you need to learn if you want to persist data. The writing style seems off in general (e.g. "you're lying to yourself").

AFAIK fsync is the best practice, not O_DIRECT + O_DSYNC. The article mentions O_DSYNC in some places and fsync in others, which is confusing. You don't need both.

Personally I would prefer to use the filesystem (RAID or ditto blocks) to handle latent sector errors (LSEs) rather than duplicating files at the app level. A case could be made for dual WALs if you don't know or control what filesystem will be used.

Due to the page cache, attempting to verify writes by reading the data back won't verify anything: you'll just read your own bytes back out of cache. Maaaybe this will work when using O_DIRECT.

https://en.wikipedia.org/wiki/Data_Integrity_Field

This, along with RAID-1, is probably sufficient to catch the majority of errors. But realize that these are just probabilities: if the failure can happen on the first drive, it can also happen on the second. A merkle tree is commonly used to protect against these scenarios as well.

Notice that using something like RAID-5 can let corruption migrate throughout the stripe when certain write algorithms are used.

  • The paranoid would also follow the write with a read command with the SCSI FUA (force unit access) bit set, requiring the disk to read from the physical media and confirming the data is really written to that rotating rust. Trying to do something similar with SATA or NVMe drives might be more complicated, or maybe impossible. That’s the method to ensure your data is actually written to viable media and can be subsequently read.

I thought an fsync on the containing directory of each of the logs was needed to ensure that the newly created files were durably present in the directories.

  • Right, you do need to fsync when creating new files to ensure the directory entry is durable. However, WAL files are typically created once and then appended to for their lifetime, so the directory fsync is only needed at file creation time, not during normal operations.

> Conclusion

> A production-grade WAL isn't just code, it's a contract.

I hate that I'm now suspicious of this formulation.