← Back to context

Comment by consp

4 days ago

Blind question with no attempt to look it up: why don't filesystems do this? It won't work for most boot code but that is relatively easy to fix by plugging it in somewhere else.

Wrong layer.

SSDs know which blocks have been written to a lot, have been giving a lot of read errors before etc., and often even have heterogeneous storages (such as a bit of SLC for burst writing next to a bunch of MLC for density).

They can spend ECC bits much more efficiently with that information than a file system ever could, which usually sees the storage as a flat, linear array of blocks.

  • This is true, but nevertheless you cannot place your trust only in the manufacturer of the SSD/HDD, as I have seen enough cases when the SSD/HDD reports no errors, but nonetheless it returns corrupted data.

    For any important data you should have your own file hashes, for corruption detection, and you should add some form of redundancy for file repair, either with a specialized tool or simply by duplicating the file on separate storage media.

    A database with file hashes can also serve other purposes than corruption detection, e.g. it can be used to find duplicate data without physically accessing the archival storage media.

    • Verifying at higher layers can be ok (it's still not ideal!), but trying to actively fix things below that are broken usually quickly becomes a nightmare.

  • IMO it's exactly the right layer, just like for ECC memory.

    There's a lot of potential for errors when the storage controller processes and turns the data into analog magic to transmit it.

    In practice, this is a solved problem, but only until someone makes a mistake, then there will be a lot of trouble debugging it between the manufacturer certainly denying their mistake and people getting caught up on the usual suspects.

    Doing all the ECC stuff right on the CPU gives you all the benefits against bitrot and resilience against all errors in transmission for free.

    And if all things go just right we might even be getting better instruction support for ECC stuff. That'd be a nice bonus

    • > There's a lot of potential for errors when the storage controller processes and turns the data into analog magic to transmit it.

      That's a physical layer, and as such should obviously have end-to-end ECC appropriate to the task. But the error distribution shape is probably very different from that of bytes in NAND data at rest, which is different from that of DRAM and PCI again.

      For the same reason, IP does not do error correction, but rather relies on lower layers to present error-free datagram semantics to it: Ethernet, Wi-Fi, and (managed-spectrum) 5G all have dramatically different properties that higher layers have no business worrying about. And sticking with that example, once it becomes TCP's job to handle packet loss due to transmission errors (instead of just congestion), things go south pretty quickly.

      2 replies →

    • ECC memory modules don’t do their own very complicated remapping from linear addresses to physical blocks like SSDs do. ECC memory is also oriented toward fixing transient errors, not persistently bad physical blocks.

You can still do this for boot code if the error isn't significant enough to make all of the boot fail. The "fixing it by plugging it in somewhere else" could then also be simple enough to the point of being fully automated.

ZFS has "copies=2", but iirc there are no filesystems with support for single disk erasure codes, which is a huge shame because these can be several orders of magnitude more robust compared to a simple copy for the same space.

The filesystem doesn't have access to the right existing ECC data to be able to add a few bytes to do the job. It would need to store a whole extra copy.

There are potentially ways a filesystem could use heirarchical ECC to just store a small percentage extra, but it would be far from theoretically optimal and rely on the fact just a few logical blocks of the drive become unreadable, and those logical blocks aren't correlated in write time (which I imagine isn't true for most ssd firmware).

  • CD storage has an interesting take, the available sector size varies by use, i.e. audio or MPEG1 video (VideoCD) at 2352 data octets per sector (with two media level ECCs), actual data at 2048 octets per sector where the extra EDC/ECC can be exposed by reading "raw". I learned this the hard way with VideoPack's malformed VCD images, I wrote a tool to post-process the images to recreate the correct EDC/ECC per sector. Fun fact, ISO9660 stores file metadata simultaneously in big-endian and little form (AFAIR VP used to fluff that up too).

  • Reed Solomon codes, or forward error correction is what you’re discussing. All modern drives do it at low levels anyway.

    It would not be hard for a COW file system to use them, but it can easily get out of control paranoia wise. Ideally you’d need them for every bit of data, including metadata.

    That said, I did have a computer that randomly bit flipped when writing to storage sometimes (eventually traced it to an iffy power supply), and PAR (a type of reed solomon coding forward error correction library) worked great for getting a working backup off the machine. Every other thing I tried would end up with at least a couple bit flip errors per GB, which make it impossible.

You can, but only if your CPU is directly connected to a flash chip with no controller in the way. Linux calls it the mtd subsystem (memory technology device).