Comment by oxplot

10 years ago

Isn't block-level duplication/checksumming like RAID supposed to solve this kind of hardware unreliability? I understand that RAID is not used by default on end-user desktops.

AFAIK, most RAID systems do not have checksums: parity lets them rebuild a block that is known to be missing, but it cannot tell which block in a stripe is silently wrong. I may be wrong, but I think RAID5/RAID6 can even amplify error frequency.
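
To illustrate the difference between parity and a checksum, here is a toy sketch (plain Python, made-up values, not any real RAID implementation): parity can regenerate a block you already know is gone, but a silently corrupted block reads back without complaint, and the resulting parity mismatch doesn't say which block is wrong.

```python
# Toy sketch of a 3-disk RAID5-style stripe (made-up names, not real RAID
# code): XOR parity can rebuild a block that is *known* to be lost, but a
# block that silently goes bad reads back with no error, and the parity
# mismatch alone cannot say which block to distrust.

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0 = b"hello wo"                 # data block on disk 0
d1 = b"rld_data"                 # data block on disk 1
parity = xor_blocks(d0, d1)      # parity block on disk 2

# Case 1: disk 1 dies outright (a *known* erasure) -- parity rebuilds it.
assert xor_blocks(d0, parity) == d1

# Case 2: disk 1 silently flips a byte.  A normal read of d1 returns the bad
# bytes with no error; a scrub sees the stripe is inconsistent, but single
# parity can't tell whether d0, d1, or the parity block is the liar.
d1_bad = b"rld_dXta"
print(xor_blocks(d0, d1_bad) == parity)   # False: inconsistent, but where?
```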

It gets more "fun" when you consider that many (most? all?) hard disks can return corrupted data without reporting any checksum failure.

Layer "violators" like ZFS and btrfs do have checksums.

Maybe conventional block / filesystem layering itself is faulty.

  • The layers that ZFS violates were created years before the failure modes that filesystem-based checksums address were well understood. I'm not sure how you _can_ solve these issues without violating the old layers.

    In particular: checksumming blocks alongside the blocks themselves (as some controllers and logical volume managers do) handles corruption within blocks, but it cannot catch dropped or misdirected writes. You need the checksum stored elsewhere, where you have some idea what the data _should_ be (see the sketch after this list). Once we (as an industry) learned that those failure modes do happen and are important to address, the old layers no longer made sense. (The claim about ZFS is misleading: ZFS _is_ thoughtfully layered -- it's just that the layers are different, and more appropriate given the better understanding of filesystem failure modes that people had when it was developed.)

  • DragonflyBSD's HAMMER is a non-layer-"violator" with checksums (not that I mind the violations; they're great).

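To make the dropped-write point above concrete, here's a toy sketch (hypothetical Python; none of this is actual ZFS, btrfs, or controller code): a checksum stored next to a block stays self-consistent after a dropped write, while a checksum stored in the parent block pointer knows what the data should have been and catches it.

```python
# Toy sketch (made-up names) of why the checksum has to live *elsewhere*:
# after a dropped write, a stale block and its stale side-by-side checksum
# are still self-consistent, so nothing looks wrong.  A checksum stored in
# the parent pointer (ZFS/btrfs style) expects the *new* data and catches it.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# "Disk": block 7 currently holds old data, checksummed alongside itself.
disk = {7: {"data": b"old contents", "self_cksum": checksum(b"old contents")}}

# The application writes new data, but the drive drops the write on the floor.
new_data = b"new contents"
# (no update of disk[7] ever happens)

# Scheme 1: checksum stored with the block -- the stale block verifies fine,
# so the dropped write goes completely unnoticed.
blk = disk[7]
print(checksum(blk["data"]) == blk["self_cksum"])   # True: looks "healthy"

# Scheme 2: checksum stored in the parent pointer -- the parent was updated
# to expect the new data, so reading the stale block fails verification.
parent_pointer = {"block": 7, "expected_cksum": checksum(new_data)}
print(checksum(blk["data"]) == parent_pointer["expected_cksum"])  # False
```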