Comment by ac29 · 6 years ago

> Many of the advanced features aren't implemented yet though, like compression, encryption, snapshots, RAID5/6...

Compression and encryption have been implemented, but not snapshots and RAID5/6.

Why would you want to embed RAID5/6 in the filesystem layer? Linux has battle-tested mdraid for this; I'm not going to trust a new filesystem's own implementation over it.

Same for encryption: there are already crypto layers at both the block level and the filesystem level (as an overlay).

  • Because the FS can be deeply integrated with the RAID implementation. With a normal RAID, if the data at some address is different between the two disks, there's no way for the FS to tell which is correct, because the RAID code essentially just picks one; it can't even see the other. With ZFS, for example, a checksum is stored with the data, so when you read, ZFS will check the data on both and pick the correct one. It will also overwrite the incorrect version with the correct one and log the error. (A sketch of this read-verify-repair loop follows this sub-thread.) It's the same kind of story with encryption: if it's built in, you can do things like incremental backups of an encrypted drive without ever decrypting it on the target.

    • > when you read, ZFS will check the data on both and pick the correct one.

      Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.

      > there's no way for the FS to tell which is correct

      This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.

      This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.

    • > With a normal RAID, if the data at some address is different between the two disks, there's no way for the FS to tell which is correct, because the RAID code essentially just picks one; it can't even see the other.

      That's a problem only with RAID1, only when copies=2 (granted, the most common case), and only when the underlying device cannot report which sector has gone bad.
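
A minimal Python sketch of the read-verify-repair loop debated in this sub-thread. As the reply above notes, a real implementation reads one copy and consults the others only on a checksum mismatch, whether the logic is integrated into the filesystem (as in ZFS) or talks to a separate redundancy layer through a per-copy read interface. Every name and interface here is hypothetical, not any system's actual code:

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Mirror:
    """Stand-in for one leg of a mirror; the interface is hypothetical."""
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}

    def read_block(self, n: int) -> bytes:
        return self.blocks.get(n, b"")

    def write_block(self, n: int, data: bytes) -> None:
        self.blocks[n] = data

    def __repr__(self) -> str:
        return self.name

def read_with_repair(mirrors, block, expected_sum):
    """Read `block`, trusting the checksum stored in (hypothetical)
    block-pointer metadata rather than either disk's copy of the data."""
    seen = []
    good = None
    for m in mirrors:
        data = m.read_block(block)
        seen.append((m, data))
        if checksum(data) == expected_sum:
            good = data
            break  # common case: the first copy verifies; no extra I/O
    if good is None:
        raise IOError(f"block {block}: no copy matches the checksum")
    # Heal any copy that failed verification, and log the repair.
    for m, data in seen:
        if checksum(data) != expected_sum:
            m.write_block(block, good)
            print(f"healed block {block} on {m}")
    return good

# Demo: diskA holds a corrupted copy; the read heals it from diskB.
a, b = Mirror("diskA"), Mirror("diskB")
a.write_block(0, b"garbage")
b.write_block(0, b"payload")
assert read_with_repair([a, b], 0, checksum(b"payload")) == b"payload"
assert a.read_block(0) == b"payload"  # the bad copy was overwritten
```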

  • > Why would you want to embed RAID5/6 in the filesystem layer?

    There are valid reasons, most having to do with filesystem usage and optimization. Off the top of my head:

    - more efficient re-syncs after failure (don't need to re-sync every block, only the blocks that were in use on the failed disk)

    - can reconstruct data not only when a disk self-reports a failure, but also on filesystem metadata errors (CRC mismatches, inconsistent dentries)

    - different RAID profiles for different parts of the filesystem (think: parity raid for large files, raid10 for database files, no raid for tmp, N raid1 copies for filesystem metadata)

    and for filesystem encryption:

    - CBC-style disk ciphers have a common weakness: the encryption block size is constant across the whole device. With FS-object encryption instead of whole-FS encryption, the block size, offset and even the encryption keys can vary across the disk.
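
To make that last point concrete, here is a minimal, hypothetical sketch of per-object key derivation, similar in spirit to how Linux's fscrypt derives per-file keys from a master key. The KDF construction, label format and parameter names are illustrative, not any filesystem's actual on-disk format:

```python
import hashlib
import hmac
import os

# Hypothetical master key; in practice it would be unwrapped from the
# user's passphrase at mount time.
MASTER_KEY = os.urandom(32)

def per_object_key(master: bytes, inode: int, generation: int) -> bytes:
    """Derive an independent encryption key per filesystem object
    (HMAC-SHA256 used as a simple KDF; the label format is made up)."""
    label = f"file-key:{inode}:{generation}".encode()
    return hmac.new(master, label, hashlib.sha256).digest()

# Every object gets its own key, so block sizes, offsets and keys need
# not be uniform across the disk.
k42 = per_object_key(MASTER_KEY, inode=42, generation=1)
k43 = per_object_key(MASTER_KEY, inode=43, generation=1)
assert k42 != k43
```

With independent per-object keys, the on-disk encryption parameters need not be uniform, which is exactly the flexibility the comment above is pointing at.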

  • I think even calling volume management a "layer", as though traditional storage was designed from first principles, is a mistake.

    Volume management is just a hack. We had all of these single-disk filesystems, but single disks were too small. So volume management was invented to present the illusion (in other words, the lie) that they were still on single disks.

    If you replace "disk" with "DIMM", it's immediately obvious that volume management is ridiculous. When you add a DIMM to a machine, it just works. There's no volume management for DIMMs.

    • Indeed, there is no volume management for RAM: you have to reboot to rebuild the memory layout! RAM is higher in the caching hierarchy and can be rebuilt at a smaller cost. You can't resize RAM while keeping its data because nobody bothered to introduce volume management for RAM.

      Storage is at the bottom of the caching hierarchy where people get inventive to avoid rebuilding. Rebuilding would be really costly there. Hence we use volume management to spare us the cost of rebuilding.

      RAM also tends to have uniform performance. Which is not true for disk storage. So while you don't usually want to control data placement in RAM, you very much want to control what data goes on what disk. So the analogy confuses concepts rather than illuminating commonalities.

  • > Why would you want to embed RAID5/6 in the filesystem layer?

    One of the creators of ZFS, Jeff Bonwick, explained it in 2007:

    > While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.

    * https://blogs.oracle.com/bonwick/rampant-layering-violation

  • It's not about ZFS. It's about CoW filesystems in general: since they offer functionality beyond the FS layer, they are both filesystems and logical volume managers.

  • Why does ZFS do RAIDZ in the filesystem layer?

    • It doesn't.

      RAIDZ is part of the VDEV (Virtual Device) layer. Layered on top of this is the ZIO (ZFS I/O layer). Together, these form the SPA (Storage Pool Allocator).

      On top of this layer we have the ARC and L2ARC (Adaptive Replacement Caches) and the ZIL (ZFS Intent Log).

      Then on top of this layer we have the DMU (Data Management Unit), and then on top of that we have the DSL (Dataset and Snapshot Layer). Together, the SPA and DSL layers implement the Meta-Object Set layer, which in turn provides the Object Set layer. These implement the primitives for building a filesystem and the various file types it can store (directories, files, symlinks, devices etc.) along with the ZPL and ZAP layers (ZFS POSIX Layer and ZFS Attribute Processor), which hook into the VFS.

      ZFS isn't just a filesystem. It contains as many levels of layering as any RAID and volume management setup composed of separate parts like mdraid+LVM (if not more), but they are much better integrated with each other.

      It can also store things that aren't filesystems. ZVOLs are fixed-size storage presented as block devices. You could potentially write additional storage facilities yourself as extensions, e.g. an object storage layer.
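
For reference, the stack this comment describes, summarized top (closest to the VFS) to bottom (closest to the disks) as a small runnable snippet. The layer roles paraphrase the comment above, not OpenZFS's actual source layout:

```python
# The ZFS stack as described in the parent comment, top to bottom.
ZFS_STACK = [
    ("ZPL / ZAP",         "POSIX semantics and attributes; hooks into the VFS"),
    ("DSL",               "datasets and snapshots"),
    ("DMU",               "objects and object sets"),
    ("ARC / L2ARC / ZIL", "caching and intent logging"),
    ("ZIO",               "the ZFS I/O pipeline"),
    ("VDEV",              "mirrors, RAIDZ and device aggregation; with ZIO, the SPA"),
]

for name, role in ZFS_STACK:
    print(f"{name:>18} | {role}")
```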