Comment by notacoward
6 years ago
> when you read, zfs will check the data on both and pick the correct one.
Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.
> there's no way for the fs to tell which is correct
This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.
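To make the idea concrete, here is a minimal sketch (Python, purely illustrative) of that cross-layer repair loop. It assumes a hypothetical lower-layer primitive for reading and rewriting a specific copy of a block, e.g. exposed via an ioctl; none of these names are a real kernel, LVM, or ZFS interface.

    import hashlib

    def checksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def repair_block(block_id, expected_sum, read_copy, write_copy, n_copies):
        """Find a copy whose checksum matches the FS-stored value, then
        rewrite the bad copies from the good one. read_copy/write_copy are
        the hypothetical per-copy primitives of the layer below."""
        good, bad = None, []
        for i in range(n_copies):
            data = read_copy(block_id, i)
            if checksum(data) == expected_sum:
                good = data
            else:
                bad.append(i)
        if good is None:
            raise IOError(f"block {block_id}: no intact copy found")
        for i in bad:
            write_copy(block_id, i, good)   # heal the damaged copy in place
        return good

The filesystem owns the checksum; the layer below owns the redundancy. Neither needs the other's internals.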
This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.
> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).
This is all great, and I assume it works well. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with particular FS setups. Even more complicating is the fact that most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn’t work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.
There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you’re working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I have definitely lost 100 GB of data from a multi-PB storage system on a top-500 HPC system due to bit rot. (One bad byte in a compressed data file left everything after it unrecoverable.) This would not have happened on ZFS.
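For anyone who hasn't hit this failure mode, here is a toy reproduction (Python/zlib, not the actual file format involved) of why one flipped byte in a compressed stream costs you everything after it:

    import zlib

    original = b"important result line\n" * 10000
    corrupted = bytearray(zlib.compress(original))
    corrupted[len(corrupted) // 2] ^= 0xFF        # simulate one byte of bit rot

    d = zlib.decompressobj()
    recovered = b""
    try:
        for i in range(0, len(corrupted), 4096):
            recovered += d.decompress(bytes(corrupted[i:i + 4096]))
    except zlib.error as e:
        print("decompression failed:", e)

    # Typically everything up to the bad byte comes back and everything
    # after it is gone -- unless a checksum plus a second copy lets you
    # repair the block instead.
    print(f"recovered {len(recovered)} of {len(original)} bytes")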
ZFS was/is a good effort to bring this functionality lower down the storage hierarchy, and it worked because it had knowledge of all the storage layers. Checksumming files/chunks helps most if you know about the filesystem and which files are still present, and it only makes a difference if you can access the lower-level storage devices to identify and fix problems.
> it is in no way generalizable to all the filesystems Linux has to support
Why not? If it's a standard LVM API then it's far more general than sucking everything into one filesystem like ZFS did. Much of this block-mapping interface already exists, though I'm not sure whether it covers this specific use case.
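For what it's worth, the contract I have in mind is small. A rough sketch (hypothetical names, not an existing Linux or LVM API): any redundancy provider advertises how many independent reconstructions of a block it can produce, and any checksumming filesystem can drive repair through that.

    from abc import ABC, abstractmethod

    class RedundantBlockDevice(ABC):
        """What any redundancy provider (mirror, RAID, Reed-Solomon, ...)
        would expose to the filesystems above it."""

        @abstractmethod
        def reconstructions(self, block_id: int) -> int:
            """How many independent ways block_id can be reconstructed."""

        @abstractmethod
        def read_reconstruction(self, block_id: int, index: int) -> bytes:
            """Return the block as produced by the index-th reconstruction."""

        @abstractmethod
        def rewrite(self, block_id: int, data: bytes) -> None:
            """Write back through all redundancy, healing any bad members."""

    class TwoWayMirror(RedundantBlockDevice):
        """Trivial provider: two dicts standing in for two disks."""
        def __init__(self):
            self.disks = [{}, {}]

        def reconstructions(self, block_id):
            return 2                      # one per mirror side

        def read_reconstruction(self, block_id, index):
            return self.disks[index].get(block_id, b"")

        def rewrite(self, block_id, data):
            for disk in self.disks:
                disk[block_id] = data

A checksumming filesystem then runs the repair loop sketched earlier against whichever provider happens to be underneath, without either side knowing the other's internals.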
> This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH.
At the time that ZFS was written (early 2000s) and released to the public (2006), integrating these layers was not yet an established approach, and the idea was somewhat novel / 'controversial'. Jeff Bonwick, ZFS co-creator, lays out their thinking:
* https://blogs.oracle.com/bonwick/rampant-layering-violation
Remember: this was a time when Veritas Volume Manager (VxVM) and other software still ruled the enterprise world.
* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation
I debated some of this with Bonwick (and Cantrill who really had no business being involved but he's pernicious that way) at the time. That blog post is, frankly, a bit misleading. The storage "stack" isn't really a stack. It's a DAG. Multiple kinds of devices, multiple filesystems plus raw block users (yes they still exist and sometimes even have reason to), multiple kinds of functionality in between. An LVM API allows some of this to have M users above and N providers below, for M+N total connections instead of M*N. To borrow Bonwick's own condescending turn of phrase, that's math. The "telescoping" he mentions works fine when your storage stack really is a stack, which might have made sense in a not-so-open Sun context, but in the broader world where multiple options are available at every level it's still bad engineering.
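To put illustrative numbers on it: with, say, five filesystems over four redundancy providers, a common API means 5 + 4 = 9 pieces of glue; folding every provider into every filesystem means 5 × 4 = 20.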
> ... but in the broader world where multiple options are available at every level it's still bad engineering.
When Sun added ZFS to Solaris, they did not get rid of UFS and/or SVM, nor prevent Veritas from being installed. When FreeBSD added ZFS, they did not get rid of UFS or GEOM either.
If an admin wanted or wants (or needs) to use the 'old' way of doing things, they can.
Sorry, I'm pernicious in what way, exactly?
The fact that RAID, LVM, etc. are traditionally not part of the filesystem is just an accident of history: no one wanted to rewrite their single-disk filesystems once they needed to support multiple disks. And the fact that administering storage is so uniquely hard is a direct result of that.
However it happened, modularity is still a good thing. It allows multiple filesystems (and other things that aren't quite filesystems) to take advantage of the same functionality, even concurrently, instead of each reinventing a slightly different and likely inferior wheel. It should not be abandoned lightly. Is "modularity bad" really the hill you want to defend?
> However it happened, modularity is still a good thing.
It may be a good thing, and it may not. Linux has a bajillion file systems, some more useful than others, and that is unique in some ways.
Solaris and other enterprise-y Unixes at the time had only one. Even the BSDs generally run only a few, rather than ext2/3/4, XFS, ReiserFS (remember when that was going to take over?), btrfs, bcachefs, etc., etc., etc.
At most, a company may have purchased a license for Veritas:
* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation
By rolling everything together, you get ACID writes, atomic space-efficient low-overhead snapshots (a toy sketch of the copy-on-write idea follows below), storage pools, etc. All this just by removing one layer of indirection and doing some telescoping:
* https://blogs.oracle.com/bonwick/rampant-layering-violation
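As a rough illustration of the snapshot point (a cartoon, not how ZFS is actually implemented): once the filesystem owns block allocation and never overwrites in place, a snapshot is just a saved block map, so it is atomic and consumes new space only for blocks written afterwards.

    class CowVolume:
        def __init__(self):
            self.blocks = {}      # physical block id -> data
            self.block_map = {}   # logical block -> physical block id
            self.next_phys = 0
            self.snapshots = {}   # snapshot name -> frozen copy of block_map

        def write(self, logical, data):
            phys = self.next_phys          # copy-on-write: never overwrite
            self.next_phys += 1
            self.blocks[phys] = data
            self.block_map[logical] = phys

        def snapshot(self, name):
            # O(map size), no data copied: atomic and space-efficient
            self.snapshots[name] = dict(self.block_map)

        def read(self, logical, snapshot=None):
            m = self.snapshots[snapshot] if snapshot else self.block_map
            return self.blocks[m[logical]]

    vol = CowVolume()
    vol.write(0, b"v1")
    vol.snapshot("before-upgrade")
    vol.write(0, b"v2")
    assert vol.read(0) == b"v2"
    assert vol.read(0, "before-upgrade") == b"v1"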
It's not "modularity bad", but that to achieve the same result someone would have had to write/expand a layer-to-layer API to achieve the same results, and no one did. Also, as a first-order estimate of complexity: how many lines of code (LoC) are there in mdraid/LVM/ext4 versus ZFS (or UFS+SVM on Solaris).
Other than esoteric high-performance use cases, I'm not really sure why you would need a plethora of filesystems. And the list of them that can actually be trusted is very short.
> I work on one of the four or five largest storage systems in the world
What would you recommend over ZFS for small-scale storage servers? XFS with mdraid?
I'd also love to hear your opinion on the Reiser5 paper.