Comment by wheybags
6 years ago
Because the FS can be deeply integrated with the RAID implementation. With a normal RAID, if the data at some address is different between the two disks, there's no way for the FS to tell which is correct, because the RAID code essentially just picks one; the FS can't even see the other copy. With ZFS, for example, there is a checksum stored with the data, so when you read, ZFS will check the data on both and pick the correct one. It will also overwrite the incorrect version with the correct one and log the error. It's the same kind of story with encryption: if it's built in, you can do things like incremental backups of an encrypted drive without ever decrypting it on the target.
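Roughly, the read path works something like this (a conceptual sketch in Python, not ZFS's actual code; the `read_copy`/`write_copy` calls and the SHA-256 are stand-ins for the real vdev interface and block-pointer checksum):

```python
import hashlib

def read_block(mirrors, offset, length, expected_checksum):
    """Self-healing mirrored read (conceptual sketch, not ZFS's real code).

    `mirrors` is a list of objects exposing read_copy()/write_copy(),
    hypothetical stand-ins for the mirror legs. `expected_checksum` is the
    checksum the filesystem stores with the block pointer, i.e. *above* the
    data it protects, which is what lets it arbitrate between copies.
    """
    good, bad = None, []
    for m in mirrors:
        data = m.read_copy(offset, length)
        if hashlib.sha256(data).digest() == expected_checksum:
            if good is None:
                good = data            # first copy that passes the checksum
        else:
            bad.append(m)              # this copy needs repair
    if good is None:
        raise IOError("all copies fail the checksum: unrecoverable block")
    for m in bad:
        m.write_copy(offset, good)     # self-heal the stale/corrupt copy
        print(f"repaired checksum error at offset {offset}")   # and log it
    return good
```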
> when you read, ZFS will check the data on both and pick the correct one.
Are you sure about that? Always reading both doubles read I/O, and benchmarks show no such effect.
> there's no way for the fs to tell which is correct
This is not an immutable fact that precludes keeping the RAID implementation separate. If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID). I've seen it detect and fix bad blocks, many times. It works, even with separate layers.
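For a sense of the shape of such a repair pass, here's an illustrative Python sketch (not our production code; `shards`, `k`, and `rs_decode` are hypothetical stand-ins, and brute-forcing subsets is only for clarity):

```python
import hashlib
from itertools import combinations

def recover_stripe(shards, k, rs_decode, expected_checksum):
    """Checksum-driven repair across separate layers (illustrative sketch).

    The filesystem knows only `expected_checksum` for the decoded data;
    the lower layer merely exposes the individual shards/copies (`shards`,
    a list of (index, bytes) pairs read one at a time). `rs_decode` is a
    hypothetical Reed-Solomon decoder that needs any k shards. Real code
    is smarter than brute-forcing subsets, but the shape is the same.
    """
    for subset in combinations(shards, k):
        data = rs_decode(subset)
        if hashlib.sha256(data).digest() == expected_checksum:
            # `subset` is known good; shards outside it that disagree with
            # a re-encode of `data` can now be rewritten through the lower
            # layer's interface.
            return data, subset
    raise IOError("no k-shard subset decodes to data matching the checksum")
```

The arbitration (the checksum) lives in the filesystem; the redundancy (the shards or copies) lives a layer below, reachable through a narrow read-this-specific-copy interface.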
This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH. Understanding other complex subsystems is hard, and it's more fun to write new code instead.
> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).
This is all great, and I assume it works well. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with a particular set of FS setups. Complicating things further, most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn't work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.
There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you're working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I have definitely lost 100 GB of data to bit rot on a multi-PB storage system attached to a top-500 HPC system. (One bad byte in a compressed data file left everything after the bad byte unrecoverable.) This would not have happened on ZFS.
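If you've never seen that failure mode, it's easy to reproduce the effect, with zlib standing in for whatever compressor the original file used (illustrative only; the sizes and chunking are made up):

```python
import zlib

original = b"important scientific data\n" * 100_000   # ~2.5 MB, compresses well
blob = bytearray(zlib.compress(original))

blob[len(blob) // 2] ^= 0x01                           # flip a single bit mid-stream

d = zlib.decompressobj()
recovered = b""
try:
    for i in range(0, len(blob), 1024):                # feed it back in chunks
        recovered += d.decompress(bytes(blob[i:i + 1024]))
    recovered += d.flush()
except zlib.error as exc:
    print(f"decompression aborted: {exc}")

print(f"recovered {len(recovered)} of {len(original)} bytes")
```

Everything from the flipped bit onward is typically garbage or simply gone.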
ZFS was/is a good effort to bring this functionality lower down the storage hierarchy, and it worked because it had knowledge of all the storage layers. Checksumming files/chunks helps most if you know about the filesystem and which files are still present, and it only makes a difference if you can access the lower-level storage devices to identify and fix problems.
> it is in no way generalizable to all the filesystems Linux has to support
Why not? If it's a standard LVM API then it's far more general than sucking everything into one filesystem like ZFS did. Much of this block-mapping interface already exists, though I'm not sure whether it covers this specific use case.
> This supposed need for ZFS to absorb all RAID/LVM/page-cache behavior into itself is a myth; what really happened is good old-fashioned NIH.
At the time ZFS was written (early 2000s) and released to the public (2006), this kind of integration was not common practice and the idea was somewhat novel / 'controversial'. Jeff Bonwick, ZFS co-creator, lays out their thinking:
* https://blogs.oracle.com/bonwick/rampant-layering-violation
Remember: this was a time when Veritas Volume Manager (VxVM) and other software still ruled the enterprise world.
* https://en.wikipedia.org/wiki/Veritas_Storage_Foundation
I debated some of this with Bonwick (and Cantrill, who really had no business being involved, but he's pernicious that way) at the time. That blog post is, frankly, a bit misleading. The storage "stack" isn't really a stack; it's a DAG: multiple kinds of devices, multiple filesystems plus raw block users (yes, they still exist and sometimes even have reason to), and multiple kinds of functionality in between. An LVM API allows some of this to have M users above and N providers below, for M+N total connections instead of M*N (e.g. four filesystems over three volume providers is 4+3=7 interfaces rather than 4*3=12). To borrow Bonwick's own condescending turn of phrase, that's math. The "telescoping" he mentions works fine when your storage stack really is a stack, which might have made sense in a not-so-open Sun context, but in the broader world, where multiple options are available at every level, it's still bad engineering.
The fact that RAID, LVM, etc. are traditionally not part of the filesystem is just an accident of history: no one wanted to rewrite their single-disk filesystems once they needed to support multiple disks. The fact that administering storage is so uniquely hard is a direct result of that.
However it happened, modularity is still a good thing. It allows multiple filesystems (and other things that aren't quite filesystems) to take advantage of the same functionality, even concurrently, instead of each reinventing a slightly different and likely inferior wheel. It should not be abandoned lightly. Is "modularity bad" really the hill you want to defend?
> I work on one of the four or five largest storage systems in the world
What would you recommend over ZFS for small-scale storage servers? XFS with mdraid?
I'd also love to hear your opinion on the Reiser5 paper.
> With a normal RAID, if the data at some address is different between the two disks, there's no way for the FS to tell which is correct, because the RAID code essentially just picks one; the FS can't even see the other copy.
That's a problem only with RAID1, only when copies=2 (granted, the most commonly used case), and only when the underlying device cannot report which sector has gone bad.