Comment by uniqueuid

5 days ago

It's good to see that they were pretty conservative about the expansion.

Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.

That said, there is one tiny caveat people should be aware of:

> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).

I'm not sure that's really a caveat; it just means old data might be in a suboptimal layout. Even so, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.

  • I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.

    For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with a 55% storage efficiency rather than 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
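
    The 55% vs 66% figures can be checked with a little arithmetic. This is only a sketch under simplifying assumptions: every disk is normalised to 1 unit, the old vdev was completely full when it was expanded, old blocks keep the old ratio forever, and metadata and padding overhead are ignored. The function names are made up for illustration.

```python
from fractions import Fraction

def data_fraction(width: int, parity: int) -> Fraction:
    """Fraction of a full-width stripe that holds data rather than parity."""
    return Fraction(width - parity, width)

def blended_efficiency(old_width: int, new_width: int, parity: int,
                       fill: Fraction = Fraction(1)) -> Fraction:
    """Usable fraction of raw capacity after expansion when old blocks are never rewritten.

    `fill` is how full the old vdev was (as a fraction of its raw space)
    at the time of expansion. Each disk counts as 1 unit of raw space.
    """
    old_raw = old_width * fill       # raw space still holding old-ratio blocks
    new_raw = new_width - old_raw    # remaining raw space, written at the new ratio
    data = (old_raw * data_fraction(old_width, parity)
            + new_raw * data_fraction(new_width, parity))
    return data / new_width

ideal = data_fraction(3, 1)                # 3-disk raidz1 written from scratch
actual = blended_efficiency(2, 3, 1)       # full 2-disk raidz1 expanded to 3 disks
print(float(ideal), float(actual), float(ideal - actual))
```

    Under these assumptions the gap is 2/3 - 5/9 = 1/9, which is the roughly 11% quoted above. Passing a smaller `fill` (expanding before the vdev is full) narrows the gap.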

    • You have a couple options:

      1. Delete the snapshots and rewrite the files in place, as people do when they want to rebalance a pool.

      2. Use send/receive inside the pool.

      Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.


    • Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.

      Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do; you get massive fragmentation above ~85% full), real-world efficiency will be better.

    • It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and create it new, or create it with the ideal layout from the beginning.

      Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.

The caveat is very much expected; you should expect ZFS features not to rewrite existing blocks. Changes to settings only apply to new data, for example.

Yeah, it's a pretty huge caveat to be honest.

    Da1 Db1 Dc1 Pa1 Pb1
    Da2 Db2 Dc2 Pa2 Pb2
    Da3 Db3 Dc3 Pa3 Pb3
    ___ ___ ___ Pa4 Pb4

___ represents free space. After expansion by one disk you would logically expect something like:

    Da1 Db1 Dc1 Da2 Pa1 Pb1
    Db2 Dc2 Da3 Db3 Pa2 Pb2
    Dc3 ___ ___ ___ Pa3 Pb3
    ___ ___ ___ ___ Pa4 Pb4

But as I understand it, it would actually expand to:

    Da1 Db1 Dc1 Dd1 Pa1 Pb1
    Da2 Db2 Dc2 Dd2 Pa2 Pb2
    Da3 Db3 Dc3 Dd3 Pa3 Pb3
    ___ ___ ___ ___ Pa4 Pb4

Where the Dd1-3 blocks are just wasted. That means that by adding a new disk to the array you're only expanding free storage by 25%. So say you have 8TB disks for a total of 24TB of usable storage originally, with 4TB free before expansion: you would have 5TB free after expansion.

Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.

  • ZFS RAID-Z does not have parity disks. The parity and data are interleaved so that data reads can be served from all disks rather than just the data disks.

    The slides here explain how it works:

    https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

    Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.

    • To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.

      I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap their heads around the complicated topic of how it works, but all of them truly interleave data and parity across all disks, so no single disk is special.

      I've been of two minds on the persistent myth of "parity disks" but I usually ignore it, because it's a convenient lie to understand your data is safe, at least. It's also a little bit the same way that raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.

  • Unless I misunderstood you, you're describing more how classical RAID would work. The RAID-Z expansion works like you note you would logically expect. You added a drive with four blocks of free space, and you end up with four blocks more of free space afterwards.

    You can see this in the presentation[1] slides[2].

    This is sub-optimal post-expansion because, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.

    Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then

        Da1 Db1 Dc1 Pa1 Pb1
        Da2 Db2 Dc2 Pa2 Pb2
        Da3 Db3 Pa3 Pb3 ___
    

    would, after RAID-Z expansion, become

        Da1 Db1 Dc1 Pa1 Pb1 Da2
        Db2 Dc2 Pa2 Pb2 Da3 Db3 
        Pa3 Pb3 ___ ___ ___ ___
    

    Ie you added a disk with 3 new blocks, and so total free space after is 1+3 = 4 blocks.

    However if the same data was written in the post-expanded vdev configuration, it would have become

        Da1 Db1 Dc1 Dd1 Pa1 Pb1
        Da2 Db2 Dc2 Dd2 Pa2 Pb2
        ___ ___ ___ ___ ___ ___
    

    Ie, you'd have 6 free blocks not just 4 blocks.

    Of course this doesn't account for writes which end up taking less than the maximal stripe width.

    [1]: https://www.youtube.com/watch?v=tqyNHyq0LYM

    [2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
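
    The block counts in the diagrams above can be reproduced with a toy model. This is a sketch only: it treats the vdev as a row-major grid of equally sized blocks and assumes the 8 data blocks are one contiguous write, which is a simplification of the diagrams and not how ZFS allocates on disk. `reflow` and `allocated_blocks` are hypothetical helper names.

```python
import math

FREE = "___"

def reflow(blocks, new_width, rows):
    """Re-chunk an ordered list of allocated blocks across a wider grid."""
    padded = blocks + [FREE] * (rows * new_width - len(blocks))
    return [padded[r * new_width:(r + 1) * new_width] for r in range(rows)]

def allocated_blocks(data_blocks, width, parity):
    """Data plus parity blocks when a write is cut into stripes of at most width - parity data blocks."""
    stripes = math.ceil(data_blocks / (width - parity))
    return data_blocks + stripes * parity

# The 5-wide RAIDZ2 layout from the tweaked example, read row by row.
old = "Da1 Db1 Dc1 Pa1 Pb1 Da2 Db2 Dc2 Pa2 Pb2 Da3 Db3 Pa3 Pb3".split()

expanded = reflow(old, new_width=6, rows=3)
for row in expanded:
    print(" ".join(row))

free_after_reflow = sum(row.count(FREE) for row in expanded)
free_if_rewritten = 3 * 6 - allocated_blocks(8, width=6, parity=2)
print(free_after_reflow, free_if_rewritten)
```

    In this model the reflowed layout keeps all 14 old blocks (8 data, 6 parity), while rewriting the same 8 data blocks at the new width would need only 12.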

    • Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10-disk raid-z2 vdev that is ashift=12 for 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there are no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks, since 18 data blocks span three stripes of up to 8 data columns each. You would benefit from rewriting this to use the new data:parity ratio. In this case, assuming the vdev was expanded to 11 disks (9 data columns per stripe), you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.
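
      The counts above follow from a simple rule of thumb: one set of parity blocks per stripe, with each stripe holding at most (width - nparity) data blocks. A minimal sketch, assuming the expanded vdev is 11 disks wide (the width is not stated above) and ignoring any padding or other allocator details that real ZFS applies; `block_counts` is a made-up name:

```python
def block_counts(file_bytes, width, nparity, ashift=12):
    """Return (data_blocks, parity_blocks) for one write on a RAID-Z vdev.

    Simplified model: sector size is 2**ashift, and every stripe of up to
    (width - nparity) data blocks carries nparity parity blocks.
    """
    sector = 1 << ashift
    data = -(-file_bytes // sector)                 # ceiling division
    stripes = -(-data // (width - nparity))
    return data, stripes * nparity

KIB = 1024
for width in (10, 11):
    print(width, block_counts(4 * KIB, width, 2), block_counts(72 * KIB, width, 2))
```

      In this model the 4K file costs 2 parity blocks at either width, while the 72K file drops from 6 parity blocks to 4 only if it is rewritten at the wider layout.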

      There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
