Comment by magicalhippo
4 days ago
Unless I misunderstood you, you're describing more how classical RAID would work. RAID-Z expansion works the way you note you'd logically expect: you added a drive with four blocks of free space, and you end up with four more blocks of free space afterwards.
You can see this in the presentation[1] and the slides[2].
The reason this is sub-optimal post-expansion is that, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.
Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then
Da1 Db1 Dc1 Pa1 Pb1
Da2 Db2 Dc2 Pa2 Pb2
Da3 Db3 Pa3 Pb3 ___
would after RAID-Z expansion become
Da1 Db1 Dc1 Pa1 Pb1 Da2
Db2 Dc2 Pa2 Pb2 Da3 Db3
Pa3 Pb3 ___ ___ ___ ___
I.e. you added a disk with 3 new blocks, so the total free space afterwards is 1+3 = 4 blocks.
However, if the same data were written in the post-expansion vdev configuration, it would have become
Da1 Db1 Dc1 Dd1 Pa1 Pb1
Da2 Db2 Dc2 Dd2 Pa2 Pb2
___ ___ ___ ___ ___ ___
I.e. you'd have 6 free blocks, not just 4.
Of course, this doesn't apply to writes which end up using less than the maximal stripe width.
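To make the space accounting concrete, here's a rough Python sketch of the same arithmetic (my own simplification matching the diagrams above, not the actual ZFS allocator; it treats parity as a fixed per-stripe overhead and ignores where the parity columns land):

    import math

    def blocks_used(data_blocks, data_width, parity):
        # data blocks plus 'parity' blocks for every (possibly partial) stripe
        stripes = math.ceil(data_blocks / data_width)
        return data_blocks + stripes * parity

    total_after_expansion = 6 * 3            # 6 disks x 3 rows = 18 blocks

    reflowed  = blocks_used(8, 3, 2)         # 8 + 3*2 = 14, old 3+2 ratio kept
    rewritten = blocks_used(8, 4, 2)         # 8 + 2*2 = 12, new 4+2 ratio

    print(total_after_expansion - reflowed)   # 4 free blocks after expansion
    print(total_after_expansion - rewritten)  # 6 free blocks if written fresh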
[1]: https://www.youtube.com/watch?v=tqyNHyq0LYM
[2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
Your diagrams have some flaws too. ZFS has a variable stripe size. Let's say you have a 10-disk raid-z2 vdev with ashift=12, i.e. 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there are no savings to be had from the new data:parity ratio. Now, let's assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio; in this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.
There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
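To put numbers on that, here's a rough sketch of the parity math for a single logical block (my own approximation, ignoring padding/skip sectors; the 11-disk figure is just an assumed post-expansion width, since the example doesn't say how many disks were added):

    import math

    def raidz_block(size_bytes, ashift, disks, nparity):
        # data sectors for the block, plus nparity parity sectors for every
        # group of (disks - nparity) data sectors, rounded up
        sector = 1 << ashift                     # ashift=12 -> 4K sectors
        data = math.ceil(size_bytes / sector)
        parity = nparity * math.ceil(data / (disks - nparity))
        return data, parity

    print(raidz_block(4 * 1024, 12, 10, 2))    # (1, 2):  4K file, 1 data + 2 parity
    print(raidz_block(72 * 1024, 12, 10, 2))   # (18, 6): 72K file on 10 disks
    print(raidz_block(72 * 1024, 12, 11, 2))   # (18, 4): same file at 11 disks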
What are the errors? I tried to show exactly what you're talking about.
edit: OK, I didn't consider the exact locations of the parity; I was only concerned with space usage.
The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre- and post-expansion, the last being a partial stripe, but when written in the 4+2 setup they only need 2 full stripes, leaving more total free space.