Comment by rekoil
5 days ago
Yeah, it's a pretty huge caveat to be honest.
Da1 Db1 Dc1 Pa1 Pb1
Da2 Db2 Dc2 Pa2 Pb2
Da3 Db3 Dc3 Pa3 Pb3
___ ___ ___ Pa4 Pb4
___ represents free space. After expansion by one disk you would logically expect something like:
Da1 Db1 Dc1 Da2 Pa1 Pb1
Db2 Dc2 Da3 Db3 Pa2 Pb2
Dc3 ___ ___ ___ Pa3 Pb3
___ ___ ___ ___ Pa4 Pb4
But as I understand it, it would actually expand to:
Da1 Db1 Dc1 Dd1 Pa1 Pb1
Da2 Db2 Dc2 Dd2 Pa2 Pb2
Da3 Db3 Dc3 Dd3 Pa3 Pb3
___ ___ ___ ___ Pa4 Pb4
Where the Dd1-3 blocks are just wasted. Meaning by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of usable storage originally, and you have 4TB free before expansion; you would have 5TB free after expansion.
Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.
ZFS RAID-Z does not have parity disks. The parity and data are interleaved to allow data reads to be done from all disks rather than just the data disks.
The slides here explain how it works:
https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.
I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap their heads around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, so that any disk can die (up to the parity level) without losing data.
I've been of two minds on the persistent myth of "parity disks" but I usually ignore it, because it's a convenient lie that at least helps people understand their data is safe. It's similar to the way raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.
Unless I misunderstood you, you're describing more how classical RAID would work. The RAID-Z expansion works like you note you would logically expect. You added a drive with four blocks of free space, and you end up with four blocks more of free space afterwards.
You can see this in the presentation[1] slides[2].
The reason this is sub-optimal post-expansion is that, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.
Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then
would, after RAID-Z expansion, become
I.e., you added a disk with 3 new blocks, and so total free space after is 1+3 = 4 blocks.
However if the same data was written in the post-expanded vdev configuration, it would have become
I.e., you'd have 6 free blocks, not just 4 blocks.
Of course this doesn't apply to writes which end up taking less than the maximal stripe width.
[1]: https://www.youtube.com/watch?v=tqyNHyq0LYM
[2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10 disk raid-z2 vdev that is ashift=12 for 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there are no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.
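To make the arithmetic above concrete, here is a rough sketch in Python. It is a simplified model, not how ZFS actually allocates: it ignores compression, allocation padding and record-size effects, and the function name raidz_blocks is made up for illustration. It also assumes the expansion adds exactly one disk (10 wide to 11 wide), which is what gives the 4 parity blocks mentioned above.

```python
import math

def raidz_blocks(file_bytes, disks, parity, ashift=12):
    """Simplified model: (data_blocks, parity_blocks) for a single write.

    Hypothetical helper for illustration only. Each stripe holds up to
    (disks - parity) data blocks plus `parity` parity blocks, and a short
    write only uses as many stripes as it needs.
    """
    sector = 1 << ashift                      # ashift=12 -> 4K columns
    data = math.ceil(file_bytes / sector)     # data blocks needed
    stripes = math.ceil(data / (disks - parity))
    return data, stripes * parity

# 4K file: same cost before and after expansion
print(raidz_blocks(4 * 1024, disks=10, parity=2))    # (1, 2)
print(raidz_blocks(4 * 1024, disks=11, parity=2))    # (1, 2)

# 72K file: 3 stripes at 10 wide, 2 stripes at 11 wide
print(raidz_blocks(72 * 1024, disks=10, parity=2))   # (18, 6)
print(raidz_blocks(72 * 1024, disks=11, parity=2))   # (18, 4)
```

The 72K file only gets the cheaper layout if it is rewritten after the expansion; data already on disk keeps the 6 parity blocks it was written with.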
There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
What are the errors? I tried to show exactly what you talk about.
edit: ok, I didn't consider the exact locations of the parity, I was only concerned with space usage.
The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre and post expansion, the last being a partial stripe, but when written in the 4+2 setup they only need 2 full stripes, leading to more total free space.
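For anyone who wants to check the block counts, here is a small Python sketch of that example. It assumes 3 blocks per disk (as in "a disk with 3 new blocks" above) and counts space only; it ignores where parity actually lands and any allocation padding. The function names are made up for illustration.

```python
import math

def used_blocks(data_blocks, width, parity):
    """Blocks consumed (data + parity) when written at the given stripe width."""
    stripes = math.ceil(data_blocks / (width - parity))
    return data_blocks + stripes * parity

def free_blocks(disks, blocks_per_disk, used):
    """Free blocks left on a vdev with the given number of disks."""
    return disks * blocks_per_disk - used

ROWS = 3  # assumed blocks per disk in the example

old_layout = used_blocks(8, width=5, parity=2)   # 3+2: 8 data + 3 stripes * 2 parity = 14
new_layout = used_blocks(8, width=6, parity=2)   # 4+2: 8 data + 2 stripes * 2 parity = 12

print(free_blocks(5, ROWS, old_layout))  # before expansion: 1
print(free_blocks(6, ROWS, old_layout))  # after expansion, old data untouched: 4
print(free_blocks(6, ROWS, new_layout))  # same data rewritten at 4+2: 6
```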