Comment by thatcks

20 days ago

The article is correct but it downplays an important limitation of ZFS scrubs when it talks about how they're different from fsck and chkdsk. As the article says (in different words), ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems. Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems, and as it stands today ZFS doesn't have anything that either checks or corrects these. Sometimes you find them through incorrect results; sometimes you discover they exist through ZFS assertion failures triggering kernel panics.

(We run ZFS in production and have not been hit by these issues, at least not that we know about. But I know of some historical ZFS bugs in this area and mysterious issues that AFAIK have never been fully diagnosed.)

mustache_kimono 20 days ago

    "Scrubs differ significantly from traditional filesystem checks. Tools such as fsck or chkdsk examine logical structures and attempt to repair inconsistencies related to directory trees, allocation maps, reference counts, and other metadata relationships. ZFS does not need to perform these operations during normal scrubs because its transactional design ensures metadata consistency. Every transaction group moves the filesystem from one valid state to another. The scrub verifies the correctness of the data and metadata at the block level, not logical relationships."

> ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems

A scrub literally reads the object from disk. And, for each block, the checksums are read up the tree. The object is therefore guaranteed to be correct and consistent at least re: the tree of blocks written.

> Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems

Can you give a more concrete example of what you mean? It sounds like you have some experience with ZFS, but "ZFS doesn't have an fsck" is also some truly ancient FUD, so you will forgive my skepticism.

I'm willing to believe that you request an object and ZFS cannot return that object because of ... a checksum error or a read error in a single disk configuration, but what I have never seen is a scrub that indicates everything is fine, and then reads which don't return an object (because scrubs are just reads themselves?).

Now, are things like pool metadata corruption possible in ZFS? Yes, certainly. I'm just not sure fsck would or could help you out of the same jam if you were using XFS or ext4. AFAIK fsck may repair inconsistencies but I'm not sure it can repair metadata any better than ZFS can?

magicalhippo 20 days ago
> Can you give a more concrete example of what you mean?
There's been several instances. For example, the send/receive code has had bugs leading to cases[1] where the checksum and hence scrub look fine but the data is not.
edit: the recent block cloning has also had some issues, eg[2][3].
I'm pretty sure it's also possible for hardware errors like bad memory to cause the data to get corrupted but the checksum gets computed on the corrupted data, thus it looks ok when scrubbed.
[1]: https://github.com/openzfs/zfs/issues/4809
[2]: https://github.com/openzfs/zfs/issues/15526
[3]: https://github.com/openzfs/zfs/issues/15933
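The bad-memory case can be sketched in a few lines of Python. This is a toy model, not ZFS code: `write_block` and `scrub_ok` are hypothetical helpers, and CRC32 stands in for whatever checksum the pool actually uses. The point is only that a checksum computed after the corruption happens will faithfully vouch for the corrupted bytes.

```python
import zlib

def write_block(data: bytes) -> dict:
    # Hypothetical write path: the checksum covers whatever bytes arrive here.
    return {"data": data, "checksum": zlib.crc32(data)}

def scrub_ok(block: dict) -> bool:
    # A scrub-style check: re-read the data and compare against the stored checksum.
    return zlib.crc32(block["data"]) == block["checksum"]

intended = b"hello world"

# Simulate a single bit flip in RAM *before* the checksum is computed.
in_memory = bytearray(intended)
in_memory[0] ^= 0x01

stored = write_block(bytes(in_memory))

scrub_passes = scrub_ok(stored)             # checksum matches the stored bytes
data_matches = stored["data"] == intended   # but the stored bytes are wrong

print(scrub_passes, data_matches)
```

Here the scrub-style check passes while the stored data differs from what the application meant to write.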
- mustache_kimono 20 days ago
  
  > There's been several instances.
  I think you're missing the second part of the parent's point that I take issue with, which is that this is not just a bug that a scrub wouldn't find; it must also be a bug which an fsck would find.
  The parent's point is -- ZFS should have an fsck tool because an fsck does something ZFS cannot do by other means. I disagree. Yes, ZFS has bugs like any filesystem. However, I'm not sure an fsck tool would make that situation better?
  
- SubjectToChange 20 days ago
  
  I like how ZFS doesn’t have “bugs”, it has “defects”.
agapon 20 days ago
Generally, it's possible to have data which is not corrupted but which is logically inconsistent (incorrect).
Imagine that a directory ZAP has an entry that points to a bogus object ID. That would be an example. The ZAP block is intact but its content is inconsistent.
Such things can only happen through a logical bug in ZFS itself, not through some external force. But bugs do happen.
If you search through OpenZFS bugs you will find multiple instances. Things like leaking objects or space, etc. That's why zdb now has support for some consistency checking (but not for repairs).
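A minimal toy sketch of that distinction, in Python. None of this is ZFS code or the ZAP on-disk format: `make_block`, `scrub`, and `check_directory` are hypothetical names, and a JSON dictionary stands in for a directory. It shows a block that passes checksum verification while one of its entries points to an object ID that does not exist.

```python
import hashlib
import json

def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def make_block(content: dict) -> dict:
    # Serialize the content and store its checksum alongside it.
    payload = json.dumps(content, sort_keys=True).encode()
    return {"payload": payload, "checksum": checksum(payload)}

def scrub(blocks: dict) -> list:
    # Block-level check: report blocks whose payload no longer matches its checksum.
    return [name for name, b in blocks.items()
            if checksum(b["payload"]) != b["checksum"]]

def check_directory(dir_block: dict, objects: dict) -> list:
    # Logical check: report directory entries that point to unknown object IDs.
    entries = json.loads(dir_block["payload"])
    return sorted(name for name, obj_id in entries.items()
                  if obj_id not in objects)

# Two real objects, and a directory with one entry pointing to a bogus ID (99).
objects = {1: "file a", 2: "file b"}
blocks = {"dir": make_block({"a.txt": 1, "b.txt": 2, "c.txt": 99})}

scrub_errors = scrub(blocks)                        # block-level view
dangling = check_directory(blocks["dir"], objects)  # logical view

print(scrub_errors)
print(dangling)
```

The block-level pass reports nothing, because the directory block was written intact; only the logical pass notices the dangling entry.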
- mustache_kimono 20 days ago
  
  > Imagine that a directory ZAP has an entry that points to a bogus object ID. That would be an example. The ZAP block is intact but its content is inconsistent.
  The above is interesting and fair enough, but a few points:
  First, I'm not sure that makes what seems to be the parent's point -- that scrub is an inadequate replacement for an fsck.
  Second, I'm really unsure if your case is the situation the parent is referring to. Parent seems to be indicating actual data loss is occurring. Not leaking objects or space or bogus object IDs. Parent seems to be saying she/he scrubs with no errors and then when she/he tries to read back a file, oops, ZFS can't.
  
thatcks 20 days ago
Two examples that I can find are https://github.com/openzfs/zfs/issues/7910, where very old versions of ZFS appear to have quietly written slightly incorrect ACL information, and https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/190... where Ubuntu 21.10 shipped with a bug that created corrupted ZFS filesystems. I believe https://www.illumos.org/issues/9847 may be another example of this, although less severe, where ZFS leaked disk space under some circumstances.
- mustache_kimono 20 days ago
  
  > Two examples that I can find
  I think you may be misreading my point above. I am not arguing ZFS doesn't have bugs. That's nuts. I am arguing that the bug the parent says he has would be an extraordinary bug.
  This is not just a bug that a scrub wouldn't find, but also it is a bug which an fsck would find. And it is not just a bug in the spacemaps or other metadata, but the parent's claim is this is a bug which a scrub, which is just a read, wouldn't see, but a subsequent read would reveal.
  
- E39M5S62 20 days ago
  
  Ubuntu shipped with a bug that they introduced by way of a very badly done patch. While I get your point, I don't think it's fair to use Ubuntu as a source - they're grossly incompetent when it comes to handling ZFS.
ori_b 20 days ago
Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.
- mustache_kimono 20 days ago
  
  > Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.
  A few things: 1) Is this an actual ZFS issue you encountered or is this a hypothetical? 2) And -- you don't imagine this would be discovered during a scrub? Why not? 3) But -- you do imagine it would be discovered and repaired by an fsck instead? Why so? 4) If so, wouldn't this just be a bug, like a fsck, not some fundamental limitation of the system?
  FWIW I've never seen anything like this. I have seen Linux plus a flaky ALPM implementation drop reads and writes. I have seen ZFS notice at the very same moment when the power dropped via errors in `zpool status`. I do wonder if ext4's fsck or XFS's fsck does the same when someone who didn't know any better (like me!) sets the power management policy to "min_power" or "med_power_with_dipm".
  

wereHamster 20 days ago

A loooong time ago (OpenSolaris days) I had a system that had corrupted its zfs. No fsck was available because the developers claimed (maybe still do) that it's unnecessary.

I had to poke around the raw device (with dd and such) to restore the primary superblock with one of the copies (that zfs keeps in different locations on the device). So clearly the zfs devs thought about the possibility of a corrupt superblock, but didn't feel the need to provide a tool to compare the superblocks and restore one from the other copies. That was the point when I stopped trusting zfs.

Such arrogance…

throw0101a 20 days ago
> So clearly the zfs devs thought about the possibility of a corrupt superblock, but didn't feel the need to provide a tool to compare the superblocks and restore one from the other copies.
This mailing list post from 2008 talks about using zdb(8) to mark certain uberblocks as invalid so another one would be used:
* https://zfs-discuss.opensolaris.narkive.com/Tx4FaUMv/need-he...
ZDB = ZFS debugger. It's been there since the original Solaris release of ZFS.
> That was the point when I stopped trusting zfs.
As opposed to trusting other file systems and volume managers, which do not have checksums, and so you wouldn't even know about the problem in the first place?
- rincebrain 19 days ago
  
  That's not using zdb to change anything - it's readonly, all the time. The person reached out and used dd on the disk to corrupt the copies of the uberblock with bad data so that ZFS would be forced to use the older ones (what zpool import -T does, basically, but doing it the hard way).
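A rough Python sketch of why both approaches end up in the same place. This is a simplified model under stated assumptions, not how the import code is actually written: `Uberblock`, `pick_uberblock`, and the `max_txg` parameter are hypothetical, `valid` stands in for "the checksum verifies", and the rewind is modelled as a simple upper bound on the transaction group.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Uberblock:
    txg: int      # transaction group this uberblock belongs to
    valid: bool   # stand-in for "checksum verifies"

def pick_uberblock(ring: List[Uberblock],
                   max_txg: Optional[int] = None) -> Optional[Uberblock]:
    # Choose the newest valid uberblock, optionally ignoring anything newer than max_txg.
    candidates = [ub for ub in ring
                  if ub.valid and (max_txg is None or ub.txg <= max_txg)]
    return max(candidates, key=lambda ub: ub.txg, default=None)

ring = [Uberblock(100, True), Uberblock(101, True), Uberblock(102, True)]

# Normal case: the newest valid uberblock wins.
newest = pick_uberblock(ring)

# "The hard way": overwrite the newest copy so it no longer verifies.
ring[2].valid = False
after_dd = pick_uberblock(ring)

# "The easy way": leave the disk alone and ask for an older transaction group.
ring[2].valid = True
rewound = pick_uberblock(ring, max_txg=100)

print(newest.txg, after_dd.txg, rewound.txg)
```

In this model, damaging the newest copies and asking for an older transaction group both just shrink the set of candidates; the difference is that only the first one modifies the disk.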
fvv 20 days ago

Is that still the case even now with OpenZFS? What do you trust now?
barrkel 20 days ago
That's a fine fit of pique - and I once had an awkward file on one of my zfs pools, about three pools ago - but how does it leave you better off, if you want what zfs offers?
- Dylan16807 20 days ago
  
  > That's a fine fit of pique
  So you're rejecting a story about a real bug because...?
  > how does it leave you better off
  That's a really mercenary way to look at learning about your tools.
  But presumably they take smaller risks around zfs systems than they otherwise would.

p_l 20 days ago

In my experience[1], the fsck for a given filesystem will happily replicate the errors, sometimes in random ways, because often it cannot figure out which road to take in the face of inconsistency. If anything, OpenZFS built upon that by now documenting the previously deeply hidden option to "rewind" ZFS uberblock if the breakage is recent enough.

[1] I've seen a combination of an Ubuntu bug in packaging (of grub, of all things) and e2fsck nearly wipe a small company out of existence, because e2fsck ended up trusting the data it got from the superblock when it was not consistent.

phil21 20 days ago

> If anything, OpenZFS built upon that by now documenting the previously deeply hidden option to "rewind" ZFS uberblock if the breakage is recent enough.
One of the most "wizardry" moments in my career I've personally witnessed was a deep-level ZFS expert (core OpenZFS developer) we had on retainer come in during a sev0 emergency and rapidly diagnose/rollback a very broken ZFS filesystem to a previous version from a few hours before the incident happened.
This was entirely user error (an admin connected redundant ZFS "heads" to the same JBOD in an incorrect manner so both thought they were primary and both wrote to the disks) that we caught more or less immediately so the damage was somewhat limited. At the time we thought we were screwed and would have to restore from the previous day's backup with a multi-day (at best) time to repair.
This was on illumos a few years after the Solaris fork, so I don't think this feature was documented at the time. It certainly was a surprise to me, even though I knew that "in theory" such capability existed. The CLI incantations though were pure wizardry level stuff, especially watching it in real time with terminal sharing with someone who very much knew what they were doing.