Understanding ZFS Scrubs and Data Integrity

5 days ago (klarasystems.com)

The article is correct but it downplays an important limitation of ZFS scrubs when it talks about how they're different from fsck and chkdsk. As the article says (in different words), ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems. Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems, and as it stands today ZFS doesn't have anything that either checks or corrects these. Sometimes you find them through incorrect results; sometimes you discover they exist through ZFS assertion failures triggering kernel panics.

(We run ZFS in production and have not been hit by these issues, at least not that we know about. But I know of some historical ZFS bugs in this area and mysterious issues that AFAIK have never been fully diagnosed.)

  •     "Scrubs differ significantly from traditional filesystem checks. Tools such as fsck or chkdsk examine logical structures and attempt to repair inconsistencies related to directory trees, allocation maps, reference counts, and other metadata relationships. ZFS does not need to perform these operations during normal scrubs because its transactional design ensures metadata consistency. Every transaction group moves the filesystem from one valid state to another. The scrub verifies the correctness of the data and metadata at the block level, not logical relationships."
    

    > ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems

    A scrub literally reads the object from disk. And, for each block, the checksums are read up the tree. The object is therefore guaranteed to be correct and consistent at least re: the tree of blocks written.

    > Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems

    Can you give a more concrete example of what you mean?

    • Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

      1 reply →

>HDDs typically have a BER (Bit Error Rate) of 1 in 1015, meaning some incorrect data can be expected around every 100 TiB read. That used to be a lot, but now that is only 3 or 4 full drive reads on modern large-scale drives. Silent corruption is one of those problems you only notice after it has already done damage.

While the advice is sound, this number isn't the right number for this argument.

That 10^15 number is for UREs, which aren't going to cause silent data corruption -- simple naive RAID style mirroring/parity will easily recover from a known error of this sort without any filesystem layer checksumming. The rates for silent errors, where the disk returns the wrong data that benefit from checksumming, are a couple of orders of magnitude lower.

>Most systems that include ZFS schedule scrubs once per month. This frequency is appropriate for many environments, but high churn systems may require more frequent scrubs.

Is there a more specific 'rule of thumb' for scrub frequency? What variables should one consider?

  • Once a month is fine ("/etc/cron.monthly/zfs-scrub"):

        #!/bin/bash
        #
        # ZFS scrub script for monthly maintenance
        # Place in /etc/cron.monthly/zfs-scrub
        
        POOL="storage"
        TAG="zfs-scrub"
        
        # Log start
        logger -t "$TAG" -p user.notice "Starting ZFS scrub on pool: $POOL"
        
        # Run the scrub
        if /sbin/zpool scrub "$POOL"; then
            logger -t "$TAG" -p user.notice "ZFS scrub initiated successfully on pool: $POOL"
        else
            logger -t "$TAG" -p user.err "Failed to start ZFS scrub on pool: $POOL"
            exit 1
        fi
        
        exit 0

    • That script might do with the "-w" parameter passed to scrub. Then "zpool scrub" won't return until the scrub is finished.

  • Once a month seems like a reasonable rule of thumb.

    But you're balancing the cost of the scrub vs the benefit of learning about a problem as soon as possible.

    A scrub does a lot of I/O and a fair amount of computing. The scrub load competes with your application load and depending on the size of your disk(s) and their read bandwidth, it may take quite some time to do the scrub. There's even maybe some potential that the read load could push a weak drive over the edge to failure.

    On my personal servers, application load is nearly meaningless, so I do an about monthly scrub from cron which I think will only scrub one zpool at a time per machine, which seems reasonable enough to me. I run relatively large spinning disks, so if I scrubbed on a daily basis, the drives would spend most of the day scrubbing and that doesn't seem reasonable. I haven't run ZFS in a work environment... I'd have to really consider how the read load impacted the production load and if scrubbing with limits to reduce production impact would complete in a reasonable amount of time... I've run some systems that are essentially alwayd busy and if a scrub would take several months, I'd probably only scrub when other systems indicate a problem and I can take the machine out of rotation to examine it.

    If I had very high reliability needs or a long time to get replacement drives, I might scrub more often?

    If I was worried about power consumption, I might scrub less often (and also let my servers and drives go into standby). The article's recommendation to scan at least once every 4 months seems pretty reasonable, although if you have seriously offline disks, maybe once a year is more approachable. I don't think I'd push beyond that, lots of things don't like to sit for a year and then turn on correctly.

  • The cost of a scrub is just a flurry of disk reads and a reduction in performance during a scrub.

    If this cost is affordable on a daily basis, then do a scrub daily. If it's only affordable less often, then do it less often.

    (Whatever the case: It's not like a scrub causes any harm to the hardware or the data. It can run as frequently as you elect to tolerate.)

  • Total pool size and speed. Less data scrubs faster, as do faster disks or disk topology (a 3 way stripe of nvme will scrub faster than a single sata ssd)

    For what its worth, I scrub daily mostly because I can. It's completely overkill, but if it only takes half an hour, then it can run in the middle of the night while I'm sleeping.