← Back to context

Comment by mitchty

10 years ago

As a counterpoint I have 6x3TB zfs raidz2 on freebsd at home. I resilver every month and only had one checksum error that turned out to be a cable going bad given it hasn't repeated.

Still agree with we need checksumming filesystems though. That and gcc ram to make the data written more trustworthy.

had one checksum error that turned out to be a cable going bad given it hasn't repeated

I wouldn't assume that it was a cable error. The SATA interface has a CRC check. So the odds are quite high that a single error would simply result in a retransmission.

Of course, a plethora of detected SATA CRC errors and resulting retransmissions means that an undetected error could readily slip thru. There should be error logs reporting on the occurrence of retransmission, but I'm not enough of a software person to know how possible / easy it is to get that information from a drive or operating system.

OTOH, as you mention later in your post, a single bit error in non-ECC RAM could easily result in a single block having a checksum error. Exactly what you saw!

  • Hard drives also have spare sectors, so if a defect is detected at one spot in the disk, it will probably never touch that spot again.

    Simply observing that an error only occurred once does almost nothing to narrow down the possible causes. You have to also be keeping track of all the error reporting facilities (SMART, PCIe AER, ECC RAM, etc.).