Comment by Glemkloksdjf
14 hours ago
Could you elaborate?
Did your HDDs then write broken data?
Do you use ZFS? I think that should have indicated some HDD issues, independent of a transfer-speed counter or something like that.
Or could it be that the backplane was just not good at handling SAS?
Yes, I am using ZFS (RAID-Z2), which is exactly why I know the data remained intact despite the transmission errors. Here is the breakdown of what happened during my migration 5 weeks ago and why the SAS protocol made the difference:
> Did the HDDs write broken data?
No. ZFS handles end-to-end data integrity via checksums. If the data had been corrupted during the transfer, ZFS would have flagged it as CKSUM errors in `zpool status`. Because ZFS validates the data after it crosses the wire, I could be confident that the data written to disk was valid, or the write would have been rejected/retried. However, ZFS never noticed any errors, because the SATA/SAS error correction in the drive firmware was (still) able to handle this by retransmitting the affected frames.
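For anyone who wants to keep an eye on this, here is a minimal sketch (Python, assuming a pool called `tank` and the standard `zpool status` column layout, so adjust the parsing for your system) that flags any non-zero READ/WRITE/CKSUM counters:

```python
#!/usr/bin/env python3
"""Rough sketch: flag non-zero READ/WRITE/CKSUM counters in `zpool status`."""
import subprocess

POOL = "tank"  # assumption: replace with your pool name


def check_pool_errors(pool: str) -> list[str]:
    out = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    ).stdout
    problems = []
    for line in out.splitlines():
        cols = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM
        if len(cols) >= 5 and cols[2].isdigit() and cols[3].isdigit() and cols[4].isdigit():
            read, write, cksum = int(cols[2]), int(cols[3]), int(cols[4])
            if read or write or cksum:
                problems.append(f"{cols[0]}: READ={read} WRITE={write} CKSUM={cksum}")
    return problems


if __name__ == "__main__":
    errors = check_pool_errors(POOL)
    if errors:
        print("ZFS reported device errors:")
        print("\n".join(errors))
    else:
        print("No READ/WRITE/CKSUM errors reported by zpool status.")
```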
> SAS vs. SATA
The issue was that my old backplane (BPN-SAS-743TQ) was electrically failing (signal degradation), but not dead yet. For years, my SATA drives had likely been silently retrying and masking these marginal signal issues. I knew I had issues for years and suspected bad cables (replacing several cables in an attempt to fix it cost me $500). Standard SATA SMART attributes often only flag UDMA_CRC_Error_Count once things get very bad, and they are generally less verbose about link stability towards the HBA. As soon as I plugged in the SAS drives, the mpt3sas driver and the drives themselves flooded dmesg and the SMART logs with Invalid DWORD count errors. SAS has robust Physical Layer (PHY) error reporting. It wasn't waiting for data corruption; it was alerting me that the signal integrity on the path between the HBA and the drive was compromised.
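If you want to watch those PHY counters yourself, here is a rough sketch that dumps the per-PHY link error counters the Linux SAS transport layer exposes under /sys/class/sas_phy/ for mpt3sas and similar HBAs. The attribute names below are what my kernels expose; treat them as an assumption and check your own sysfs tree:

```python
#!/usr/bin/env python3
"""Rough sketch: dump per-PHY SAS link error counters from /sys/class/sas_phy/."""
from pathlib import Path

# Assumed sysfs attribute names; verify against your kernel.
COUNTERS = (
    "invalid_dword_count",
    "running_disparity_error_count",
    "loss_of_dword_sync_count",
    "phy_reset_problem_count",
)


def read_phy_counters() -> None:
    phys = sorted(Path("/sys/class/sas_phy").glob("phy-*"))
    if not phys:
        print("No SAS PHYs found (is this an HBA with SAS devices attached?)")
        return
    for phy in phys:
        values = []
        for name in COUNTERS:
            attr = phy / name
            values.append(f"{name}={attr.read_text().strip()}" if attr.exists() else f"{name}=n/a")
        print(f"{phy.name}: " + " ".join(values))


if __name__ == "__main__":
    read_phy_counters()
```

Counters that keep climbing on a particular PHY even while the workload is idle are exactly the kind of early warning the SATA drives never gave me.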
> Was the backplane just incompatible?
No, it was physically faulty. I performed a standard isolation test: I swapped the drives between bays to see whether the errors followed the drives, and then did the same with the cables. During both tests (drives & cables), the errors stayed with the same specific bays on the backplane. This proved the backplane was introducing signal noise on those bays. The SAS drives screamed about this noise immediately, whereas the SATA drives had been tolerating (and hiding) it.
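The bookkeeping behind that test is trivial, but for completeness, here is a sketch of how I tallied errors per bay versus per drive serial across the swap rounds. The numbers are made-up placeholders; in practice they came from dmesg and the PHY counters after each swap:

```python
#!/usr/bin/env python3
"""Rough sketch: do link errors follow the bay or the drive across swaps?"""
from collections import defaultdict

# (round, bay, drive_serial, error_count) -- hypothetical placeholder data
observations = [
    (1, "bay03", "SERIAL-A", 412),
    (1, "bay07", "SERIAL-B", 0),
    (2, "bay03", "SERIAL-B", 398),  # drives swapped: errors stayed in bay03
    (2, "bay07", "SERIAL-A", 0),
]

errors_by_bay = defaultdict(int)
errors_by_drive = defaultdict(int)
for _, bay, serial, count in observations:
    errors_by_bay[bay] += count
    errors_by_drive[serial] += count

print("Errors per bay:  ", dict(errors_by_bay))
print("Errors per drive:", dict(errors_by_drive))
# If one bay accumulates errors regardless of which drive or cable it gets,
# the backplane is the prime suspect, not the drive.
```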
To sum up, ZFS kept the data safe (logical layer), but switching to SAS enterprise drives exposed a rotting hardware component (physical layer) that consumer SATA drives were ignoring. If I had been using a hardware RAID card with SATA, I likely wouldn't have known until a drive dropped offline entirely.
I later read that backplanes do regularly degrade due to temperature stress and the like. I put in a replacement backplane, and that solved everything.