Comment by montjoy

6 years ago

Last time I used ZFS, write performance was terrible compared to an ordinary RAID5. IIRC, writes in a raidz are always limited to a single disk's performance. The only way to get better write speed is to combine multiple raidzs - which means you need a boatload of disks.
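
For illustration, "combine multiple raidzs" means striping several raidz vdevs inside one pool; ZFS spreads writes across all top-level vdevs, so each extra raidz group adds roughly another group's worth of streaming write throughput. A minimal sketch - the pool name and disk names are made up, and the two layouts are alternatives for the same 12 disks, not steps to run in sequence:

    # Layout A: one wide raidz2 vdev - streaming writes behave like a single group
    zpool create tank raidz2 disk0 disk1 disk2 disk3 disk4 disk5 \
                             disk6 disk7 disk8 disk9 disk10 disk11

    # Layout B: the same 12 disks as two raidz2 vdevs striped in one pool -
    # writes are spread across both groups
    zpool create tank \
        raidz2 disk0 disk1 disk2 disk3 disk4  disk5 \
        raidz2 disk6 disk7 disk8 disk9 disk10 disk11

    zpool status tank        # show the vdev layout
    zpool iostat -v tank 5   # watch per-vdev write throughput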

We had a bunch of Thumpers (SunFire X4500) with 48 disks at work, running ZFS on Solaris. It was dog slow and awful, and tuning performance was complicated and took ages. One had to use just the right disks in just the right order in raidzs, with striping over them. Swap in a hotspare: things slowed to a crawl (i.e. not even Gbit/s).

After EoL a colleague installed Linux with dmraid, LVM and xfs on the same hardware: much faster, more robust. Sorry, I don't have numbers around anymore; the stuff has since been trashed.

Oh, and btw, snapshots and larger numbers of filesystems (which Sun recommended instead of the missing per-user quota support) also slowed things down to a crawl. ZFS is nice on paper and maybe nice to play with - definitely simpler to use than anything else. But performance-wise it sucked big time, at least on Solaris.
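
For context, the Sun-era workaround for the missing per-user quotas was one ZFS filesystem per user, each with its own dataset quota - which is how sites ended up with hundreds of datasets. A rough sketch, with made-up pool and user names:

    # One dataset per user instead of per-user quotas inside a single filesystem
    zfs create tank/home
    zfs create tank/home/alice
    zfs set quota=10G tank/home/alice
    zfs create tank/home/bob
    zfs set quota=10G tank/home/bob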

  • ZFS, on Solaris, not robust?

    ZFS for “play”?!

    This... is just plain uninformed.

    Not just me and my employer, but many (many) others rely on ZFS for critical production storage, and have done so for many years.

    It’s actually very robust on Linux as well - the fact that FreeBSD has started to use the ZoL code base is quite telling.

    Would FreeBSD be in the “play” and “not robust” category as well, hanging out together with Solaris?

    Will it beat everything else in terms of writes/s? Most likely not - although by staying away from de-dup, giving it enough RAM, and adhering to the pretty much general recommendation to use mirror vdevs only in your pools, it can be competitive (a minimal command sketch follows at the end of this thread).

    Something solid with data integrity guarantees? You can’t beat ZFS, imo.

    • > Something solid with data integrity guarantees? You can’t beat ZFS, imo.

      This reminds me. We had one file server, used mostly for package installs, that used ZFS for storage. One day our Java package stops installing. The package had become corrupt. So I force a manual ZFS scrub. No dice. Ok, fine, I'll just replace the package. It seems to work, but the next day it's corrupt again. Weird. Ok, I'll download the package directly from Oracle again. The next day it's corrupt again. I download a slightly different version. No problems. I grab the previous problematic package and put it in a different directory (with no other copies on the file system) - again it becomes corrupt.

      There was something specific about the Java package that ZFS just thought it needed to “fix”. If I had to guess, it was getting the file hash confused. I'm pretty sure we had dedupe turned on, so that may have factored into it.

      Anyway that’s the first and only time I’ve seen a file system munge up a regular file for no reason - and it was on ZFS.

    • Performance wasn't robust, especially on dead disks and rebuilds, but also on pools with many (>100) filesystems or snapshots. Performance would often degrade heavily and unpredictably on such occasions. We didn't lose data more often than with other systems.

      "play" comes from my distinct impression that the most vocal ZFS proponents are hobbyists and admins herding their pet servers (as opposed to cattle). ZFS comes at low/no cost nowadays and is easy to use, therefore ideal in this world.

  • Sounds like you turned on dedupe, or had an absurdly wide stripe size. You do need to match your array structure to your needs as well as tune ZFS.

    Our backup servers (45 disks, 6-wide Z2 stripes) easily handle wire-speed 10G with a 32G ARC.

    And you're just wrong about snapshots and filesystem counts.

    ZFS is no speed demon, but it performs just fine if you set it up correctly and tune it.

    • Stripe size could have been a problem, though we just went with the default there, AFAIR. The first tries mostly just followed the Sun docs; later we only changed things until performance was sufficient. Dedupe wasn't even implemented back then.

      Maybe you also don't see as massive an impact because your hardware is a lot faster. X4500s were predominantly meant to be cheap, not fast: no cache, insufficient RAM, slow controllers, etc.

  • ZFS performs quite well if you give it boatloads of RAM. It uses its own cache layer, and eats RAM like hotcakes. XFS OTOH is as fast as the hardware can go with any amount of RAM.

    • Sort of. But no snapshots.

      Wanna use LVM for snapshots? That's a ~33% performance hit on the entire LV per snapshot, due to how they're implemented.

      ZFS? ~1% hit. I've never been able to see any difference with the workloads I run, whereas with LVM it was pervasive and inescapable.

  • "After EoL a colleague installed Linux with dmraid, LVM and xfs on the same hardware: much faster, more robust."

    Please let me know which company this is, so I can ensure that I never end up working there by accident. Much obliged in advance, thank you kindly.

  • dmraid RAID5/6 loses data, sometimes catastrophically, in normal failure scenarios that the ZFS equivalent handles just fine. If a sector goes bad between your last scrub and a disk failure (which is pretty much inevitable with modern disk sizes), you're screwed.
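
To put the "mirror vdevs only, skip dedup" recommendation from earlier in this thread into commands - a minimal sketch with made-up pool and disk names:

    # Pool built from mirror pairs; writes stripe across all pairs
    zpool create tank mirror disk0 disk1 mirror disk2 disk3
    zpool add    tank mirror disk4 disk5   # grow later by adding more pairs

    # Dedup is off by default; check it, and switch it off if it was enabled
    # (this only affects data written from now on)
    zfs get dedup tank
    zfs set dedup=off tank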

> Writes in a raidz are always limited to a single disk’s performance

what? no. why would that be the case? You lose a single disk's performance due to the parity.

just from my personal NAS I can tell you that I can do transfers from my scratch drive (an NVMe SSD) to the storage array at more than twice the speed of any individual drive in the array... and that's with rsync, which is notably slower than a "native" mv or cp.

The one thing I will say is that it does struggle to keep up with NVMe SSDs; otherwise I've always seen it run at drive speed on anything spinning, no matter how many drives.

  • > what? no. why would that be the case? You lose a single disk's performance due to the parity.

    I think they're probably referring to the write performance of a raidz vdev being constrained by the performance of the slowest disk within the vdev.

    • true, if you have 7 fast disks and one slow disk in a raidz, you get 7 x the slow disk's performance.
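
A back-of-the-envelope sketch of that effect (all numbers are made up; "tank" is a placeholder pool name):

    # 8-disk raidz1: 7 fast disks at ~200 MB/s, 1 slow disk at ~80 MB/s.
    # A full-stripe write touches all 8 disks and finishes at the slowest
    # member's pace, so streaming writes land around
    #   7 data disks x 80 MB/s ≈ 560 MB/s
    # rather than 7 x 200 MB/s - but still far more than one disk's worth.
    zpool iostat -v tank 5   # watch per-disk and per-vdev rates live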