Comment by jeltz

4 days ago

Not on most database workloads. There zfs does not scale very well.

Percona and many others who benchmarked this properly would disagree with you. Percona found that ext4 and ZFS performed similarly when given identical hardware (with proper tuning of ZFS):

https://www.percona.com/blog/mysql-zfs-performance-update/

In this older comparison, where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it once tuning was done and an L2ARC was added:

https://www.percona.com/blog/about-zfs-performance/

This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.

That said, ZFS has a killer feature that ext4 and XFS do not have, which is low-cost replication. You can snapshot and send/recv without affecting system performance very much, so even where ZFS does not top every benchmark, such as on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.
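
To make the snapshot and send/recv workflow concrete, here is a rough sketch that only builds the command lines for an incremental replication step; it does not run anything. The dataset names (`tank/db`, `backup/db`), the host `backup-host`, and the `backup-<timestamp>` naming scheme are all made up for illustration, and a real setup would also need snapshot pruning and error handling.

```python
from __future__ import annotations

import shlex
from datetime import datetime, timezone


def snapshot_name(dataset: str, when: datetime) -> str:
    """Build a snapshot name like tank/db@backup-20240102T030405Z.

    The "backup-<timestamp>" scheme is a hypothetical convention, not a ZFS rule.
    """
    return f"{dataset}@backup-{when.strftime('%Y%m%dT%H%M%SZ')}"


def snapshot_cmd(snapshot: str) -> list[str]:
    """Command that takes the snapshot (atomic at the filesystem level)."""
    return ["zfs", "snapshot", snapshot]


def incremental_send_pipeline(
    prev_snap: str, new_snap: str, remote_host: str, remote_dataset: str
) -> str:
    """Shell pipeline that ships only the blocks changed since prev_snap.

    `zfs send -i` produces the incremental stream and `zfs recv` applies it
    on the receiving side. Transport over ssh is an assumption of this sketch.
    """
    send = ["zfs", "send", "-i", prev_snap, new_snap]
    recv = ["ssh", remote_host, "zfs", "recv", remote_dataset]
    return f"{shlex.join(send)} | {shlex.join(recv)}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    prev = "tank/db@backup-20240101T030405Z"
    new = snapshot_name("tank/db", datetime.now(timezone.utc))
    print(shlex.join(snapshot_cmd(new)))
    print(incremental_send_pipeline(prev, new, "backup-host", "backup/db"))
```

Because the snapshot is taken atomically, the copy on the receiving side should look to the database like the state after a sudden power loss, so it relies on the database's own crash recovery. Whether that is acceptable depends on the database and its configuration.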

  • There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, especially on high speed NVMe.

    The article provides great insight into optimizing ZFS, but using an EBS volume as the store (with pretty poor IOPS) and then giving the NVMe as a metadata cache only to ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we set up XFS with its metadata and log on a RAMFS it will beat ZFS :)

    • L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in-memory Adaptive Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.

      If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve the IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at a price point that XFS cannot match is competition, not cheating.

      ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:

      https://www.percona.com/blog/zfs-for-mongodb-backups/

      Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?

  • Refuting the "it doesn't scale" argument with data from a blog that showcases a single workload (TPC-C) with a 200G, 10-table dataset (small to medium) on a 2 vCPU (wtf) machine with 16 connections (no thread pool, so overprovisioned) is not much of a demonstration of scale at all. It's a lost experiment if anything.

    • The guy did not have any data to justify his claims of not scaling. Percona’s data says otherwise. If you don’t like how they got their data, then I advise you to do your own benchmarks.
