Comment by ryao

4 days ago

Percona and many others who benchmarked this properly would disagree with you. Percona found that ext4 and ZFS performed similarly when given identical hardware (with proper tuning of ZFS):

https://www.percona.com/blog/mysql-zfs-performance-update/

In this older comparison where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it when tuning was done and a L2ARC was added:

https://www.percona.com/blog/about-zfs-performance/

This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.

That said, ZFS has a killer feature that ext4 and XFS do not have, which is low cost replication. You can snapshot and send/recv without affecting system performance very much, so even in situations where ZFS is not at the top in every benchmark such as being on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.

There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, specially on high speed NVMe.

The article provides great insight into optimizing ZFS, but using an EBS volume as store (with pretty poor IOPS) and then giving the NVMe as metadata cache only for ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we store set XFS with metadata and log to a RAMFS it will beat ZFS :)

  • L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in memory Adaptative Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.

    If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at price point that XFS cannot match is competition, not cheating.

    ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:

    https://www.percona.com/blog/zfs-for-mongodb-backups/

    Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?

    • No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes. The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

      About the inherent advantages of ZFS like send/recv, I have nothing to say. I know how good they are. It's one reason I use ZFS.

      > If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do.

      What does proper testing here mean? And what does "if you scale it" mean? Genuinely. From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting. What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

      Edit: P4800x, actually. The flash disk are D5-P5530.

      1 reply →

Refuting the "it doesn't scale" argument with a data from a blog that showcases a single workload (TPC-C) with 200G+10tables dataset (small to medium) at 2vCPU (wtf) machine with 16 connections (no thread pool so overprovisioned) is not quite a definition of a scale at all. It's a lost experiment if anything.

  • The guy did not have any data to justify his claims of not scaling. Percona’s data says otherwise. If you don’t like how they got their data, then I advise you to do your own benchmarks.

    • It is based on data from internal benchmarks. Zfs is fine for database workloads but scales worse than Xfs based on my personal experience. It is unpublished benchmarks and I do not have access to any farm to win a discussion on the internet.

      1 reply →

    • I don't have anything to like or not to like. I'm not a user of ZFS filesystem. I'm just dismissing your invalid argumentation. Percona's data is nothing about the scale for reasons I already mentioned.

      1 reply →