
Comment by LtdJorge

4 days ago

There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, especially on high speed NVMe.

The article provides great insight into optimizing ZFS, but using an EBS volume as the store (with pretty poor IOPS) and then giving the NVMe as a metadata cache only to ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we set up XFS with its metadata and log on a RAMFS it will beat ZFS :)

L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in-memory Adaptive Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.
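To make the "victim cache" idea concrete, here is a toy Python sketch: whatever falls out of a small first tier is kept in a larger second tier instead of being dropped. The class and method names are made up for illustration, both tiers use plain LRU, and a hit in the second tier moves the entry back up. None of that is meant to describe how ZFS actually implements ARC or L2ARC (ARC is not plain LRU, for one).

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy victim cache (hypothetical, not ZFS code): entries evicted
    from the small first tier fall into a larger second tier."""

    def __init__(self, l1_size, l2_size):
        self.l1 = OrderedDict()  # stands in for the in-memory cache
        self.l2 = OrderedDict()  # stands in for the fast-device cache
        self.l1_size = l1_size
        self.l2_size = l2_size

    def get(self, key, load):
        """Return (value, source), where source is 'l1', 'l2' or 'disk'."""
        if key in self.l1:
            self.l1.move_to_end(key)  # mark as most recently used
            return self.l1[key], "l1"
        if key in self.l2:
            value = self.l2.pop(key)  # promote back to the first tier
            source = "l2"
        else:
            value = load(key)  # slow path: read from main storage
            source = "disk"
        self._insert_l1(key, value)
        return value, source

    def _insert_l1(self, key, value):
        self.l1[key] = value
        while len(self.l1) > self.l1_size:
            # The least recently used entry becomes the "victim" and
            # is handed to the second tier instead of being discarded.
            victim_key, victim_value = self.l1.popitem(last=False)
            self.l2[victim_key] = victim_value
            while len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)  # second tier is full too


cache = TwoTierCache(l1_size=2, l2_size=4)
for block in (1, 2, 3, 1):
    print(block, cache.get(block, load=lambda k: k * 10))
```

The fourth lookup is served from the second tier rather than from "disk", which is the whole point: the second tier only has to be faster than main storage, not as fast as RAM.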

If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at a price point that XFS cannot match is competition, not cheating.

ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:

https://www.percona.com/blog/zfs-for-mongodb-backups/

Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?

  • No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes. The point about the second paragraph was that using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

    About the inherent advantages of ZFS like send/recv, I have nothing to say. I know how good they are. It's one reason I use ZFS.

    > If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do.

    What does proper testing here mean? And what does "if you scale it" mean? Genuinely. From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting. What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

    Edit: P4800X, actually. The flash disks are D5-P5530.

    • > No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes.

      That makes sense.

      > The point about the second paragraph was that using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

      It is a balancing act. L2ARC is a feature ZFS has that XFS does not, but it is ridiculous to use a device that can fit the entire database as L2ARC. In that case you can just use that device directly, and keeping it as a cache for ZFS does not make for a fair or realistic comparison. Fast devices that can be used with tiered storage are generally too small to be used as main storage, since if you could use them as main storage, you would.

      With the caveat that the higher tier should be too small to be used as main storage, you can get a huge boost from being able to use it as cache in tiered storage, and that is why ZFS has L2ARC.

      > What does proper testing here mean? And what does "if you scale it" mean?

      Let me preface my answer by saying that doing good benchmarks is often hard, so I can't give a simple answer here. However, I can give a long answer.

      First, small databases that can fit entirely in RAM cache (be it the database's own userland cache or a kernel cache) are pointless to benchmark. In general, anything can run that well (since it is really running out of RAM as you pointed out). The database needs to be significantly larger than RAM.

      Second, when it comes to using tiered storage, the purpose of doing tiering is that the faster tier is either too small or too expensive to use for the entire database. If the database size is small enough that it is inexpensive to use the higher tier for general storage, then a test where ZFS gets the higher tiered storage for use as cache is neither fair nor realistic. Thus, we need to scale the database to a larger size such that the higher tier being only usable as cache is a realistic scenario. This is what I had in mind when I said "if you scale it".

      Third, we need to test workloads that are representative of real things. This part is hard and the last time I did it was 2015 (I had previously said 2016, but upon recollection, I realized it was likely 2015). When I did, I used a proprietary workload simulator that was provided by my job. It might have been from SPEC, but I am not sure.

      Fourth, we need to tune things properly. I wrote the following documentation years ago describing correct tuning for ZFS:

      https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

      https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

      At the time I wrote that, I omitted that tuning the I/O elevator can also improve performance, since there is no one size fits all advice for how to do it. Here is some documentation for that which someone else wrote:

      https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

      If you are using SSDs, you could probably just get away with setting each of the maximum asynchronous queue depth limits to something like 64 (or even 256) and benchmark that.
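As a rough illustration of the queue depth suggestion above, the Python sketch below builds a module options line for Linux. It assumes the limits meant here are the OpenZFS module parameters `zfs_vdev_async_read_max_active` and `zfs_vdev_async_write_max_active`; the function name is made up, and the value is only a starting point to benchmark, not a recommendation.

```python
# Assumption: these two OpenZFS module parameters are the "maximum
# asynchronous queue depth limits" referred to in the text.
ASYNC_MAX_ACTIVE_PARAMS = (
    "zfs_vdev_async_read_max_active",
    "zfs_vdev_async_write_max_active",
)


def modprobe_options(depth):
    """Build an 'options zfs ...' line setting every limit to `depth`."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    settings = " ".join(f"{name}={depth}" for name in ASYNC_MAX_ACTIVE_PARAMS)
    return f"options zfs {settings}"


# Print candidate lines for the two values mentioned in the text.
for depth in (64, 256):
    print(modprobe_options(depth))
```

On Linux such a line would conventionally go in a file under /etc/modprobe.d/ (for example zfs.conf), and the same parameters can be inspected under /sys/module/zfs/parameters/ once the module is loaded.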

      > From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting.

      In 2015 when I did database benchmarks, ZFS and XFS were given equal hardware. The hardware was a fairly beefy EC2 instance with 4x high end SSDs. MD RAID 0 was used under XFS while ZFS was given the devices in what was effectively a RAID 0 configuration. With proper tuning (what I described earlier in this reply), ZFS achieved 85% of XFS performance in that configuration. This was considered a win because of the previously stated advantage in performance during database backups. ZFS has since received performance improvements, which would probably narrow the gap. It now uses B-Trees internally to do operations faster and also now has redundant_metadata=most, which was added for database workloads.

      Anyway, on equal hardware in a general performance comparison, I would expect ZFS to lose to XFS, but not by much. ZFS' ability to use tiered storage and do low overhead backups is what would put it ahead.

      > What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

      You need to have a database whose size is so big that Optane storage is not practical to use for main storage. Then you need to set up ZFS with Optane storage as L2ARC. You can give regular flash drives to ZFS and XFS on MD RAID in a comparable configuration (RAID 0 to make life easier, although in practice you probably want to use RAID 10). You will want to follow best practices for tuning the database and filesystems (although from what I know, XFS has remarkably few knobs). You could give XFS the Optane devices to use for metadata and its journal for fairness, although I do not expect it to help XFS enough. In this situation, ZFS should win on performance.
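One way to picture the two setups described above is as command lines. The Python sketch below only builds and prints them; it does not run anything. The pool name, md device, and every /dev path are placeholders, and it shows just a striped pool with a cache device on the ZFS side and MD RAID 0 plus an external XFS log on the other side. Tuning and any separate XFS metadata placement are left out.

```python
import shlex


def zfs_commands(pool, data_devs, cache_devs):
    """Striped pool over the flash drives, fast devices added as L2ARC."""
    return [
        ["zpool", "create", pool, *data_devs],
        ["zpool", "add", pool, "cache", *cache_devs],
    ]


def xfs_commands(md_dev, data_devs, log_dev):
    """MD RAID 0 over the same drives, XFS log on the fast device."""
    return [
        ["mdadm", "--create", md_dev, "--level=0",
         f"--raid-devices={len(data_devs)}", *data_devs],
        ["mkfs.xfs", "-l", f"logdev={log_dev}", md_dev],
    ]


# Hypothetical device names; substitute the real ones before use.
flash = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]
optane = ["/dev/nvme4n1", "/dev/nvme5n1"]

for cmd in zfs_commands("tank", flash, optane):
    print(shlex.join(cmd))
for cmd in xfs_commands("/dev/md0", flash, optane[0]):
    print(shlex.join(cmd))
```

An XFS filesystem made with an external log also needs the matching `logdev=` option at mount time. Both setups would be created one at a time on the same drives, not side by side.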

      You would need to pick a database for this. One option would be PostgreSQL, which is probably the main open source database that people would scale to such levels. The pgbench tool likely could be used for benchmarking.

      https://www.postgresql.org/docs/current/pgbench.html

      You would need to pick a scaling factor that will make the database big enough and do a workload simulating a large number of clients (what is large is open to interpretation).
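A back-of-the-envelope way to pick the scaling factor is sketched below in Python. pgbench creates 100,000 rows in pgbench_accounts per unit of scale; the figure of roughly 16 MB on disk per unit is an assumption that should be checked on the actual system, and the multiples of RAM and fast-tier capacity, the client counts, and the database name are arbitrary illustrative choices.

```python
import math

# Assumption: rough on-disk size per unit of pgbench scale factor.
ASSUMED_MB_PER_SCALE_UNIT = 16


def pick_scale(ram_gib, fast_tier_gib, ram_multiple=4, fast_tier_multiple=2):
    """Scale factor giving a database well beyond RAM and the cache tier."""
    target_gib = max(ram_gib * ram_multiple, fast_tier_gib * fast_tier_multiple)
    return math.ceil(target_gib * 1024 / ASSUMED_MB_PER_SCALE_UNIT)


def init_command(scale, dbname):
    """pgbench initialization at the chosen scale."""
    return ["pgbench", "-i", "-s", str(scale), dbname]


def run_command(clients, threads, seconds, dbname):
    """Timed pgbench run with many concurrent clients."""
    return ["pgbench", "-c", str(clients), "-j", str(threads),
            "-T", str(seconds), dbname]


# Hypothetical machine: 64 GiB of RAM and 800 GiB of Optane as cache.
scale = pick_scale(ram_gib=64, fast_tier_gib=800)
print(" ".join(init_command(scale, "bench")))
print(" ".join(run_command(clients=64, threads=16, seconds=600, dbname="bench")))
```

The `max` encodes the earlier argument: the database has to outgrow both RAM and the fast tier, otherwise the cache device could simply have been used as main storage.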

      Finally, I should add that the default script used by pgbench is probably not very realistic for a database workload. A real database will have a good proportion of reads from select queries (at least 50%), while the default script does a write-mostly workload. It probably should be changed; how is an exercise best left for the reader. That is probably not the answer you want to hear, but I did say earlier in this reply that doing proper benchmarks is hard, and I do not know offhand how to adjust the script to be more representative of real workloads. That said, there is definite utility in benchmarking write-mostly workloads too, although that utility is probably more applicable for the database developers than as a way to determine which of two filesystems is better for running the database.
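One possible starting point, offered only as a sketch and not as the answer to what a representative workload looks like: pgbench can mix its built-in scripts by weight using `-b name@weight`, so the read-only `select-only` script can be blended with the default `tpcb-like` one. The Python helper below just builds those arguments; the function name and the 70/30 split are arbitrary. Note that the weights control the share of transactions, not of individual statements, and that `tpcb-like` itself contains one SELECT.

```python
def mixed_workload_args(read_percent):
    """pgbench arguments mixing select-only and tpcb-like transactions."""
    if not 0 <= read_percent <= 100:
        raise ValueError("read_percent must be between 0 and 100")
    args = []
    if read_percent > 0:
        args += ["-b", f"select-only@{read_percent}"]
    if read_percent < 100:
        args += ["-b", f"tpcb-like@{100 - read_percent}"]
    return args


# Hypothetical run: 70% read-only transactions, database named "bench".
command = ["pgbench", "-c", "64", "-j", "16", "-T", "600",
           *mixed_workload_args(70), "bench"]
print(" ".join(command))
```

A custom script passed with `-f` would give finer control over the statement mix, at the cost of having to decide what the queries should be.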