Comment by reacharavindh

6 years ago

As a heavy user of ZFS and Linux, what else is there that even comes close to what ZFS offers?

I want cheap and reliable snapshots, export & import of file systems like ZFS datasets, simple compression, caching facilities (like SLOG and ARC), and decent performance.
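
For readers who haven't used it, this is roughly what that feature set looks like from the ZFS CLI; the pool, dataset, and device names are all hypothetical:

    # Mirrored pool with an NVMe SLOG and an SSD L2ARC (the ARC itself is just RAM).
    zpool create tank mirror /dev/sda /dev/sdb log /dev/nvme0n1 cache /dev/nvme1n1
    zfs set compression=lz4 tank              # cheap inline compression
    zfs snapshot tank@before-upgrade          # near-free snapshot
    zfs send tank@before-upgrade | ssh backuphost zfs recv backup/tank   # export/import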

Bcachefs is probably the only thing that will get there. The codebase is clean and well maintained, built from solid technology (bcache), and will include most of the ZFS niceties. I just wish more companies would sponsor the project and stop wasting money on BTRFS.

  • Yes, I’m eagerly waiting for Bcachefs to get there at some point, but, if my understanding of its state is correct, it is several years away (rightly so, because it is hard and the developer is doing an amazing job).

    I have heard of durability issues with btrfs, and do not want to touch it if it fails at its primary job.

  • Which is why ZFS is still a thing today - there are no other alternatives. Everything else is coming "soon", while ZFS is actually here and clocking up a stability track record.

  • >Bcachefs is probably the only thing that will get there.

    Or Bcachefs is probably the only thing that might get there.

    The amount of engineering hours that went into ZFS is insane. It is easy to build a project that looks 80% similar on the surface, but then the last 20% and the edge cases take as much time as going from 0 to 80% did. ZFS has been battle-tested by many. Rsync.net runs on ZFS.

    The petabytes stored safely on ZFS over the years give peace of mind.

    Speaking of rsync.net, a ZFS topic on HN will normally have its founder resurface. I haven't seen a reply from him yet.

  • I’m looking forward to bcachefs becoming feature complete and upstreamed. We finally have a good chance of having a modern and reliable FS in the Linux kernel. My wish list includes snapshots and per volume encryption.

  • What if the main purpose of BTRFS is to have something "good enough" so no one starts working on a project that can compete with large commercial storage offerings?

    Does anyone remember the parity patches they rejected in 2014?

    > Your work is very very good, it just doesn’t fit our business case.

    I haven't followed it much. Does it have anything more than mirroring (that's stable) these days?

  • >stop wasting money on BTRFS

    You're saying they should stop supporting a project that was already considered stable by the time the other one started development. Why do that? What makes Bcachefs the better choice?

    • Take a cursory look at both codebases, and at the stability of every feature at launch and in maintenance. It's not hard to see BTRFS is a doomed project. Bcachefs is more like PostgreSQL: the developer doesn't add features until he has a solid, well-thought-out design. Hence why he hasn't implemented snapshots yet.

      I don't think too many people consider it stable enough for production, either. (Unless you count a very limited subset of its functionality).

      I'd rather run Bcachefs today than Btrfs, by a mile. At least with bcachefs I won't lose my data.

    • Btrfs is the only FS I've used that resulted in complete FS corruption, losing nearly all the data on disk, not once but three times.

      After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.

    • I don't think BTRFS has ever been considered stable.

      I think they just said: "The on-disk data structure is stable" and lots of people misinterpreted that as "the whole thing is stable"

      A stable on-disk data structure just means it's been frozen and can't be changed in non-backwards-compatible ways. It says nothing about code quality, feature completeness, or whether the frozen data structure was any good.

  • Snapshots don't seem to be done yet.

    • Kent has admitted (many times) that snapshots are one of the more difficult features to add in a reliable and safe way, and that they will require significant work to do right, especially for what he wants them to do (I assume "really damn fast and low overhead" is a major goal, plus some other tricks he has up his sleeve). So he has intentionally not tackled them yet, instead going after a slew of other features first: reflink, full checksumming, replication, caching, compression, native encryption, etc. All of that works today (rough sketch below).

      Snapshots are a huge feature for sure, but it's not like bcachefs is completely incapable without them.

      There was a very recent update he gave in late December (2019) that mentioned he's actively chipping away at the roadblocks for snapshots.
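
      For reference, a rough sketch of exercising those shipped features with bcachefs-tools; I'm writing the flags from memory and the device names are hypothetical, so double-check against bcachefs format --help:

          # Two-device filesystem with zstd compression, two replicas of
          # everything, and native encryption (checksumming is on by default).
          bcachefs format --compression=zstd --replicas=2 --encrypted /dev/sdb /dev/sdc
          mount -t bcachefs /dev/sdb:/dev/sdc /mnt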

I know this isn't an option for everyone, but this is part of why I run FreeBSD instead of Linux for servers where I need ZFS.

  • This isn't why I started running FreeBSD, but it is also one of the reasons I continue to run FreeBSD.

    • Yes, I run Linux for business but keep a FreeBSD box personally, so I stay used to it in case I need ZFS for business.

I agree that ZFS has a lot to offer. But the legal difficulties in merging ZFS support into the mainline kernal are understandable. It's a shame but I think he is making the right call.

  • Merging into the mainline kernel is not what the person he is replying to was even asking for. All they were asking is for Linux to stop putting APIs behind GPL-only export restrictions (a kind of DRM) that prevent non-GPL modules like ZFS from using them. That doesn't mean ZFS must be bundled with Linux.

    I think everyone is in agreement that ZFS can't be included in the mainline kernel. The question is just whether users should be able to install and use it themselves or not.

  • Kernal? If you can merge ZFS support into an 8 KB KERNAL, then you are no mere mortal, so no need to worry about any legal difficulties.

XFS on an LVM thin-pool LV should give you a very robust FS, cheap CoW snapshots, and multi-device support. If you want, you can put the thin pool on RAID via LVM RAID underneath it.

For import/export, IIRC XFS supports it via xfsdump/xfsrestore, and you can dump from an LV snapshot to get atomicity.

For caching there is LVM cache; it should again be possible to combine it with the thin pool & RAID, or you can use it separately for a normal LV.

All this is functionality tested by years of production use.
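
For the curious, a minimal sketch of that stack; the volume group, LV names, sizes, and devices are all hypothetical:

    # Thin pool inside an existing VG, a thin LV with XFS on it,
    # and an instant CoW snapshot of that LV.
    lvcreate --type thin-pool -L 1T -n pool0 vg0
    lvcreate --thin -V 500G -n data vg0/pool0
    mkfs.xfs /dev/vg0/data
    lvcreate --snapshot --name data-snap vg0/data
    lvchange -ay -K vg0/data-snap     # thin snapshots skip activation by default

    # Attach a fast SSD as an LVM cache to a normal LV.
    lvcreate --type cache-pool -L 100G -n cpool vg0 /dev/nvme0n1
    lvconvert --type cache --cachepool vg0/cpool vg0/bulk

    # Atomic backup: mount the snapshot (nouuid, since XFS refuses duplicate
    # UUIDs) and dump it with xfsdump.
    mount -o ro,nouuid /dev/vg0/data-snap /mnt/snap
    xfsdump -l 0 -L datasnap -M backup0 -f /backup/data.dump /mnt/snap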

For compression/deduplication, that is AFAIK a work in progress upstream, based on the open-sourced VDO code.
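
Until that lands upstream, the downstream vdo manager (as shipped on RHEL/Fedora) looks roughly like this; the name, device, and logical size are hypothetical:

    # Create a dedup/compression layer under the filesystem with VDO.
    vdo create --name=vdo0 --device=/dev/sdf --vdoLogicalSize=10T
    mkfs.xfs -K /dev/mapper/vdo0   # -K skips discards at mkfs time, per the VDO docs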

  • Interesting - a combination of tools I have used independently, but never as a replacement for my beloved ZFS.

    Never made snapshots with LVM. I've always used LVM as a way to carve up logical storage from a pool of physical devices, but nothing more. I need to RTFM on how snapshotting would work there - could I restore just a few files from an hour ago while leaving everything else as it is?

    With ZFS, I use RAM as a read cache (ARC) and an Optane disk as a sync write cache (SLOG). I wonder if LVM cache would let me do such a thing. Again, a pointer for more manual reading for me.

    Compression is a nice to have for me at this moment. Good to know that it is being worked on at the LVM layer.

  • Call me when somebody like a major cloud provider has used this system to drive millions of hard drives. I'm not going to patch my data security together like that.

    There is a difference between 'all these tools have been used in production' and 'this is an integrated tool that has been used for 15+ years in the biggest storage installations in the world'.

    • Yes! The problem with the LVM approach to replicating anything ZFS does is that you have to use a myriad of different tools. Then you have to pray that they all work correctly together; if one has a bug, you may lose all your data because of the corruption that emerges from it.

Honestly asking: how does Btrfs compare to ZFS?

There's also Lustre but it's a different beast altogether for a different scenario.

  • On the surface, btrfs is pretty close to zfs.

    Once you actually use them, you discover all the ways that btrfs is a pain and zfs is a (minor) joy:

    - snapshot management

    - online scrub

    - data integrity

    - disk management

    I lost data from perfectly healthy-appearing btrfs systems twice. I've never lost data on maintained zfs systems, and I now trust a lot more data to zfs than I ever have to btrfs.

    • At least disk management is far easier with btrfs: you can restripe at will (sketch below), while zfs has severe limitations around resizing, adding, and removing devices.

      Granted, at enterprise scale this hardly matters because you can just send-receive to rebuild pools if you have enough spares, but for consumer-grade deployments it's a non-negligible annoyance.
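
      For reference, that restriping is a single online operation; the mount point and target profile here are hypothetical:

          # Add a disk, then convert data and metadata to RAID10 while mounted.
          btrfs device add /dev/sdd /mnt/pool
          btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/pool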

    • Since the plural of anecdote is data, I'll provide mine here. ZFS is the only file-system from which I've lost data on hardware that was functioning properly, though that does come with a caveat.

      Twice btrfs ended up in a non-mountable situation, but both times it was due to a known issue and #btrfs on freenode was able to walk me through getting it working again.

      With ZFS, I ended up with a non-mountable system, and the response in both #zfs and #zfsonlinux to my posting the error message was, "that sucks, hope you had backups." Since I had backups, and it was my laptop 2000 miles from home and my only computing device, I didn't dig deeper to see if I could discover the problem. FWIW, I've been using ZFS on that same hardware for almost 2 years since with no issues.

    • Thanks for your answer and sorry for your data loss.

      > I lost data from perfectly healthy-appearing btrfs systems twice.

      I still consider btrfs as beta-level software. This is why I never looked into it very seriously and asked this question.

      Looks like btrfs needs something like another five years to be considered serious at the scale where ZFS is just starting to warm up.

    • The one thing I can't understand about btrfs is that there is no definite answer to the question "How much disk space do I have left?". I don't get how that can be a "this much, maybe" answer.

    • btrfs is such a mess that for a database or VM to be marginally stable, you have to disable the CoW feature set for those files with the +C attribute. It's nowhere near a serious solution.
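
      For context, that workaround looks like this (path hypothetical); +C only takes effect on newly created files, which is why it is set on an empty directory that new files then inherit it from:

          # New files under this directory get NOCOW (no CoW, and thus no checksums).
          mkdir -p /var/lib/mysql
          chattr +C /var/lib/mysql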

  • Btrfs has eaten my data, and once that happens I will never, ever, literally ever go back to that system. It's unacceptable to me that a system eats data, especially after multiple rounds of 'it's stable now'.

    But in the end it always turns out that it will only spare your data if you 'use' it correctly.

    I used ZFS for far longer and had far fewer issues.

Stratis and VDO have a lot of promise, although it's still a little early. The approach that Stratis has taken is refreshing. It's very simple and reuses lots of already existing stuff so by the time it's released it will already be mature (since the underlying code has been running for many years).

Once a little more guidance comes out about how to properly use VDO and Stratis together, I'll move my personal stuff to it.

So besides the obvious btrfs answer, what about Ceph as clustered storage with very fast connectivity?

There is also BeeGFS, I haven't used it but /r/datahoarders sometimes touts it.

Not for Linux, but I have been keeping an eye on Matt Dillon's DragonFly BSD, where he has been working on HAMMER2, which is very interesting.

I don't know much about it, but bcachefs has been making more waves lately as well.

I think the bottom line is that people need to have good backup in place regardless.

Does btrfs meet your requirements?

  • I've tried btrfs without much luck.

    btrfs still has a write hole for RAID5/6 (the kind I primarily use) [0] and has since at least 2012.

    For a filesystem to have a bug leading to data loss unpatched for over 8 years is just plain unacceptable.

    I've also had issues even without RAID, particularly after power outages. Not minor issues but "your filesystem is gone now, sorry" issues.

    [0]: https://btrfs.wiki.kernel.org/index.php/RAID56

    • It's not a bug, but an unimplemented feature. They never made any promise that raid5 is production-ready.

      Pretty much all software-RAID systems suffer from it unless they explicitly patch over it via journaling. Hardware RAID gets away with it if it has battery backup; if it doesn't, it suffers from exactly the same problem.

    • My home NAS runs btrfs in RAID 5. The key is to use software RAID / LVM to present a single block device to btrfs. That way you never use btrfs's screwed-up RAID 5/6 implementation.
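
      A minimal sketch of that layout, with hypothetical device names; note that btrfs on a single md device can still detect corruption via its checksums, but cannot self-heal it.

          # Four disks in an md RAID5 array, presented to btrfs as one device.
          mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
          mkfs.btrfs /dev/md0
          mount /dev/md0 /mnt/nas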

    • Why use RAID5/6? RAID10 is much safer because you drastically reduce the chance of a cascading resilvering failure. Yes, you get less capacity per drive, but drives are (relatively) cheap.

      I thought I wanted RAID5, but after reading horror stories of drives failing when replacing a failed drive, I decided it just wasn't worth the risk.

      I currently run RAID1, and when I need more space, I'll double my drives and set up RAID10. I don't need most of the features of ZFS, so BTRFS works for me.

  • btrfs is not at all reliable, so if you care about your files staying working files, it probably doesn't meet your requirements. It is like the MongoDB 0.1 of filesystems.

Hardware RAID controllers can do most if not all of these things.

  • I've lost more data in hardware RAID than in ZFS but I have lost data in both.

    Hardware RAID has very poor longevity. Vendor support and battery-backup replacement collide badly with BIOS and host management.

    Disclaimer: I work on Dell rackmounts, which means that rather than native SAS I get 'Dell's hack on SAS', which is a problem; I know it's possible to 'downgrade' back to native.

    • Yeah we started ordering the ones with the supercap so we didn’t have to replace batteries anymore.

      Somewhat recently I dealt with LSI and Dell cards. Longevity seemed just fine for a normal 3 year server lifecycle. The only time we had an issue is when the power went down in the data center. The power spike fried a few of the cards. Luckily we had spares.

      Way way back I dealt with the Compaq/hp smartarrays. Those were awful. Also anything consumer grade is awful.

  • The problem with most of these is you have to bring the system down to do maintenance. You can do a scrub on zfs while it's up.
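
    For reference, a scrub on a live pool is a single command (pool name hypothetical):

        # Verify every checksum in the pool in the background, then check progress.
        zpool scrub tank
        zpool status tank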

    • Most non-hobbyist RAID hardware does online-scrub just fine (not that I would recommend wasting money on such hw).

      Btw, a ZFS scrub is not only a RAID block check but also a partial fsck, so it's not really comparable.

  • Pay more for less safety and put all your data into the hands of the guy who wrote the firmware for that thing. I'm sure that software is well-maintained open-source code.