Well, first of all: I'm not trying to bash BTRFS at all; it's probably just not meant for me. However, I'm trying to find out whether it is really considered stable (like rock solid) or whether what I saw might have been a hardware problem on my system.
I used cryptsetup with BTRFS because I encrypt all of my stuff. One day, the system froze, and after a reboot the partition was unrecoverably gone (the whole story[1]). Not a real problem because I had a recent backup, but somehow I lost trust in BTRFS that day. Has anyone experienced something like that?
Since then I have switched to ZFS (on the same hardware) and never had problems - while it was a real pain to set up until I finished my script [2], which is still kind of a collection of dirty hacks :-)
1: https://forum.cgsecurity.org/phpBB3/viewtopic.php?t=13013
2: https://github.com/sandreas/zarch
Yes, my story with btrfs is quite similar: I used it for a couple of years, it suddenly threw some undocumented error and refused to mount, I asked about it on the dev IRC channel and was told it was apparently a known issue with no solution, have fun rebuilding from backups. There was no suggestion that anyone was interested in documenting the issue, let alone fixing it.
These same people are the only ones in the world suggesting btrfs is "basically" stable. I'll never touch this project again with a ten-foot pole; afaic it's run by children. I'll trust adults with my data.
I ran it on openSUSE and it would reliably peg a core at 100% on some sort of cron-scheduled tree rebalancing (?). I mean... hello? Online algorithms? Dynamic rebalancing? Scheduled FS restructuring, really? ReiserFS had dancing trees 20 years ago. If that's the attitude ("meh, the user just has to deal with it"), no wonder this is how they handle bugs.
Ok, thank you. At least I'm not alone with this. However, I'm not too deep into it and would not go as far as to say it's not a recommendable project, but boy was I mad that it just died without any way to recover ANYTHING :-)
I worked on a Linux distro some years ago that had to pull btrfs, long after people had started saying that it's truly solid, because customers had so many issues. It's probably improved since, but it's hard to know. I'm surprised Fedora Workstation defaults to it now. I'm hoping bcachefs finds its way, in the next few years, to being the rock-solid FS it aims to be.
I hadn't heard of bcachefs, but I looked it up and apparently Linus just removed it from the kernel source tree last month for non-technical reasons.
https://en.wikipedia.org/wiki/Bcachefs#History
Yeah, what really made me wonder is that I apparently had incomplete and wrong manpages in the recovery sections... the examples did not work as described, but I can't remember exactly what it was; I was too mad and ditched it completely :-)
I've used it as my desktop's main filesystem for many years and have not had any problems. I take regular snapshots with snapper. I run the latest kernel, so ZFS is not an option.
That said, I avoid it like the plague on servers: to get acceptable performance (or avoid fragmentation) with VMs or databases you need to disable COW, which disables many of its features, so it's better to just roll with XFS (and get pseudo-snapshots anyway).
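For anyone wondering what "disable COW" looks like in practice, here's a minimal sketch of the per-directory approach (the path is just an example, and note that the +C attribute only affects files created after it is set, so it's typically applied to an empty directory before the VM images or database files land there):

    # create the directory first, then mark it NOCOW before putting files in it
    mkdir -p /var/lib/libvirt/images
    chattr +C /var/lib/libvirt/images
    lsattr -d /var/lib/libvirt/images    # the 'C' flag should now show up

There's also a filesystem-wide nodatacow mount option, but that additionally turns off data checksumming for the files it applies to.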
In the unlikely case you're running SQLite, it's possible to get okay performance on btrfs too:
https://wiki.tnonline.net/w/Blog/SQLite_Performance_on_Btrfs
Have you used 4K sectors with cryptsetup? Many distributions still default to 512-byte sectors if the SSD reports 512 bytes as its logical sector size, and with 512-byte sectors there is heavier load on the system.
I was reluctant to use BTRFS on my Linux laptop, but for the last 3 years I have been using it with a 4K cryptsetup sector size with no issues.
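In case anyone wants to check or replicate this, a rough sketch; /dev/nvme0n1p2 and the mapping name "cryptroot" are placeholders, and note that --sector-size is LUKS2-only and luksFormat destroys existing data, so this is for a fresh setup rather than an in-place conversion:

    # what the drive reports
    cat /sys/block/nvme0n1/queue/logical_block_size
    cat /sys/block/nvme0n1/queue/physical_block_size

    # create a new LUKS2 container with 4K sectors (WIPES the partition)
    cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/nvme0n1p2

    # open it and verify the sector size that ended up in the header
    cryptsetup open /dev/nvme0n1p2 cryptroot
    cryptsetup status cryptroot    # look for "sector size: 4096"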
I used the default archinstall... I didn't check the sector size, but it's good to hear that it works for you. Maybe I'll check again with my next setup.
My btrfs filesystem has been slowly eating my data for a while; large files will find their first 128k replaced with all nulls. Rewriting it will sometimes fix it temporarily, but it'll revert back to all nulls after some time. That said, this might be my fault for using raid6 for data and trying to replace a failing disk a while ago.
RAID 5/6 is completely broken and there's not much interest in fixing it: none of the companies willing to pay for btrfs development (Facebook, SUSE, Oracle, WD) use RAID 5/6, so you shouldn't have been running it in the first place. I understand this is basically blaming the victim, but doing at least some research on a filesystem before starting to use it is a good idea in any case.
https://btrfs.readthedocs.io/en/latest/Status.html
edit: just checked, it says the same thing in the man pages: not for production use, testing/development only.
> One day, the system froze and after reboot the partition was unrecoverably gone (the whole story[1]).
it looks like you didn't use RAID, so any FS could have failed in case of disk corruption.
Thank you for your opinion. Well... it did not just fail: cryptsetup mounted everything fine, but the BTRFS tools did not find a valid filesystem on it.
While it could have been a bit flip that destroyed the whole encryption layer, BTRFS debugging revealed that there were still traces of BTRFS headers after opening the cryptsetup mapping, and some of the data on the decrypted partition was there...
This probably means the encryption layer was fine; the BTRFS part just could not be repaired or restored. The only explanation I have for this is that something resulted in a dirty write, which destroyed the whole partition table, the backup partition table and, since I used subvolumes and could not restore anything, most of the data.
Well, maybe it was my fault, but since I'm running the exact same system with the same hardware right now (same NVMe SSD), I really doubt that.
What the hell are you talking about? Every filesystem on every OS I've seen in the last 3 decades has had some kind of recovery path after a crash. Some of them lose more data, some of them less. But being unable to mount at all is a bug that makes a filesystem untrustworthy and useless.
And how would RAID help in that situation?
I was surprised by this new attempt at performance profiles/device roles/hints when we already have a very good patch set maintained by kakra:
- https://github.com/kakra/linux/pull/36
- https://wiki.tnonline.net/w/Btrfs/Allocator_Hints
What do you think?
> One of the reasons why these patches are not included in the kernel is that the free space calculations do not work properly.
It seems these patches possibly fix that.
I wonder if I can use a smaller SSD for this and make it avoid HDD wakeups due to some process reading metadata. That alone would make me love this feature.
I think you'd rather want a cache device (or some more complicated storage tiering) for that, so that both metadata and frequently accessed files get moved to the SSD dynamically based on access patterns. Afaik btrfs doesn't support that, but LVM, bcache, device mapper, bcachefs and ZFS do (though ZFS would require separate devices for the read cache and for synchronous writes). And I don't know which of these let you control the writeback interval.
Bcache allows lots of writeback configuration, including intervals https://www.kernel.org/doc/html/latest/admin-guide/bcache.ht...
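For reference, the knobs live in sysfs; a small sketch assuming the cached device shows up as bcache0 (the values are illustrative, see the linked admin guide for the exact semantics):

    # cache writes on the SSD and flush them to the backing HDD later
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # target percentage of dirty data the cache keeps before background writeback throttles
    echo 10 > /sys/block/bcache0/bcache/writeback_percent

    # seconds to wait after dirty data first lands in the cache before writing it back
    echo 300 > /sys/block/bcache0/bcache/writeback_delay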
Most likely yes, but the also-envisioned periodic repacking of multiple small data extents into one big extent that gets written to the HDD would wake up the HDD. And if you made the SSD "metadata only", browser cache and logging would keep the HDD spinning anyway.
This feature is for performance, not the case you described.
Just buy more RAM and you get that for free. Really, I guess that's my sense of patches like this in general: sure, filesystem research has a long and storied history, and it's a very hard problem that attracts some of the smartest people in the field to do genius-tier work...
Does it really matter in the modern world where a vanilla two-socket rack unit has a terabyte of DRAM? Everything at scale happens in RAM these days. Everything. Replicating across datacenters gets you all the reliability you need, with none of the fussing about storage latency and block device I/O strategy.
Actually, it doesn't work like that.
Sun's ZFS7420 had a terabyte of RAM per controller, with the controllers working in tandem, and past a certain pressure the thing still couldn't keep up, even though it also used specialized SSDs to reduce HDD array access during requests. And these were blazingly fast boxes for their time.
When you drive a couple thousand physical nodes against petabyte-sized volumes, no amount of RAM can save you. This is why Lustre separates metadata servers and volumes from the file ones. You can keep very small files in the metadata area (a la Apple's 0-sized, data-in-resource-fork files), but for bigger data you need good filesystems. There is no workaround for this.
If you want to go faster, take a look at Weka and GPUDirect. Again, when you are pumping tons of data to your GPUs to keep them training/inferring, no amount of RAM can hold that data (or sustain the throughput) for you under that kind of chaotic access.
When we talked about performance, we used to say GB/sec. Now a single SSD provides the IOPS and throughput that whole storage clusters used to provide; instead, we talk about TB/sec in some cases. You can casually connect terabit Ethernet (or InfiniBand if you prefer) to a server with a couple of cables.
Some time ago (back when we were still using spinning rust) I wondered whether one could bypass the latency of disk access by replicating to multiple hosts instead. I mean, how likely is it that two hosts crash at the same time? Well, it turns out there are causes that take out multiple hosts simultaneously (a way too common one seems to be diesel generators that fail to start after a power failure). I think the good fellas at Amazon, Meta and Google even have stories to tell about whole data centers failing. So you need replication across data centers, but then network latency bites you. Current NVMe storage devices are faster than that (and for some access patterns nearly as fast as RAM).
And that's just at the largest scale. I'm pretty sure banks still insist that the data is written to (multiple) disks (aka "stable storage") before completing a transaction.
> Does it really matter in the modern world
Considering that multiple ZFS developers get paid to make ZFS work well on petabyte-sized disk arrays with SSD caching, and that one of them regularly reports on progress in this area on his podcasts (2.5admins.com and BSD Now, if you're interested)... then yes?