← Back to context

Comment by bayindirh

1 day ago

Actually, it doesn't work like that.

Sun's ZFS7420 had a terabyte of RAM per controller, and these work in tandem, and after a certain pressure, the thing can't keep up even though it also uses specialized SSDs to reduce HDD array access during requests, and these were blazingly fast boxes for their time.

When you drive a couple thousand physical nodes with a some-petabytes sized volumes, no amount of RAM can save you. This is why Lustre divides metadata servers and volumes from file ones. You can keep very small files in metadata area (a-la Apple's 0-sized, data-in-resource-fork implementation), but for bigger data, you need to have good filesystems. There are no workarounds from this.

If you want to go faster, take a look at Weka and GPUDirect. Again, when you are pumping tons of data to your GPUs to keep them training/inferring, no amount of RAM can keep that data (or sustain the throughput) during that chaotic access for you.

When we talked about performance, we used to say GB/sec. Now a single SSD provides that IOPS and throughput provided by storage clusters. Instead, we talk about TB/sec in some cases. You can casually connect terabit Ethernet (or Infiniband if you prefer that) to a server with a couple of cables.

> When you drive a couple thousand physical nodes with a some-petabytes sized volumes

You aren't doing that with ZFS or btrfs, though. Datacenter-scale storage solutions (c.f. Lustre, which you mention) have long since abandoned traditional filesystem techniques like the one in the linked article. And they rely almost exclusively on RAM behavior for their performance characteristics, not the underlying storage (which usually ends up being something analogous to a pickled transaction log, it's not the format you're expected to manage per-operation)

  • > You aren't doing that with ZFS or btrfs, though.

    ZFS can, and is actually designed to, handle that kind of workloads, though. At full configuration, ZFS7420 is a 84U configuration. Every disk box has its own set of "log" SSDs and 10 additional HDDs. Plus it was one of the rare systems which supported Infiniband access natively, and was able to saturate all of its Infiniband links under immense load.

    Lustre's performance is not RAM bound when driving that kind of loads, this is why MDT arrays are smaller and generally full-flash while OSTs can be selected from a mix of technologies. As I said, when driving that number of clients from a relatively small number of servers, it's not possible to keep all the metadata and query it from the RAM. Yes, Lustre recommends high RAM and core count for servers driving OSTs, but it's for file content throughput when many clients are requesting files, and we're discussing file metadata access primarily.

    • Again I think we're talking past each other. I'm saying "traditional filesystem-based storage management is not performance-limited at scale where everything is in RAM, so I don't see value to optimizations like that". You seem to be taking as a prior that at scale everything doesn't fit in RAM, so traditional filesystem-based storage management is still needed.

      But... everything does fit in RAM at scale. I mean, Cloudflare basically runs a billion dollar business who's product is essentially "We store the internet in RAM in every city". The whole tech world is aflutter right now over a technology base that amounts to "We put the whole of human experience into GPU RAM so we can train our new overlords". It's RAM. Everything is RAM.

      I'm not saying there is "no" home for excessively tuned genius-tier filesystem-over-persistent-storage code. I'm just saying that it's not a very big home, that the market has mostly passed the technology over, and that frankly patches like the linked article seem like a waste of effort to me vs. going to Amazon and buying more RAM.

      3 replies →

  • Besides ZFS (and I've heard of exabyte sized ZFS filesystems), bcachefs is absolutely capable of petabyte sized filesystems.