Comment by KaiserPro
7 years ago
So, I have mixed feelings.
I don't think in the short term this will be a problem. However, IBM has lost its way. It's a very large, unwieldy organisation that doesn't change very fast.
It also has an awful lot of lifers, who would flounder horribly outside the soft warm IBM shell.
But, there are some brilliant engineers and technologies inside IBM. The ones I know about are to do with GPFS, which is a shining beacon compared to ceph and gluster.
A clever organisation _could_ mix GPFS, their vast experience with scheduling and resource allocation, and second-to-none documentation (have you read a Redbook? They are marvels of readability) to make a spectacular platform. One that, unlike K8s, would be easy to use, understand, tune, script for and run.
They won't be able to execute it properly, but they have the potential.
I definitely agree with you. I’ve been paid more or less to run GPFS over the last 15 years or so (finally free of it for the last few months), and while I hate it with a searing passion, a) it’s far better than the alternatives as long as you can afford it, and b) the people who lead and do the real work on the project are very, very bright. If they keep these people working on hard problems, they’ll do well. If they try to replace them with inexpensive staff, it will fall apart in months.
Parallel POSIX-compliant filesystems at scale are an astonishingly hard problem, and adding the feature set they have while keeping it relatively stable and performant is really worthy of admiration. Every one of the dozens of conversations I’ve had with their developers and technical leads has left me more impressed. They literally can’t test it internally at the scale that their large customers run it, but they’re really good at finding problems and pushing out hotfixes that work pretty well.
That said, if I never have to touch it again for as long as I live, I’ll be a very happy man. From an HPC perspective, I think we’re long past the scale where parallel filesystems should be POSIX-based.
At the risk of topic drift, I'm curious where the problems are with being parallel and POSIX-compliant. I'm at a point where I think I need to consider a parallel file system and I don't have much experience with them, so I'm not aware of the issues.
Basically, anywhere POSIX makes a guarantee that is difficult to fulfill performantly in a distributed environment. For example:
* Some directory operations require an atomic update to 2 different inodes (rename() across directories, for instance), confounding sharding strategies and requiring some form of global synchronization.
* write() -- the write syscall guarantees that when it returns, the filesystem will serve all future read() calls with the contents of the write. This matches up poorly with big streaming writes: to fulfill the spec you need all these giant pockets of latency waiting for an ACK from the remote host, instead of just streaming it all and getting one ACK at the end (see the sketch after this list).
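To make those two bullets concrete, here is a minimal plain-POSIX sketch in C. The /gpfs paths are invented; the point is that a distributed filesystem has to honor these guarantees behind perfectly ordinary-looking calls:

```c
/* posix_contract.c -- sketch of two POSIX guarantees that are expensive
 * to honor on a distributed filesystem. Paths are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* 1. write() visibility: once this call returns, a read() started
     *    afterwards on ANY node of the cluster must see these bytes,
     *    so the client can't just stream and ACK once at the end. */
    int fd = open("/gpfs/scratch/result.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "checkpoint block";
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
        perror("write");
    close(fd);

    /* 2. rename() atomicity: moving a file between directories updates
     *    two directory inodes that may live on different metadata
     *    servers, yet no observer may ever see both entries or neither. */
    if (rename("/gpfs/scratch/result.dat", "/gpfs/archive/result.dat") != 0)
        perror("rename");

    return 0;
}
```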
Here’s something I wrote in this thread:
https://news.ycombinator.com/item?id=15222470
At least on the real big systems, most of our work was with a small number of research groups basically doing the same workflow: start your job, read in some data, crunch, every 15 minutes or something slam the entire contents of system memory (100s of TB) out to spinning disk, crunch, slam, your job gets killed when your time slice expires, get scheduled again, load up the last checkpoint, rinse, repeat. Because of the sheer amount of data, it’s an interesting problem, but you could generally work with the researchers to impose good I/O behavior that gets around the POSIX constraints and the peculiarities of the particular filesystems. You want 100,000,000 cpu hours on a $200M computer? You can do the work to make the filesystem writes easier on the system.
Coming into private industry was a real eye-opener. You’re in-house staff and you don’t get to say who can use the computer. People use the filesystems for IPC, store 100M files of 200B each, read() and write() terabytes of data 1B at a time, you name it. If I had $100 for every job in which I saw 10,000 cores running stat() in a while loop, waiting for a file to get written by one process that had long since died, I’d be retired on a beach somewhere.
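For illustration only, that polling anti-pattern looks roughly like this (the path and sleep interval are made up). It is harmless on one node and brutal on the metadata servers when thousands of ranks run it at once:

```c
/* busy_wait.c -- the anti-pattern: poll the shared filesystem until
 * another process writes a flag file. If the producer died, this spins
 * forever, and every iteration is a metadata RPC. Path is hypothetical. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;

    while (stat("/gpfs/project/run42/stage1.done", &st) != 0) {
        sleep(1);   /* 10,000 cores doing this hammers the metadata servers */
    }

    printf("flag file appeared, size %lld bytes\n", (long long)st.st_size);
    return 0;
}
```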
The problem with POSIX I/O is that it’s so, so easy and it almost always works when you expect it to. GPFS (what I’m most familiar with) is amazing at enforcing the consistency. I’ve seen parallel filesystems and disks break in every imaginable way and in a lot of ways that aren’t, but I’ve never seen GPFS present data inconsistently across time: never a case where some write call was finished and its data didn’t show up to a read() started after the write got its lock, or a situation where some process opened a file after the unlink was acknowledged. For a developer who hasn’t ever worked with parallel computing and whose boss just wants them to make it work, the filesystem is an amazing tool. I honestly can’t blame a developer who makes it work for 1000 cores and then gets upset with me when it blows up at 1500. I get grouchy with them, but I don’t blame them. (There’s a difference!)
But as the filesystems get bigger, the amount of work the filesystems have to do to maintain that consistency isn’t scaling. The amount of lock traffic flying back and forth between all the nodes is a lot of complexity to keep up with, and if you have the tiniest issue with your network even on some edge somewhere, you’re going to have a really unpleasant day.
One of the things that GCE and AWS have done so well is to just abandon the concept of the shared POSIX filesystem, and produce good tooling to help people deal with the IPC and workflow data processing without it. It’s a hell of a lot of work to go from an on-site HPC environment to GCE though. There’s a ton of money to be made for someone who can make that transition easier and cheaper (if you’ve got it figured out, you know, call me. I want in on it!), but people have sunk so much money into their parallel filesystems and disk that it’s a tough ask for the C-suite. Hypothetically speaking, someone I know really well who’s a lot like me was recently leading a project to do exactly this that got shut down basically because they couldn’t prove it would be cheaper in 3 years.
> They won't be able to execute it properly, but they have the potential.
AKA The last two decades of IBM.
> GPFS, which is a shining beacon compared to ceph and gluster.
Solves an entirely different problem.
Depends on how you use it, and what it's for.
_most_ object storage is actually used as a pseudo-filesystem (i.e. S3 et al) because shared, fast & reliable filesystems are vanishingly rare.
Apart from OpenStack (and I've never seen a successful deployment of it outside of Rackspace), most use cases I've seen involve bolting an NFS head or some other filesystem onto ceph and serving it publicly.
which is frankly not all that great. I like the _idea_ of ceph, but I don't want to have to support it. Like Nexenta, it seems great, but it soon hurts during crunch.
What I like about GPFS is that it allows you to join up large amounts of block storage, regardless of the underlying fabric.
Everything has a hook, so if a file has been created/updated/moved/deleted or had its metadata changed, you can attach a script to that action. There is an inbuilt HSM, which allows you to shuffle files about based on rules: raw footage? Move it to the spinny disk array. Final deliverables? Move them to the storage in the other country. File bigger than 1TB and not touched in two weeks? Sure, you can kick it out onto tape.
Crucially, because it's all one namespace, the end user doesn't have to care about where the file is; the system takes care of that based on rules.
The best part is that there are no special tricks needed for the end program; it's just standard file I/O.
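GPFS expresses those tiering rules declaratively in its own policy language; purely as an illustration, the predicate behind the "bigger than 1TB and untouched for two weeks goes to tape" rule boils down to something like this C sketch (the thresholds and helper name are mine, not GPFS's):

```c
/* tier_rule.c -- sketch of the decision such a migration rule encodes:
 * big AND cold -> tape pool. GPFS does this declaratively in its policy
 * engine; this is only the predicate, written out in plain C. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define ONE_TB        (1024LL * 1024 * 1024 * 1024)
#define TWO_WEEKS_SEC (14 * 24 * 60 * 60)

static bool should_go_to_tape(const struct stat *st, time_t now)
{
    return st->st_size > ONE_TB &&
           (now - st->st_atime) > TWO_WEEKS_SEC;
}

int main(int argc, char **argv)
{
    struct stat st;
    if (argc < 2 || stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("%s: %s\n", argv[1],
           should_go_to_tape(&st, time(NULL)) ? "migrate to tape"
                                              : "leave in place");
    return 0;
}
```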
However, it is one global system, which is its downside. For pure uptime it's better to have an array of file servers to limit the blast radius, but then you don't get the goodness.
Is there a short summary of which problems are solved by which filesystem?
On the interwebs? No idea. Mine would be:
GPFS is excellent as a clustered filesystem backing a number of servers that need high-throughput, low latency, coherent storage. It's block-oriented.
Ceph is a SAN replacement: it can saturate 100 Gbps switches with massive parallel throughput. Object-based, but it can also serve up block and (recently, with limits) a cooked filesystem.
Gluster is a distributed filesystem - easy to set up and configure, some performance limitations, file rather than object or block oriented.
If GPFS still has NSD limits, it ain't gonna work at cloud scale.