
Comment by davidmr

7 years ago

Here’s something I wrote in this thread:

https://news.ycombinator.com/item?id=15222470

At least on the real big systems, most of our work was with a small number of research groups basically doing the same workflow: start your job, read in some data, crunch, every 15 minutes or so slam the entire contents of system memory (100s of TB) out to spinning disk, crunch, slam, your job gets killed when your time slice expires, get scheduled again, load up the last checkpoint, rinse, repeat. Because of the sheer amount of data, it's an interesting problem, but you could generally work with the researchers to impose good I/O behavior that works around the POSIX constraints and the peculiarities of the particular filesystems. You want 100,000,000 CPU hours on a $200M computer? You can do the work to make the filesystem writes easier on the system.
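
For flavor, the control flow of that workflow is roughly the loop below. This is a minimal sketch, assuming a single-node job; init_state()/crunch() are made-up stand-ins for the real solver and the scratch path is hypothetical:

```python
import pickle
from pathlib import Path

CHECKPOINT = Path("/scratch/job/checkpoint.pkl")  # hypothetical scratch path

def init_state():
    return [0.0] * 1024              # placeholder for the real simulation state

def crunch(state):
    return [x + 1.0 for x in state]  # placeholder for the compute phase

def run(total_steps=10_000, steps_per_checkpoint=100):
    # Resume from the last checkpoint if the previous time slice was killed.
    if CHECKPOINT.exists():
        with CHECKPOINT.open("rb") as f:
            step, state = pickle.load(f)
    else:
        step, state = 0, init_state()

    while step < total_steps:
        state = crunch(state)
        step += 1
        if step % steps_per_checkpoint == 0:
            # Write to a temp file and rename, so getting killed mid-write
            # never leaves a truncated checkpoint behind.
            tmp = CHECKPOINT.with_suffix(".tmp")
            with tmp.open("wb") as f:
                pickle.dump((step, state), f)
            tmp.replace(CHECKPOINT)

if __name__ == "__main__":
    run()
```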

Coming into private industry was a real eye-opener. You're in-house staff, and you don't get to say who can use the computer. People use the filesystems for IPC, store 100M files of 200B each, read() and write() terabytes of data 1B at a time, you name it. If I had $100 for every job in which I saw 10,000 cores running stat() in a while loop, waiting for data to be written to a file by one process that had long since died, I'd be retired on a beach somewhere.
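
That stat()-in-a-while-loop job looks something like this sketch (the flag path is made up); every iteration is a metadata operation the filesystem has to serve, multiplied by every polling core:

```python
import os
import time

FLAG = "/shared/job/ready.flag"  # hypothetical path on the shared filesystem

# The anti-pattern: each of thousands of ranks polls the shared
# filesystem for a file some producer is supposed to write. Nothing
# here will ever notice that the producer died an hour ago.
while True:
    try:
        os.stat(FLAG)
        break  # flag appeared; go read the data
    except FileNotFoundError:
        time.sleep(0.1)
```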

The problem with POSIX I/O is that it's so, so easy, and it almost always works when you expect it to. GPFS (what I'm most familiar with) is amazing at enforcing the consistency. I've seen parallel filesystems and disks break in every imaginable way, and in a lot of ways that aren't, but I've never seen GPFS present data inconsistently across time: never a case where a write() had finished and its data didn't show up in a read() started after the write got its lock, and never a process that opened a file after the unlink was acknowledged. For a developer who hasn't ever worked with parallel computing and whose boss just wants them to make it work, the filesystem is an amazing tool. I honestly can't blame a developer who makes it work for 1000 cores and then gets upset with me when it blows up at 1500. I get grouchy with them, but I don't blame them. (There's a difference!)
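
To make that guarantee concrete, the ordering contract being described is roughly the one below; the path is made up, and the writer and reader should be read as separate processes on separate nodes:

```python
# Writer (node A): once write() returns, the bytes are part of the
# cluster-wide view, not just a local page cache.
payload = b"simulation results"  # stand-in data
with open("/gpfs/project/result.dat", "wb") as f:  # hypothetical path
    f.write(payload)

# Reader (node B): a read() that starts after the writer's write()
# completed is guaranteed to see the new data -- that's the POSIX
# consistency GPFS goes to such lengths to enforce.
with open("/gpfs/project/result.dat", "rb") as f:
    data = f.read()
assert data == payload
```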

But as the filesystems get bigger, the amount of work they have to do to maintain that consistency isn't scaling. The lock traffic flying back and forth between all the nodes adds up to a lot of complexity, and if you have the tiniest issue with your network, even on some edge somewhere, you're going to have a really unpleasant day.

One of the things that GCE and AWS have done so well is to just abandon the concept of the shared POSIX filesystem and produce good tooling to help people deal with the IPC and workflow data processing without it. It's a hell of a lot of work to go from an on-site HPC environment to GCE, though. There's a ton of money to be made for someone who can make that transition easier and cheaper (if you've got it figured out, you know, call me. I want in on it!), but people have sunk so much money into their parallel filesystems and disk that it's a tough ask for the C-suite. Hypothetically speaking, someone I know really well who's a lot like me was recently leading a project to do exactly this, and it got shut down basically because they couldn't prove it would be cheaper in 3 years.
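
One concrete flavor of that tooling: whole-object puts and gets against an object store instead of shared-filesystem writes. A sketch using boto3 against S3, with an invented bucket and key layout:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-hpc-checkpoints"  # hypothetical bucket

state_bytes = b"checkpoint contents"  # stand-in for real checkpoint data

# Each rank PUTs a whole object instead of writing into /gpfs/...:
# no locks, no cross-node lock traffic, and consistency is reduced
# to whole-object visibility, which object stores scale well.
s3.put_object(Bucket=BUCKET, Key="job-42/step-100/rank-0", Body=state_bytes)

restored = s3.get_object(Bucket=BUCKET, Key="job-42/step-100/rank-0")["Body"].read()
```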

This is a lovely wedge of experience.

You are of course very correct that if you have a massive number of processes using files for IPC, things will fail in very strange ways.

One of the very positive things AWS/GCE provide is simple, scalable primitives with obvious metrics to measure limits. For example, SQS is a brilliant primitive for many-to-many, message-based processing. It avoids the horror of running an HA Kafka queue (or worse).
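
As a sketch of what that looks like in practice (queue name and work items invented), the whole producer/consumer dance is a handful of boto3 calls, and a worker dying mid-task just means its message becomes visible again:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="work-items")["QueueUrl"]  # hypothetical queue

def process(body):
    print("working on", body)  # stand-in for the real task

# Producers: any number of processes can enqueue work.
sqs.send_message(QueueUrl=queue_url, MessageBody="input/file-0001")

# Consumers: any number of workers pull with long polling -- no
# stat()-in-a-while-loop. A message only disappears for good once
# the worker that handled it deletes it.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```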