
Comment by ses1984

8 days ago

The previous post describes a problem where you do a large docker build, then fan out to many jobs that need to pull this image, and the overhead is enormous. This implies RWX has less overhead. Just saying that there's a content-addressable cache doesn't explain how this particular problem is solved.

If you have a Dockerfile where a small change in your source results in one particular very large layer that has to be rebuilt, and then you want to fan out and run many parallel tests using that image, what actually happens when you try to run that new fat layer on a bunch of compute, and how is it better than the implied naive solution? That fat layer exists on a storage system somewhere, and a bunch of compute nodes need to read it. What happens?

There are three main things we do to solve this, all of which relate to the fact that we have our own (OCI-compatible) container runtime under the hood instead of using Docker.

1. We don't gzip layers like Docker does. Gzip is really slow, much slower than the network, and storage is cheap, so it's much faster to transmit uncompressed layers than to transmit compressed layers and decompress them.
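To make that concrete, here's a rough sketch in Go of the idea, very much simplified rather than our actual runtime code: if the registry serves an uncompressed tar layer, stream it straight to disk and skip the gzip stage entirely. The media type constants are the standard OCI ones; the `pullLayer` helper and everything else is illustrative.

```go
package layers

import (
	"compress/gzip"
	"io"
	"net/http"
	"os"
)

// Standard OCI layer media types.
const (
	layerTar     = "application/vnd.oci.image.layer.v1.tar"
	layerTarGzip = "application/vnd.oci.image.layer.v1.tar+gzip"
)

// pullLayer streams one layer blob to dst, decompressing only when the
// registry actually served it gzipped.
func pullLayer(blobURL, mediaType, dst string) error {
	resp, err := http.Get(blobURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var src io.Reader = resp.Body
	if mediaType == layerTarGzip {
		// The gzip stage, not the network, is usually the bottleneck.
		gz, err := gzip.NewReader(resp.Body)
		if err != nil {
			return err
		}
		defer gz.Close()
		src = gz
	}

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, src)
	return err
}
```

The point is just that you only pay the decompression cost when a layer actually arrives gzipped; layers we produce never do.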

2. We've heavily tuned our agents for pulling layers fast. Disk throughput and IOPS are really important, so we provision those higher than you typically would for running workloads in the cloud. When pulling layers we adjust kernel parameters like dirty_ratio to values that we've empirically found work well for layer pulls. We make sure we completely saturate our network bandwidth when pulling layers. And so on.
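As a toy illustration of that kind of tuning (not our actual code, and the value in the usage comment is a placeholder rather than what we really use), here's what bumping dirty_ratio around a pull might look like:

```go
package tuning

import "os"

// setDirtyRatio writes a new value to /proc/sys/vm/dirty_ratio so the kernel
// buffers more dirty pages before forcing writeback, and returns a function
// that restores the previous value. Requires root.
//
// Usage (placeholder value, not what we actually run with):
//   restore, err := setDirtyRatio("60")
//   defer restore()
func setDirtyRatio(percent string) (restore func() error, err error) {
	const path = "/proc/sys/vm/dirty_ratio"

	old, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if err := os.WriteFile(path, []byte(percent), 0o644); err != nil {
		return nil, err
	}
	return func() error { return os.WriteFile(path, old, 0o644) }, nil
}
```

You could do the same thing with `sysctl -w vm.dirty_ratio=...`; the point is that it's tuned around layer pulls rather than left at a general-purpose default.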

3. This third one is experimental and something we're actively working on improving, but we have our own underlying filesystem which lazily loads the files from a layer instead of pulling tons of (potentially unneeded) files up front. This is similar to AWS's [Seekable OCI](https://github.com/awslabs/soci-snapshotter) but tuned for our particular needs.
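Conceptually the lazy loading works something like this sketch (again a simplification, not our actual filesystem, and the `fileEntry` index type is invented for illustration): an index built ahead of time records where each file lives inside the layer blob, and the filesystem fetches just that byte range the first time the file is read.

```go
package lazyfs

import (
	"fmt"
	"io"
	"net/http"
)

// fileEntry records where one file's bytes live inside the layer blob.
type fileEntry struct {
	Offset int64
	Size   int64
}

// readFile fetches only the bytes for one file from the remote layer blob,
// using an HTTP Range request instead of downloading the whole layer.
func readFile(blobURL string, e fileEntry) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return nil, err
	}
	// Ask the blob store for just this file's slice of the layer.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", e.Offset, e.Offset+e.Size-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("range request not supported: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

The real thing has a lot more going on (caching, prefetching, falling back to a full pull), but that's the core idea.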

I've been slowly working on improving our documentation to explain these kinds of differentiators that our architecture and container runtime provide, but most of it is unpublished so far. We definitely need to do a much better job of explaining _how_ we are faster and better rather than just stating it :).

The other side of this is that we also made _building_ those layers much, much faster. We blogged a little bit about it at https://www.rwx.com/blog/we-deleted-our-dockerfiles, but just to hit some quick notes: in RWX you can vary the compute per task, and it turns out that throwing a big machine at (e.g.) `npm install` is quite effective. We also make using an incremental cache very easy, and layers generated from an incremental cache contain only the incremental parts, so they tend to be smaller. Builds are a DAG, so you can parallelize your setup in a way that is very painful to do with Docker, even when using multi-stage builds. Our cache registry is global and very hard to mess up, whereas a lot of people misconfigure their Docker caches and end up with cache misses all over their builds. And we have miss-then-hit semantics for caching. Okay, I'm rambling now! But happy to go into more depth on any of this!