
Comment by quotemstr

6 years ago

Why? It'll force a shift to a more elegant and general model of specifying software environments. We shouldn't be relying on specific images but on specific, reproducible instructions for building images. Relying on a specific Docker image for reproducible science is like relying on a hunk of platinum and iridium to know how big a kilogram is: artifact-based science is totally obsolete.

Hmm, what if the instructions say to fetch a binary that was deprecated 5 years ago?

What if they use a patched version of a weird library?

Software preservation is a huge topic, and it is not done based on instructions.

  • The FreeBSD Ports tree specifies package building via reproducible instructions, and handles things like running extra patches for compatibility and security on source distributions. FreeBSD binary packages are simply packaged ports.

  • There will always be these cases. The issue is that in many fields it is the norm rather than an exception.

I couldn't agree more. The defense of images over instructions to build them has often been "scientists don't work this way", but to me that's either overly cynical or an indication that something is rotting in academic incentive structures.

  • You could say the same about distributing docker images for deploying code for non-scientific software as well (and honestly, it may very well be true).

    But that doesn't change the fact that it's just way easier to skim a paper and pull a docker image than follow every paper's custom build instructions and software stack.

    • Why would build instructions have to be custom? Making a reproducible image should be as easy as getting a docker image.

  • > rotting

    I would not say rotting. From my perspective, the academic community has always lagged behind engineering best practices (except in their specific fields).

These reproducible instructions you speak of are already present in Dockerfiles.

It seems like you're arguing against using docker images, when docker builds solve the very issue you speak of.

Correct me if I'm wrong...?

  • A Dockerfile is not a reproducible set of build instructions in most cases. I'd guess that the vast majority of Dockerfiles are not reproducible.

    Let's look at an example Dockerfile for redis (based on [0])

        FROM debian:buster-slim
        RUN apt-get update && apt-get install -y --no-install-recommends gcc
        RUN wget http://download.redis.io/releases/redis-6.0.6.tar.gz && tar xvf redis* && cd redis-6.0.6 && make install
    

    (Note, modified from upstream for this example; won't actually build)

    The unreproducible bits are the following:

    1. FROM debian:buster-slim -- unreproducible, the base image may change

    2. apt-get update && apt-get install -- unreproducible, will give a different version of gcc and other apt packages at different times

    Those two sources of unreproducibility are so core to the image that they make every later step unreproducible too.

    As a result, when you 'docker build' that over time, it's very unlikely you'll get a bit-for-bit identical redis binary at the other end. Even a minor gcc version change will likely result in a different binary.
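
    To make "bit-for-bit identical" concrete, one option is to hash the artifact from each build and compare the digests. A minimal Python sketch, assuming the two redis binaries have already been copied out of their images; the helper names `sha256_of` and `builds_match` are hypothetical, not part of Docker or any other tool:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(a: Path, b: Path) -> bool:
    """True if both build outputs have the same SHA-256 digest."""
    return sha256_of(a) == sha256_of(b)
```

    If two 'docker build' runs a few months apart give different digests for the same binary, the build is not reproducible in the sense used here.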

    As a contrast to this, let's look at a reproducible build of redis using nix. In nixpkgs, it looks like so [1].

    If I want a reproducible shell environment, I simply have to pin down its dependencies, which can be done by the following:

        let
          pkgs = import (builtins.fetchTarball {
            url = "https://github.com/NixOS/nixpkgs/archive/48dfc9fa97d762bce28cc8372a2dd3805d14c633.tar.gz";
            sha256 = "0mqq9hchd8mi1qpd23lwnwa88s67ac257k60hsv795446y7dlld2";
          }) {};
        in pkgs.mkShell {
          buildInputs = [ pkgs.redis ];
        }
    

    If I distribute that nix expression, and say "I ran it with nix version 2.3", that is sufficient for anyone to get a bit-for-bit identical redis binary. Even if the binary cache (which lets me not compile it) were to go away, that nixpkgs revision expresses the build instructions, including the exact version of gcc. Sure, if the binary cache were deleted, it would take multiple hours for everything to compile, but I'd still end up with a bit-for-bit identical copy of redis.

    This is true of the majority of nix packages. All commands are run in a sandbox with no access to most of the filesystem or network, encouraging reproducibility. Network access is mediated by special functions (like fetchTarball and fetchGit) which require including a sha256.

    All network access going through those specially denoted means of network IO means it's very easy to back up all dependencies (i.e. the redis source code referenced in [1]), and the sha256 means it's easy to use mirrors without having to trust them to be unmodified.
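
    The principle behind hash-pinned fetching can be sketched in a few lines of Python. This is only an illustration of the idea, not how nix implements its fetchers: `verify_pinned`, `fetch_pinned` and `HashMismatch` are hypothetical names, each source is modeled as a plain callable returning bytes, and the sketch uses hex digests over raw bytes, whereas nix has its own hash encoding.

```python
import hashlib


class HashMismatch(Exception):
    """Raised when fetched bytes do not match the pinned digest."""


def verify_pinned(data: bytes, expected_sha256: str) -> bytes:
    """Return data unchanged if its SHA-256 matches the pin, else raise."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256.lower():
        raise HashMismatch(f"expected {expected_sha256}, got {actual}")
    return data


def fetch_pinned(sources, expected_sha256: str) -> bytes:
    """Try each source in order and accept the first one whose content
    matches the pin. Sources are zero-argument callables returning bytes,
    so an untrusted mirror is as good as the origin."""
    errors = []
    for source in sources:
        try:
            return verify_pinned(source(), expected_sha256)
        except Exception as exc:  # a failed or tampered source is skipped
            errors.append(exc)
    raise HashMismatch(f"no source produced the pinned content: {errors}")
```

    Because the pin decides what is accepted, it does not matter which mirror or backup the bytes come from.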

    It's possible to make an unreproducible nix package, but it requires going out of your way to do so, and rarely happens in practice. Conversely, it's possible to make a reproducible dockerfile, but it requires going out of your way to do so, and rarely happens in practice.

    Oh, and for bonus points, you can build reproducible Docker images using nix. This post has a good intro to how to play with that [2].

    [0]: https://github.com/docker-library/redis/blob/bfd904a808cf68d...

    [1]: https://github.com/NixOS/nixpkgs/blob/a7832c42da266857e98516...

    [2]: https://christine.website/blog/i-was-wrong-about-nix-2020-02...

    • Unless something changed in the months since I have used Nix, this will not get you bit-for-bit reproducible builds. Nix builds its hash tree from the source files of your package and the hashes of its dependencies. The build output is not considered at any step of the process.

      I was under the impression that Nix also wants to provide bit-for-bit reproducible builds, but that that is a much longer term goal. The immediate value proposition of Nix is ensuring that your source and your dependencies' source are the same.
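
      The difference between hashing inputs and hashing outputs can be sketched in Python. This is a simplified model under my own assumptions, not Nix's actual derivation hashing; `input_hash` and its arguments are hypothetical.

```python
import hashlib


def input_hash(name, source, dep_hashes, build_script):
    """Hash a package from its inputs only: name, source bytes,
    build script, and the input hashes of its dependencies.
    The build output never enters the hash."""
    h = hashlib.sha256()
    for part in (name.encode(), source, build_script.encode()):
        # Length prefix keeps the concatenation unambiguous.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    for dep in sorted(dep_hashes):  # order-independent over dependencies
        h.update(dep.encode())
    return h.hexdigest()


gcc = input_hash("gcc", b"gcc source", [], "make")
redis = input_hash("redis", b"redis source", [gcc], "make install")
```

      Changing the gcc source changes the hash of redis, but two builds from identical inputs share a hash even if the compiler embeds a timestamp and the binaries differ. Checking the output is a separate step.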


    • Exactly. Basically, if your product needs network access during build, you don't have a reproducible build, and if you don't have a reproducible build, it's only a matter of time before something goes horribly wrong.