← Back to context

Comment by lmm

4 days ago

> Which will reproduce the environment exactly, it's just not a guidebook on how it was built.

By that logic every binary artifact is a "reproducible build". The point of reproducibility isn't just to be able to reproduce the exact same artifact, it's to be able to make changes that have predictable effects.

> The value of most reproducibility at the Dockerfile is that we're actually agnostic to getting a byte-exact reproduction: what we want is the ability to record what was important and effect upgrades.

More or less true. But we don't have that, because of what grandparent said; if a Dockerfile used to work and now doesn't, and there's an apt-get update in it, who knows what version it was getting back when it was working, or how to fix the problem?

I do get the theoretical annoyance of how it’s technically not reproducible, but in practice most containers are pulled and not built from scratch. If you’re really concerned about that apt-get then besides a container registry you’re going to host a private package repository too, or install a versioned tarball from a public URL, but check the hash of whatever you’re downloading and put that hash in the dockerfile.

So in practice.. if the build described in the dockerfile breaks, you notice when you’re changing / extending the dockerfile.. which is the time and place where you’d expect to need to know. My guess is that most people complaining about deterministic builds for containers are not using registries for storing images, and are not deploying to platforms like k8s. If your process is, say, shipping dockerfiles to EC2 and building them in situ with “compose up” or something, then of course it won’t be very deterministic and you’re at the mercy of many more network failures, etc

  • > If you’re really concerned about that apt-get then besides a container registry you’re going to host a private package repository too, or install a versioned tarball from a public URL, but check the hash of whatever you’re downloading and put that hash in the dockerfile.

    Right, but if you're doing that then you probably don't need to bother with the docker part at all.

    > you notice when you’re changing / extending the dockerfile.. which is the time and place where you’d expect to need to know

    You notice, sure, but you can't see what it was. Like, sure, it's better than it failing when you come to deploy it, but you're still in the position of having to do software archaeology to figure out how it ever worked in the first place.

    Like, for me the main use case for reproducible builds is "I need to make a small change to this component that was made by someone who left the company 3 years ago and has been quietly running since then", and you want to be able to just run the build and be confident it's going to work. You don't necessarily need to build something byte-for-byte identical to the thing that's currently running, but you do need to build something equivalent. The reproducibility isn't important per se but the declarativeness is, and with Docker you don't get that.

  • The issue is it's why are you trying to be reproducible. The best use case is proving authenticity: that the source code became the binary code as written, but we're so far away from that that it's not realistic.

    My dream system would be CI which gives me a gigantic object graph and can sign the source code from the ground up for every single thing including the compiler, so when a change happens you can drill down to what changed, and what the diffs were.

None of this has anything to do with Dockerfile but the tools used within.

Nix provides the tooling to do reproducible builds. Meanwhile docker is a wrapper around the tools you choose.

Also just to note, docker does allow you disable network access during builds. Beyond Dockerfile, which is a high level DSL, the underlying tech can do this per build step (in buildkit LLB).