Comment by XorNot
4 days ago
You can just store the actual container though. Which will reproduce the environment exactly, it's just not a guidebook on how it was built.
The value of most reproducibility at the Dockerfile level is that we're actually agnostic to getting a byte-exact reproduction: what we want is the ability to record what was important and effect upgrades.
> Which will reproduce the environment exactly, it's just not a guidebook on how it was built.
By that logic every binary artifact is a "reproducible build". The point of reproducibility isn't just to be able to reproduce the exact same artifact, it's to be able to make changes that have predictable effects.
> The value of most reproducibility at the Dockerfile level is that we're actually agnostic to getting a byte-exact reproduction: what we want is the ability to record what was important and effect upgrades.
More or less true. But we don't have that, because of what the grandparent comment said: if a Dockerfile used to work and now doesn't, and there's an apt-get update in it, who knows what version it was getting back when it worked, or how to fix the problem?
I do get the theoretical annoyance of how it’s technically not reproducible, but in practice most containers are pulled and not built from scratch. If you’re really concerned about that apt-get, then besides a container registry you’re going to host a private package repository too, or install a versioned tarball from a public URL; either way, check the hash of whatever you’re downloading and put that hash in the Dockerfile.
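For example, a minimal sketch of that hash-pinning pattern (the URL and sha256 below are placeholders, not real values):

```
FROM python:3.12-slim

# Fetch a versioned tarball and refuse to continue unless it matches the expected hash
# (URL and sha256 are placeholders)
ADD https://example.com/releases/foo-1.2.3.tar.gz /tmp/foo-1.2.3.tar.gz
RUN echo "0000000000000000000000000000000000000000000000000000000000000000  /tmp/foo-1.2.3.tar.gz" \
      | sha256sum -c - \
 && tar -xzf /tmp/foo-1.2.3.tar.gz -C /opt
```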
So in practice, if the build described in the Dockerfile breaks, you notice when you’re changing or extending the Dockerfile, which is the time and place where you’d expect to need to know. My guess is that most people complaining about deterministic builds for containers are not using registries for storing images, and are not deploying to platforms like k8s. If your process is, say, shipping Dockerfiles to EC2 and building them in situ with “compose up” or something, then of course it won’t be very deterministic and you’re at the mercy of many more network failures, etc.
> If you’re really concerned about that apt-get, then besides a container registry you’re going to host a private package repository too, or install a versioned tarball from a public URL; either way, check the hash of whatever you’re downloading and put that hash in the Dockerfile.
Right, but if you're doing that then you probably don't need to bother with the docker part at all.
> you notice when you’re changing or extending the Dockerfile, which is the time and place where you’d expect to need to know
You notice, sure, but you can't see what it was. It's better than it failing when you come to deploy it, but you're still in the position of having to do software archaeology to figure out how it ever worked in the first place.
Like, for me the main use case for reproducible builds is "I need to make a small change to this component that was made by someone who left the company 3 years ago and has been quietly running since then", and you want to be able to just run the build and be confident it's going to work. You don't necessarily need to build something byte-for-byte identical to the thing that's currently running, but you do need to build something equivalent. The reproducibility isn't important per se but the declarativeness is, and with Docker you don't get that.
The issue is why you're trying to be reproducible in the first place. The best use case is proving authenticity: that the source code became the binary as written, but we're so far away from that that it's not realistic.
My dream system would be a CI that gives me a gigantic object graph and can sign the source code from the ground up for every single thing, including the compiler, so when a change happens you can drill down to what changed and what the diffs were.
None of this has anything to do with the Dockerfile itself; it's about the tools used within it.
Nix provides the tooling to do reproducible builds. Meanwhile docker is a wrapper around the tools you choose.
Also just to note, Docker does allow you to disable network access during builds. Beyond the Dockerfile, which is a high-level DSL, the underlying tech can do this per build step (in BuildKit LLB).
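For instance, with the BuildKit Dockerfile frontend it looks something like this (a sketch; the build script name is made up, and flag availability depends on your Docker version):

```
# syntax=docker/dockerfile:1
FROM debian:stable-slim

COPY . /src
WORKDIR /src
# This step runs with networking disabled, so any stray download fails loudly
RUN --network=none ./build-from-vendored-sources.sh
```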
I'm not talking about a bit-perfect reproduction though, just being able to understand dependencies. Take, for example, a simple Dockerfile like
```
FROM python:latest
ADD . .
RUN pip install foo
```
If I run this today, and I run this a year from now, I'm going to get different versions of `python` and `foo`, and there is no way (with just the Dockerfile) to know which versions of `foo` and `python` were intended.
Nix, on the other hand, forces me to use a git sha[0] of my dependency; there is no concept of a mutable input. So to your point, it's hard to 'upgrade' from version a -> b in a controlled fashion if you don't know what `a` even was.
[0]: or the sha256 of the dependency, which, yes, I understand is not easy for humans to use.
Well, what about "FROM python:3.18" and using requirements.txt or something like that? I mean, running an arbitrary Python version will get you in trouble anyway.
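For what it's worth, a plain Dockerfile can be pushed a fair way in that direction (a sketch; the digest, versions, and hashes here are placeholders):

```
# Pin the base image by digest instead of a mutable tag (placeholder digest)
FROM python:3.12-slim@sha256:0000000000000000000000000000000000000000000000000000000000000000

COPY requirements.txt .
# requirements.txt pins exact versions with hashes, e.g.
#   foo==1.2.3 --hash=sha256:...
# --require-hashes makes pip reject anything that isn't fully pinned
RUN pip install --require-hashes -r requirements.txt
```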
Depends on the repository used, the actual point release of Python, compiled features, the way the requirements file is written, the way the current version of pip resolves them, the base of that Python image, and a lot of other things. Doing that gives you maybe 30% of the way towards something reproducible and consistent.
There's no mechanism to enforce that this is done consistently. With Nix, there is.
The degree to which this guarantee is useful or necessary depends on your use case.