
Comment by ahmedfromtunis

2 years ago

Stupid question as I never worked on something like this before: why isn't reproducibility the default behavior?

I mean if 2 copies of a piece of software were compiled from the same source, what stops them from being identical each and every time?

I know there are so many moving parts, but I still can't understand how discrepancies can manifest themselves.

There are many specific causes, time stamps probably being the most common issue. You can see a list of common issues here:

https://reproducible-builds.org/docs/

The main overall issue is that developers don't test to ensure they reproduce. Once it's part of the release tests it tends to stay reproducible.

  • I agree, although I wouldn't describe the overall issue as developers not testing to ensure reproducibility. The reason most builds aren't reproducible is that build reproducibility isn't a goal for most projects.

    It would be great if 100% of builds were reproducible, but I don't believe developers should be testing for reproducibility unless it's a defined goal.

    As generalized reproducible build tooling (guix, nix, etc.) becomes more mainstream, I imagine we'll see more reproducible builds as adoption grows and reproducibility is no longer something developers have to "check for", but simply rely upon from their tooling.

    • It's also because the cost of making things reproducible is still too high.

      We have the tooling, but it still takes a bit of effort from the developer's side to integrate those into their CI pipeline.

      Eventually we will get to a place where this will be the default. It will be integrated into day-to-day tooling like `cargo release`, `npm publish`, ...
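
To make the timestamp issue from the parent thread concrete, here's a minimal Python sketch. The function names are made up for illustration; `SOURCE_DATE_EPOCH` is the real convention documented at reproducible-builds.org for pinning embedded timestamps to a declared build input:

```python
import os
import time

def build_artifact(source: bytes) -> bytes:
    # Naive build step: stamps the artifact with the current wall-clock
    # time, so two builds of identical source differ byte-for-byte.
    stamp = str(time.time()).encode()
    return source + b"\nbuilt-at: " + stamp

def build_artifact_reproducibly(source: bytes) -> bytes:
    # Reproducible variant: the timestamp comes from the environment
    # (the SOURCE_DATE_EPOCH convention), i.e. it is a declared input
    # that any reproducer can supply identically.
    stamp = os.environ.get("SOURCE_DATE_EPOCH", "0").encode()
    return source + b"\nbuilt-at: " + stamp

os.environ["SOURCE_DATE_EPOCH"] = "1700000000"
a = build_artifact_reproducibly(b"main(){}")
b = build_artifact_reproducibly(b"main(){}")
assert a == b  # identical inputs now yield identical bytes
```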

Loads of things. Obvious ones where the decision is explicitly taken to be non-reproducible include timestamps and authorship information. There are also other places where reproducibility is implicitly broken by default: e.g. many runtimes don't define the order of entries in a hashmap, and then the compiler iterates over a hashmap to build the binary.
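
The hashmap point can be sketched in a few lines of Python (CPython randomizes string hashing per process, so set iteration order genuinely varies from run to run; the symbol names are made up):

```python
# Symbols discovered during compilation, collected in a set. With hash
# randomization (the default for str in CPython), iteration order can
# differ from one process to the next, so emitting entries as-is makes
# the output depend on the run.
symbols = {"init", "main", "helper", "cleanup"}

def emit_unstable(syms):
    # Order depends on the per-process hash seed -> irreproducible.
    return "\n".join(syms)

def emit_stable(syms):
    # Sorting imposes a canonical order -> reproducible output.
    return "\n".join(sorted(syms))

print(emit_stable(symbols))  # always the same across runs
```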

  • I can see why devs would want "This Software was built on 10/10/2007 by bob7 from git hash aaffaaff" to appear on the splash screen of software.

    How do you get similar behaviour while having a reproducible build?

    Can you, for example, have the final binary contain a reproducible part, and another section of the elf file for deliberately non-reproducible info?

      If you have a reproducible build, then the notion of "software was built on date by user" is kind of useless information, no? It doesn't matter: if you can verify through reproducible builds that a specific git hash of a codebase results in a particular binary, a malicious adversary could have built it yesterday and handed it to me, and I can be almost certain (barring hash collisions...) that it's identical to what a known, trusted team member would have built.

      Information about which git hash was used, as well as the time it was published, is part of the source distribution, so an output can contain references to these inputs and still be deterministic with respect to them.

      If you REALLY want to know when/who built something, you could add an auxiliary source file containing that information and make it a required build input. That is essentially what compilers that embed the current time do anyway; the input is just implicit.


    • Yeah, "who built this" information belongs in a signing certificate that accompanies the build artefact, not in the artefact itself. The Git hash can certainly appear in the binary (it's a reproducible part of the build input), and the date can instead be e.g. the commit date, which is probably more relevant to a user anyway.


    • You can still include the git hash or a git tag/release version info, since the reproducer has the same git repo anyway.

      But including timestamp of build would necessitate “spoofing” the timestamp by the reproducer to be the same as the original.
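
A minimal sketch of the idea running through this thread, treating build metadata as a declared input rather than ambient state (the file name, git hash, and helper are hypothetical; `SOURCE_DATE_EPOCH` is the convention from reproducible-builds.org for supplying the "spoofed" timestamp):

```python
import os

def write_build_info(path, git_hash, commit_epoch):
    # All values are build *inputs*: the git hash comes from the source
    # checkout and the timestamp from SOURCE_DATE_EPOCH (e.g. the commit
    # date), so the generated file is identical for every reproducer.
    with open(path, "w") as f:
        f.write(f'GIT_HASH = "{git_hash}"\n')
        f.write(f"BUILD_EPOCH = {commit_epoch}\n")

epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))
write_build_info("build_info.py", "aaffaaff", epoch)
```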

Parallelism. There might be actions that are not order-independent, and scheduling or CPU state can result in slightly different binaries, all of which are correct.

  • Why does this matter though? Why does order of compilation result in a different binary?

    • Just some random, made-up example: say you want to compile an OOP language that has interfaces and implementations of them. You discover reachable implementations through static analysis, which is multi-threaded. You might discover implementations A, B, C in any order — but they will get their methods placed in the jump table based on this order. This will trivially result in semantically equivalent, but not binary-equivalent executables.

      Of course there would have been better designs for this toy example, but binary reproducibility is/was usually not of the highest priority historically in most compiler infrastructures, and in some cases it might be a relatively big performance regression to fix, or simply just a too big refactor.

    • Because order of completion of the parallel tasks is not guaranteed, if all tasks write to the same file you might get a different result each time.

  • > There might be actions that are not order-independent, and the state of the CPU might result in slightly different binaries, but all are correct.

    Well, no. That's exactly what reproducible packages demonstrate: there's only one correct binary.

    And it's the one that's 100% reproducible.

    I'd even say that that's the whole point: there's only one correct binary.

    I'll die on the hill that if different binaries are "all correct", then none are: for me they're all useless if they're not reproducible.

    And it looks like people working on entire .iso being fully bit-for-bit reproducible are willing to die on that hill too.

    • "Correct" does not mean "reproducible" just because you think lowly of irreproducible builds.

      A binary consisting of foo.o and bar.o is correct whether foo.o was linked before bar.o or vice versa, provided that both foo.o and bar.o were compiled correctly.

    • See my reply to the sibling post — binary reproducibility is not the end goal. It is an important property, and I do agree that most compiler toolchains should strive for that, but e.g. it might not be a priority for, say, a JIT compiler.
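
The completion-order hazard discussed in this thread can be sketched in Python (the "compile step" here is a stand-in hash, not a real compiler):

```python
import concurrent.futures
import hashlib

objects = ["foo.o", "bar.o", "baz.o"]

def compile_unit(name: str) -> bytes:
    # Stand-in for a compile step; real compilation would take a
    # varying amount of time per unit.
    return hashlib.sha256(name.encode()).digest()

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(compile_unit, n) for n in objects]

    # as_completed yields results in *completion* order, which can vary
    # from run to run -> concatenating in that order is irreproducible.
    by_completion = [f.result()
                     for f in concurrent.futures.as_completed(futures)]

    # Fix: do the work in parallel, but assemble in a fixed input order.
    by_input = [f.result() for f in futures]

linked = b"".join(by_input)  # deterministic regardless of scheduling
```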

Sometimes it's randomized algorithms, sometimes it's performance (e.g. it might be faster not to sort something), sometimes it's time or environment-dependent metadata, sometimes it's thread interleaving, etc.

  • A very common one is pointer values being different from run to run and across different operating systems. Any code that intentionally or accidentally relies on pointer values will be non-deterministic.
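
A Python sketch of that hazard, using `id()` as a stand-in for a pointer value (in CPython, `id()` happens to be the object's address):

```python
class Symbol:
    def __init__(self, name):
        self.name = name

syms = [Symbol("b"), Symbol("a"), Symbol("c")]

# Sorting by id() "works" within one run, but the addresses -- and
# hence the order -- change across runs and platforms, so any output
# derived from this order is irreproducible.
by_address = sorted(syms, key=id)

# Fix: sort by a stable, content-derived key instead.
by_name = sorted(syms, key=lambda s: s.name)
print([s.name for s in by_name])  # always ['a', 'b', 'c']
```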

A surprising amount of compiler and program behavior depends on how pointer values compare.

These comparisons don't have to go the same way for everything to be correct.

I don't develop enough to give a particularly good answer, but one example I've heard of involves timestamps.

Imagine the program uses the current date or time as a value. When compiled at different moments, the bits change.

Same applies to anything where the build environment or timing influences the output binary.

Laziness and carelessness of compiler developers.

  • As others have mentioned, there are sorting issues (are directory entries created in the same order for a project that compiles everything in a directory?), timestamps (archive files and many other formats embed timestamps), and things that you really want to be random (tmpdir on Linux [at least in the past] would create directories of varying length).

    I’ve successfully built tools to compare Java JARs that required getting around two of those and other test tools that required the third. I’m sure there are more.
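
That kind of content-based comparison can be sketched with Python's `zipfile` module (a JAR is a zip archive; the entry name and bytes here are made up):

```python
import io
import zipfile

def make_zip(timestamp):
    # Build an in-memory zip with one entry stamped at `timestamp`.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("Main.class", date_time=timestamp)
        zf.writestr(info, b"\xca\xfe\xba\xbe")
    return buf.getvalue()

a = make_zip((2020, 1, 1, 0, 0, 0))
b = make_zip((2024, 6, 1, 12, 0, 0))
assert a != b  # embedded timestamps make the archives differ byte-for-byte

def content_map(data):
    # Compare archives by entry name -> bytes, ignoring metadata.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {n: zf.read(n) for n in zf.namelist()}

assert content_map(a) == content_map(b)  # same contents, different bytes
```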