Comment by brutos
6 years ago
This will be quite bad for reproducible science. Publishing bioinformatics tools as containers was becoming quite popular. Many of these tools have a tiny niche audience and when a scientist wants to try to reproduce some results from a paper published years ago with a specific version of a tool they might be out of luck.
Simplest answer is to release the code with a Dockerfile. Anyone can then inspect build steps, build the resulting image and run the experiments for themselves.
Two major issues I can see are old dependencies (pin your versions!) and binaries that are out of support or no longer available.
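As a rough sketch of what pinning looks like in a Dockerfile (the digest and the package versions below are placeholders/examples, not values to copy):

```dockerfile
# Pin the base image by digest instead of a floating tag
# (<digest> is a placeholder -- fill in the real one from `docker images --digests`).
FROM debian:buster-slim@sha256:<digest>

# Pin apt packages to exact versions so a rebuild years later pulls the
# same thing -- assuming the archive still serves those versions.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc=4:8.3.0-1 \
        libc6-dev=2.28-10 && \
    rm -rf /var/lib/apt/lists/*
```

Even pinned apt versions can vanish from the mirrors eventually, which is where something like snapshot.debian.org or a private registry comes in.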
In which case, welcome to the world of long term support. It's a PITA.
You can also save the image to a file:
https://docs.docker.com/engine/reference/commandline/image_s...
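For example (image name hypothetical; requires a local Docker daemon):

```shell
# Export an image -- all layers plus metadata -- to a tarball.
docker save -o mytool-1.2.tar myorg/mytool:1.2

# Archive mytool-1.2.tar alongside the paper's data, and later:
docker load -i mytool-1.2.tar
```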
I would recommend running a registry mirror as it's fairly straightforward.
https://docs.docker.com/registry/recipes/mirror/
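The recipe in that link boils down to running the registry image as a pull-through cache, roughly:

```shell
# Run Docker's registry image as a pull-through cache of Docker Hub;
# requires a local Docker daemon.
docker run -d -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  --name registry-mirror registry:2

# Then point your daemon at it, e.g. in /etc/docker/daemon.json:
#   { "registry-mirrors": ["http://localhost:5000"] }
```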
3 replies →
That doesn't help against expiring base images though.. :/
Yeah. That’ll be a mess. The way I try to do it is to build an image for a project’s build environment and then use that to build the project. The build env image never changes and stays around forever or as long as is needed. So when you have to patch something that hasn’t been touched for 5 years you can build with the old image instead of doing a big update to the build config of the project.
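A sketch of that pattern (image and file names are hypothetical): build the build-environment image once, tag it, and reuse that frozen tag for every later build of the project:

```shell
# Build the build environment once and tag it with a date;
# never rebuild or retag this image.
docker build -f Dockerfile.buildenv -t myproj-buildenv:2020-03 .

# Years later, compile the project inside the frozen environment.
docker run --rm -v "$PWD":/src -w /src myproj-buildenv:2020-03 make
```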
Many Docker-based builds are not reproducible. Even something as simple as apt-get update exiting with code zero after a partial failure (it does this) adds complexity, and most people don’t bother doing a deep dive.
Personally I use Sonatype Nexus and keep everything important in my own registry. I don’t trust any free offerings unless they’re self hosted.
There needs to be a way to create a combined image of all dependencies to distribute with a Dockerfile and code. That way people could still modify the code and dockerfile.
1 reply →
Tern is designed to help with this sort of thing: https://github.com/tern-tools/tern#dockerfile-lock
It can take a Dockerfile and generate a 'locked' version, with dependencies frozen, so you at least get some reproducibility.
Disclaimer work for VMware; but on a different team.
The Dockerfile should always be published, but it doesn't enable reproducible builds unless the author is very careful, and even then there's no built-in support for it. It would be cool if you could embed hashes into each layer of the Dockerfile, but in practice that's very hard to achieve.
My field is doing something similar.
Reproducible science is definitely a good goal, but reproducible doesn't mean maintainable. Really scientists should be getting in the habit of versioning their code and datasets. Of course a docker container is better than nothing, but I would much rather have a tagged repository and a pointer to an operating system where it compiles.
It's true that many scientists tend to build their results on an ill-defined dumpster fire of a software stack, but the fact that docker lets us preserve these workflows doesn't solve the underlying problem.
FYI, and for anyone else still learning how to version and cite code: Zenodo + GitHub is the most feature rich and user-friendly combination I've found.
https://guides.github.com/activities/citable-code/
Thank you for mentioning Zenodo. I really liked how EU funding agencies push for reproducibility/citability of data and code when you submit proposals to them.
I haven’t filed any NSF proposals (yet), but I haven’t come across any such hard requirement there to commit the results of your research to something like Zenodo for archiving/citation purposes.
1 reply →
Zenodo is great! In theory you could also upload a docker image to Zenodo and give it a DOI, but it doesn't seem to have an especially elegant way to pull this image after the fact.
It seems you simply have to pull it every 5.99 months to not get it removed. So add all your images to a bash script and pull them every couple of weeks via crontab and you're fine.
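A minimal sketch of such a keep-alive script (image names are hypothetical): it prints the pull commands, and piping the output to `sh` would actually run them.

```shell
# List every image you need to keep alive on Docker Hub.
IMAGES="myorg/tool:1.0 myorg/tool:1.1 myorg/analysis:0.3"

# Emit one pull command per image; pipe the output to `sh` to execute,
# e.g. from a crontab entry that runs every few weeks.
for img in $IMAGES; do
  printf 'docker pull %s\n' "$img"
done
```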
On the other hand, I see the need to make money; storage and services cannot be free (someone, somewhere, always pays for them). But six months is not much for certain use cases.
"Pulling docker images every 5 months as a service"
Hey, you could distribute that as a container on Docker hub...
2 replies →
Finally a good use for that raspberry pi idling in the corner
2 replies →
Just ensure you've busted the cache, otherwise you're only pulling a joke.
I'm sure you've cited research older than 5.99 months right?
I wish they would grandfather images uploaded before this new ToS so they don't get wiped, and let future images go to more stable and accepting platforms. That way, research images published to Docker Hub before the ToS update wouldn't be lost.
Well, it sounds like someone's gotta pony up the bucks for their own image repo, rather than freeload off someone else's storage infra.
2 replies →
Oh yes I did, some probably as old as myself. Things just don't change that much in certain areas.
Why? It'll force a shift to a more elegant and general model of specifying software environments. We shouldn't be relying on specific images but specific reproducible instructions for building images. Relying on a specific Docker image for reproducible science is like relying on hunk of platinum and iridium to know how big a kilogram is: artifact-based science is totally obsolete.
Hmm, what if the instructions say to fetch a binary that was deprecated 5 years ago?
What if they use a patched version of an obscure library?
Software preservation is a huge topic, and it is not done from instructions alone.
The FreeBSD Ports tree specifies package building via reproducible instructions, and handles things like running extra patches for compatibility and security on source distributions. FreeBSD binary packages are simply packaged ports.
Include the patch in the build instructions
There will always be these cases. The issue is that in many fields it is the norm rather than an exception.
I couldn't agree more. The defense of images over instructions to build them has often been "scientists don't work this way", but to me that's either overly cynical or an indication that something is rotting in academic incentive structures.
You could say the same about distributing docker images for deploying code for non-scientific software as well (and honestly, it may very well be true).
But that doesn't change the fact that it's just way easier to skim a paper and pull a docker image than follow every paper's custom build instructions and software stack.
1 reply →
> rotting
I would not say rotting. From my perspective, the academic community has always lagged behind engineering best practices (except in their specific fields).
These reproducible instructions you speak of are already present in Dockerfiles.
It seems like you're arguing against using docker images, when docker builds solve the very issue you speak of.
Correct me if I'm wrong...?
A Dockerfile is not a reproducible set of build instructions in most cases. I'd guess that the vast majority of Dockerfiles are not reproducible.
Let's look at an example dockerfile for redis (based on [0])
(Note, modified from upstream for this example; won't actually build)
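A reconstruction of the relevant fragment (simplified; only the two lines that matter for reproducibility are kept, the real file does much more):

```dockerfile
# Simplified redis-style Dockerfile fragment; the FROM line and the
# apt-get RUN are the pieces that break reproducibility.
FROM debian:buster-slim
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc make
# ...then download, compile, and install redis from source...
```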
The unreproducible bits are the following:
1. FROM debian:buster-slim -- unreproducible, the base image may change
2. apt-get update && apt-get install -- unreproducible, will give a different version of gcc and other apt packages at different times
Those two bits of unreproducibility are so core to the image that every subsequent step is unreproducible too.
As a result, when you 'docker build' that over time, it's very unlikely you'll get a bit-for-bit identical redis binary at the other end. Even a minor gcc version change will likely result in a different binary.
As a contrast to this, let's look at a reproducible build of redis using nix. In nixpkgs, it looks like so [1].
If I want a reproducible shell environment, I simply have to pin its dependencies to one exact nixpkgs revision.
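A sketch of such a pin (the `<rev>` and `<sha256>` values are placeholders to fill in, not real values):

```nix
# Pin nixpkgs to one exact revision; the sha256 makes the fetch
# verifiable even when served from an untrusted mirror.
let
  nixpkgs = builtins.fetchTarball {
    url = "https://github.com/NixOS/nixpkgs/archive/<rev>.tar.gz";
    sha256 = "<sha256>";
  };
  pkgs = import nixpkgs {};
in
pkgs.mkShell {
  # redis comes from the pinned revision, so its whole dependency
  # graph (gcc included) is fixed.
  buildInputs = [ pkgs.redis ];
}
```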
If I distribute that nix expression, and say "I ran it with nix version 2.3", that is sufficient for anyone to get a bit-for-bit identical redis binary. Even if the binary cache (which lets me not compile it) were to go away, that nixpkgs revision expresses the build instructions, including the exact version of gcc. Sure, if the binary cache were deleted, it would take multiple hours for everything to compile, but I'd still end up with a bit-for-bit identical copy of redis.
This is true of the majority of nix packages. All commands are run in a sandbox with no access to most of the filesystem or network, encouraging reproducibility. Network access is mediated by special functions (like fetchTarball and fetchGit) which require including a sha256.
All network access going through those specially denoted means of network IO means it's very easy to back up all dependencies (i.e. the redis source code referenced in [1]), and the sha256 means it's easy to use mirrors without having to trust them to be unmodified.
It's possible to make an unreproducible nix package, but it requires going out of your way to do so, and rarely happens in practice. Conversely, it's possible to make a reproducible dockerfile, but it requires going out of your way to do so, and rarely happens in practice.
Oh, and for bonus points, you can build reproducible docker images using nix. This post has a good intro to how to play with that [2].
[0]: https://github.com/docker-library/redis/blob/bfd904a808cf68d...
[1]: https://github.com/NixOS/nixpkgs/blob/a7832c42da266857e98516...
[2]: https://christine.website/blog/i-was-wrong-about-nix-2020-02...
5 replies →
Maybe they should switch to Github. https://github.com/features/packages
Or store the containers in the Internet Archive alongside the paper. They’re just tarballs. Lots of options as long as you're comfortable with object storage.
This still means that tools published in the last few years until now might just be gone soon. The people who uploaded the images might have graduated or moved on, and no one will be there to save the work.
9 replies →
quay is another alternative.
Publishing containers to GitHub might be free, but you have to log in to GitHub to download containers from free accounts, which significantly hampers end-user usability compared to Docker Hub, particularly if 2FA is enabled on the account. As mentioned elsewhere, Quay.io might be another alternative.
We (the GitHub Packages team where I work) are working on a fix for this and a number of issues with the current docker service. You can join the beta too, details here https://github.com/containerd/containerd/issues/3291#issueco...
You don't need to register an SSH key to download a public repo I thought
6 replies →
GitHub storage for docker images is very expensive relative to free: I don’t think it’s a viable solution in this case.
They should be using Nix or similar then. The typical Dockerfile is not reproducible.
As long as the Dockerfile is released alongside, this should not be an issue.
I don't see any valid reason why anyone would upload and share a public docker image but not its Dockerfile, so I don't pull anything from Docker Hub that doesn't also have the Dockerfile on its Docker Hub page.
Dockerfiles are not guaranteed to be reproducible. They can run arbitrary logic which can have arbitrary side-effects. A classic is `wget https://example.com/some-dependency/download/latest.tgz`.
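A sketch of the fix (URL, version, and checksum are placeholders): fetch one exact release and verify its hash instead of grabbing `latest`:

```dockerfile
# Unreproducible: "latest" can point at anything tomorrow.
# RUN wget https://example.com/some-dependency/download/latest.tgz

# Better: pin one exact version and fail the build if the bytes
# ever change. <sha256> is a placeholder for the real checksum.
RUN wget https://example.com/some-dependency/download/v1.2.3.tgz && \
    echo "<sha256>  v1.2.3.tgz" | sha256sum -c -
```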
What about when the image that it is based on goes out of date and is pruned too?
This is part of why I tend to use only images built from a small set of well-established base images: scratch, alpine, debian, and occasionally ubuntu. Those base images can be handled in the same way, as can any exceptions.
A bonus to this is that you no longer have the risks of systems breaking because of Dockerhub or quay.io (which I haven't seen mentioned here yet, btw) being offline.
Couldn't journals host the images? Or some university affiliated service, let us call it "dockXiv"?
Having the images on dockerhub is more convenient, but as long as the paper says where to find the image this does not seem that bad.