Comment by maxfan8

6 years ago

Yep, just mentioned it to the Archive Team IRC. We're probably going to selectively archive particular Docker images, although that's a lot of manual labor.

If you have any ideas on how to select important images, that'd be great.

7 comments

thebouv 6 years ago

Rough idea: maintain an Awesome List of images worth saving, take submissions from the public, and use that list to automate what to pull?

maxfan8 6 years ago
Yeah, good idea — I’m not in these fields so it’s difficult for me to judge. Also, it sounds like we should be prioritizing niche images that only a handful of papers use rather than images that people rely upon regularly.
- cosmie 6 years ago
  
  Couldn't you bootstrap a list by searching/parsing the Archive dataset itself? Search for:
  A) "docker pull" commands, parsing the text after each one according to the command's syntax[1] to extract instructional references to images, such as "docker pull ubuntu:latest", and
  B) links/text beginning with "https://hub.docker.com/_/" to identify informational references to image base pages, such as https://hub.docker.com/_/ubuntu
  [1] https://docs.docker.com/engine/reference/commandline/pull/
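
A rough sketch of what that scan could look like, assuming the dataset is available as plain text. The function name `extract_image_refs` and both regexes are made up for illustration. The pull pattern only skips flags that take no value (a flag with a space-separated value would be misread as the image), and prose like "docker pull commands" will produce false positives that need filtering afterwards. Tags are not normalized, so "ubuntu:latest" and "ubuntu" are counted separately.

```python
import re
from collections import Counter

# Assumed pattern: "docker pull", any number of value-less flags, then an image reference.
PULL_RE = re.compile(r"docker\s+pull\s+(?:-\S+\s+)*([A-Za-z0-9][\w.\-/:@]*)")

# Links to image base pages under https://hub.docker.com/_/
HUB_RE = re.compile(r"https://hub\.docker\.com/_/([a-z0-9][a-z0-9._-]*)")


def extract_image_refs(text):
    """Count image references found in pull commands and Docker Hub links."""
    counts = Counter()
    for match in PULL_RE.finditer(text):
        # Drop sentence punctuation that the character class also accepts.
        counts[match.group(1).rstrip(".:/-")] += 1
    for match in HUB_RE.finditer(text):
        counts[match.group(1).rstrip("._-")] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Run docker pull ubuntu:latest, then docker pull -q nginx. "
        "See (https://hub.docker.com/_/ubuntu) for details."
    )
    for image, n in extract_image_refs(sample).most_common():
        print(image, n)
```

Running this over each document and summing the counters would give a first-pass ranking to review by hand.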
  

contravariant 6 years ago

Since images tend to be based on each other I wonder if someone's analyzed the corresponding dependency graph yet. In theory you should get quite far if you isolate the most commonly used base images.
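
One way to approximate that, as a sketch only: if you already have a pile of Dockerfiles as text (an assumption, since not every image publishes one), count what their FROM lines point at. The function name `count_base_images` is hypothetical. It skips `scratch` and multi-stage aliases declared with `AS`, and it only counts direct parents rather than walking the full graph.

```python
import re
from collections import Counter

# FROM [--platform=<platform>] <image> [AS <name>]
FROM_RE = re.compile(
    r"^[ \t]*FROM[ \t]+(?:--platform=\S+[ \t]+)?(\S+)(?:[ \t]+AS[ \t]+(\S+))?",
    re.IGNORECASE | re.MULTILINE,
)


def count_base_images(dockerfiles):
    """Count how often each image appears as a direct base in FROM lines."""
    counts = Counter()
    for text in dockerfiles:
        stage_aliases = set()  # names introduced by "AS" within this Dockerfile
        for match in FROM_RE.finditer(text):
            base, alias = match.group(1), match.group(2)
            # Skip the empty base image and references to earlier build stages.
            if base.lower() != "scratch" and base.lower() not in stage_aliases:
                counts[base] += 1
            if alias:
                stage_aliases.add(alias.lower())
    return counts


if __name__ == "__main__":
    samples = [
        "FROM ubuntu:20.04\nRUN true\n",
        "FROM golang:1.15 AS build\nRUN go build\nFROM ubuntu:20.04\n",
    ]
    for image, n in count_base_images(samples).most_common():
        print(image, n)
```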

CameronNemo 6 years ago
Are those not the images that are basically guaranteed to stay on Docker Hub?
- toomuchtodo 6 years ago
  
  “Guaranteed” is a strong word.