Comment by GordonS
6 years ago
Yes, that's what I meant when I mentioned layers; clearly copies of the same layer are not kept :) My question was about block-level, or other forms of deduplication.
Deduplication at the block level would be dependent on the choice of storage driver (https://docs.docker.com/registry/storage-drivers/). In the case of Hub, S3 is the storage medium and that's an object store rather than a block store.
In theory you could modify the spec/application to try to break layers down into smaller pieces but I have a feeling you would reach the point of diminishing returns for normal use cases pretty quickly.
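To make the idea concrete, here is a toy sketch of fixed-size block deduplication (hypothetical, not how Docker Hub or any registry actually stores data): each layer is split into blocks, each unique block is stored once keyed by its SHA-256 digest, and a per-layer "recipe" of digests lets you rebuild the layer. The restore step is exactly where the overhead the paper below talks about comes from.

```python
import hashlib

def dedup_blocks(layers, block_size=4096):
    """Toy fixed-size block deduplication: store each unique block once,
    keyed by its SHA-256 digest. Illustrative only."""
    store = {}    # digest -> block bytes (the deduplicated store)
    recipes = []  # per-layer list of digests needed to reconstruct it
    for data in layers:
        recipe = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # keep only the first copy
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes

def restore(store, recipe):
    """Rebuild a layer from its recipe -- this is the 'restore overhead'."""
    return b"".join(store[d] for d in recipe)

# Two "layers" sharing a common prefix deduplicate at block granularity.
common = b"A" * 8192
a, b = common + b"layer-a", common + b"layer-b"
store, recipes = dedup_blocks([a, b])
unique_bytes = sum(len(v) for v in store.values())
print(len(a) + len(b), "->", unique_bytes)  # 16398 -> 4110
assert restore(store, recipes[0]) == a
```

Shrinking `block_size` finds more shared blocks but inflates the digest index and multiplies the number of lookups per restore, which is the diminishing-returns trade-off mentioned above.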
I found this recent paper interesting: https://www.usenix.org/conference/atc20/presentation/zhao
> Containers are increasingly used in a broad spectrum of applications from cloud services to storage to supporting emerging edge computing paradigm. This has led to an explosive proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of registries, which store and serve images. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the severe storage requirements of the growing registries. However, existing deduplication techniques largely degrade the performance of registry because of layer restore overhead. In this paper, we propose DupHunter, a new Docker registry architecture, which not only natively deduplicates layer for space savings but also reduces layer restore overhead. DupHunter supports several configurable deduplication modes, which provide different levels of storage efficiency, durability, and performance, to support a range of uses. To mitigate the negative impact of deduplication on the image download times, DupHunter introduces a two-tier storage hierarchy with a novel layer prefetch/preconstruct cache algorithm based on user access patterns. Under real workloads, in the highest data reduction mode, DupHunter reduces storage space by up to 6.9x compared to the current implementations. In the highest performance mode, DupHunter can reduce the GET layer latency up to 2.8x compared to the state-of-the-art.
This is really interesting, thanks for posting it! It's exactly the kind of thing I was thinking of, even if I expected a comment like yours to come from someone at Docker ;)