Comment by ks2048
1 day ago
I'm not sure binary diffs are the problem - e.g. for storing images or MP3s, binary diffs are usually worse than nothing.
I would think that git would need a parallel storage scheme for binaries. Something that does binary chunking and deduplication between revisions, but keeps the same merkle referencing scheme as everything else.
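Something like the following is a minimal sketch of that idea, assuming a toy rolling hash and a plain dict as the object store; none of the names or parameters come from git, and a real implementation (FastCDC-style chunking, proper tree objects over the chunk ids) would look quite different:

    import hashlib

    MASK = (1 << 13) - 1              # ~8 KiB average chunk size (illustrative)
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunks(data: bytes):
        """Split data at content-defined boundaries using a toy rolling sum."""
        start, rolling = 0, 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
            size = i - start + 1
            at_boundary = (rolling & MASK) == 0 and size >= MIN_CHUNK
            if at_boundary or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, rolling = i + 1, 0
        if start < len(data):
            yield data[start:]

    def store(data: bytes, object_store: dict) -> list[str]:
        """Store chunks content-addressed; return the list of chunk ids."""
        ids = []
        for chunk in chunks(data):
            cid = hashlib.sha256(chunk).hexdigest()
            object_store.setdefault(cid, chunk)   # identical chunks stored only once
            ids.append(cid)
        return ids

The chunk-id list for each file could then itself be hashed into a tree object, which is how it would keep the same merkle referencing scheme as the rest of the store.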
> binary chunking and deduplication
Are there many binaries that people would store in git where this would actually help? I assume most end up compressed or subject to some other form of randomization between revisions, making deduplication futile.
A lot in the game and visual art industries.
2-3x reduction in repository size compared to Git LFS in this test:
https://xethub.com/blog/benchmarking-the-modern-development-...
I don't know; it all depends on the data distribution in the repository, which makes one optimization strategy better than another. Git-annex, IIRC, does file-level dedupe. That would take care of most of the problem if you're storing binaries that are compressed or encrypted. Going beyond that is a lot of work, which is probably one reason no one has bothered to do it for git yet. But Borg and Restic both do chunked dedupe, I think.
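For contrast, file-level dedupe (the granularity attributed to git-annex here) is nearly trivial; a hedged sketch with a hypothetical add_file helper and a plain directory as the store, not anything resembling git-annex's actual code:

    import hashlib, shutil
    from pathlib import Path

    def add_file(path: Path, store_dir: Path) -> str:
        """Key a file by the hash of its full contents (file-level dedup)."""
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        target = store_dir / digest
        if not target.exists():          # only store content we haven't seen before
            shutil.copyfile(path, target)
        return digest                    # the revision metadata records only this key

An unchanged binary then costs nothing in a new revision, but any change at all stores a complete new copy, which is why it only helps when files mostly don't change.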
It would likely require more tooling.
Xet uses block-level deduplication.
> for storing images or MP3s, binary diffs are usually worse than nothing
Editing the ID3 tag of an MP3 file or changing the rating metadata of an image gives a big advantage to block-level deduplication. Only a few such cases are needed to more than compensate for the worse-than-nothing inefficiency of binary diffs when there's nothing to deduplicate.
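A toy illustration of the point, under the assumption that the edit doesn't change the file's length so fixed-size blocks stay aligned (the sizes and the "TAG" header are made up, not real ID3 structure):

    import hashlib, os

    BLOCK = 64 * 1024

    def block_ids(data: bytes):
        # Hash each fixed-size block; blocks with matching hashes deduplicate.
        return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)]

    audio = os.urandom(8 * 1024 * 1024)       # stand-in for the unchanged audio frames
    v1 = b"TAG1" + bytes(124) + audio         # original file with an ID3-like header
    v2 = b"TAG2" + bytes(124) + audio         # same file after editing only the tag

    a, b = block_ids(v1), block_ids(v2)
    changed = sum(x != y for x, y in zip(a, b))
    print(f"{changed} of {len(a)} blocks differ")   # only the first block changes

With edits that shift lengths you'd want content-defined chunking instead of fixed blocks, so boundaries resynchronize after an insertion.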