Comment by rswail
18 hours ago
Is the problem that we don't have good "diff" equivalents for binaries that git could use to only store those diffs like the old RCS/CVS for large files?
Subversion used to do that; it probably still does, actually, and it also only checks out the latest revision. Svn is a bother in other ways of course, like being worse at regular version control and only being usable with access to the server, etc.
There's a bunch of binary files that change a lot on small changes, due to compression or how the data is serialised, so the problem doesn't go away completely. One could conceivably start handling those formats specifically, but there are lots of file formats out there, and the sum of that complexity tends to be bugs and security issues.
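To make the compression point concrete, here's a small Python sketch (the data and sizes are made up for illustration): flip one byte of the input, recompress with zlib, and almost nothing in the compressed output lines up any more, which is why byte-level deltas do so poorly on compressed formats.

```python
import zlib

# Mildly repetitive "document" data standing in for a real file's contents.
original = b"".join(b"record %06d: some payload text\n" % i for i in range(20000))

# Simulate a small edit: flip a single byte near the beginning.
edited = bytearray(original)
edited[100] ^= 0xFF
edited = bytes(edited)

comp_a = zlib.compress(original, 6)
comp_b = zlib.compress(edited, 6)

# Measure how far the two compressed streams stay identical.
prefix = 0
for x, y in zip(comp_a, comp_b):
    if x != y:
        break
    prefix += 1

print(f"compressed sizes: {len(comp_a)} and {len(comp_b)} bytes")
print(f"identical prefix: {prefix} bytes")
# A one-byte edit near the start of the input typically leaves only a handful
# of identical bytes at the front of the compressed stream, so storing a delta
# between the two compressed files saves almost nothing.
```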
Potentially with a new blob type, but maintaining a reverse diff would be difficult: if you stored the previous version as a diff, its hash would change.
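The hash problem follows directly from how Git addresses objects. A minimal sketch of the blob-id scheme (this is how Git hashes blobs; the reverse-diff encoding below is just a hypothetical placeholder):

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # A Git blob id is the SHA-1 of a "blob <size>\0" header plus the content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

old_version = b"big binary, version 1 ..."
reverse_diff = b"(hypothetical encoding: how to rebuild v1 from v2)"

print(git_blob_id(old_version))   # the id every existing tree/commit points at
print(git_blob_id(reverse_diff))  # storing a diff instead produces a new id
# Swapping the full old blob for a reverse diff would change its id, so every
# tree and commit referencing it would have to be rewritten -- hence the idea
# of a new object type rather than reusing plain blobs.
```

(Pack-file deltas sidestep this because they live below the object-id layer, which is part of why Git can already delta-compress its storage.)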
Another alternative would be storing the chunks as blobs, so that you reconstruct the full binary from its chunks and only have to store the ones that changed. However, that doesn't work with compressed binaries.
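That's essentially content-defined chunking, the approach tools like rsync and restic take. A toy Python sketch of the idea (the window size and boundary mask are arbitrary values picked for illustration, not anything Git or those tools actually use):

```python
import hashlib
import random
import zlib

def chunk(data: bytes, window: int = 32, mask: int = 0x3FF):
    """Cut wherever a CRC of the trailing `window` bytes matches a bit pattern,
    so cut points follow the content rather than fixed offsets."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if i - start >= window and (zlib.crc32(data[i - window:i]) & mask) == mask:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def store(chunks, blob_store):
    """Content-address each chunk; chunks already present cost nothing extra."""
    ids = []
    for c in chunks:
        cid = hashlib.sha256(c).hexdigest()
        blob_store.setdefault(cid, c)
        ids.append(cid)
    return ids  # a file version is then just an ordered list of chunk ids

random.seed(0)
blob_store = {}
v1 = bytes(random.randrange(256) for _ in range(200_000))
v2 = v1[:50_000] + b"PATCHED" + v1[50_000:]   # small insertion mid-file

ids1 = store(chunk(v1), blob_store)
ids2 = store(chunk(v2), blob_store)
print(len(set(ids2) - set(ids1)), "new chunks out of", len(ids2))
# Only the chunk around the edit is new; later chunks re-sync to the same
# boundaries and dedupe. For compressed binaries this falls apart, because a
# small logical change rewrites most of the compressed bytes.
```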
Not really. Git does use delta-based storage for binary files. It might not be as good as it could be for some files (e.g. compressed ones) but that's relatively easy to solve.
The real problem is that Git wants you to have a full copy of every file that has ever existed in the repo. As soon as you add a large file to a repo, it's there forever and can basically never be removed. If you keep editing it, you build up even more permanent data in the repo.
Git is really missing:
1. A way to delete old data.
2. A way for the repo to indicate which data is probably not needed (old large binaries).
3. A way to serve large files efficiently (from a CDN).
Some of these can sort of be done, but it's super janky. You have to proactively add confusing flags etc.