Comment by tombert
1 day ago
Is Git ever going to get proper support for binary files?
I’ve never used it for anything serious but my understanding is that Mercurial handles binary files better? Like it supports binary diffs if I understand correctly.
Any reason Git couldn’t get that?
I'm not sure binary diffs are the problem - e.g. for storing images or MP3s, binary diffs are usually worse than nothing.
I would think that git would need a parallel storage scheme for binaries. Something that does binary chunking and deduplication between revisions, but keeps the same merkle referencing scheme as everything else.
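A rough sketch of what that could look like, assuming content-defined chunking with a toy rolling hash and SHA-256 content addressing (none of this is what git actually does today; chunk sizes and the hash are arbitrary choices for illustration):

```python
import hashlib

def chunk(data: bytes, mask: int = 0xFFF) -> list[bytes]:
    """Content-defined chunking with a toy rolling hash: cut where the low
    bits of the hash are zero, so boundaries follow content rather than
    offsets, and a local edit only disturbs nearby chunks."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        if (h & mask) == 0 and i - start >= 256:   # cut every few KiB, 256 B minimum
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def store(data: bytes, objects: dict[str, bytes]) -> list[str]:
    """Deduplicate: address each chunk by its SHA-256, the same way git
    addresses blobs and trees. The returned id list is the file's recipe."""
    ids = []
    for c in chunk(data):
        oid = hashlib.sha256(c).hexdigest()
        objects.setdefault(oid, c)   # an identical chunk is stored only once
        ids.append(oid)
    return ids
```

Two revisions that differ only locally would then share almost every chunk id, and the list of ids could be referenced from a tree the same way blobs are today.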
> binary chunking and deduplication
Are there many binaries that people would store in git where this would actually help? I assume most files end up with compression or some other form of randomization between revisions making deduplication futile.
Xet uses block-level deduplication.
> for storing images or MP3s, binary diffs are usually worse than nothing
Editing the ID3 tag of an MP3 file or changing the rating metadata of an image will give a big advantage to block-level deduplication. Only a few such cases are needed to more than compensate for the worse-than-nothing inefficiency of binary diffs when there's nothing to deduplicate.
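A crude illustration, using fixed 64 KiB blocks purely as a stand-in (Xet's real chunking is content-defined, not fixed-size):

```python
import hashlib, os

BLOCK = 64 * 1024

def block_ids(data: bytes) -> list[str]:
    """Hash fixed-size blocks; real systems use content-defined chunks."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

# A fake 5 MiB "MP3": random audio data plus 300 trailing bytes standing in
# for an ID3v1-style tag at the end of the file.
audio = os.urandom(5 * 1024 * 1024 - 300)
v1 = audio + b"TAG" + b"old title".ljust(297, b"\x00")
v2 = audio + b"TAG" + b"new title".ljust(297, b"\x00")

a, b = block_ids(v1), block_ids(v2)
print(sum(x == y for x, y in zip(a, b)), "of", len(a), "blocks unchanged")  # 79 of 80
```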
A lot of people use Perforce Helix and others use Plastic SCM. That's been my experience for large binary assets where you still want git-like functionality.
I didn't enjoy using Plastic, but Perforce is OK (not to say that it's perfect - I miss a lot of git stuff). It has no problems with lots of data, though! This article moans about the overhead of a 25 MB PNG file... it's been a long time since I worked on a serious project where the head revision is less than 25 GB. Typical daily churn would be 2.5 GB+.
(It's been even longer since I used SVN in anger, but maybe it could work too. It has file locking, and local storage cost is proportional to the size of the head revision. It was manageable enough with a 2 GB head revision. Metadata access speed was always terrible though, which was tedious.)
SVN should be able to handle large files with no issue, imho.
My understanding is that git's diff algorithms require a file to be segmentable (e.g. text files are split line-wise), and there is no general segmentation strategy for binary blobs.
But a good segmentation only buys better compression and a nicer diff; git could do byte-wise diffs with no issues. So I wonder why git doesn't use customizable segmentation strategies, calling external tools based on file type (e.g. a Rust thingy for Rust files, a PNG thingy for PNG files, and so on).
At worst the tool would return either a single segment for the entire file or the byte-wise split, which would work anyway.
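Something like this hypothetical dispatch, with a fixed-size split as the worst-case fallback (the extensions and segmenters are made up; as far as I know git's existing per-type hook, diff drivers in .gitattributes, only changes how diffs are shown, not how objects are stored):

```python
from typing import Callable

Segmenter = Callable[[bytes], list[bytes]]

def split_lines(data: bytes) -> list[bytes]:
    """Text files: split line-wise, as the ordinary diff does."""
    return data.splitlines(keepends=True)

def split_fixed(data: bytes, size: int = 4096) -> list[bytes]:
    """Worst-case fallback: fixed-size segments -- always valid, just coarse."""
    return [data[i:i + size] for i in range(0, len(data), size)]

# Hypothetical registry that external tools could extend per file type.
SEGMENTERS: dict[str, Segmenter] = {
    ".rs":  split_lines,
    ".txt": split_lines,
    # ".png": split_png_chunks,   # e.g. cut on PNG chunk boundaries
}

def segment(path: str, data: bytes) -> list[bytes]:
    """Pick a segmenter by extension, falling back to the fixed-size split."""
    ext = path[path.rfind("."):] if "." in path else ""
    return SEGMENTERS.get(ext, split_fixed)(data)
```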
A common misconception. git has always used binary deltas for pack files. Consider that git tree objects are themselves not text files, and git needs to efficiently store slightly modified versions of the same tree.
All files in git are binary files.
All deltas between versions are binary diffs.
Git has always handled large (including large binary) files just fine.
What it doesn't like is files where a conceptually minor change changes the entire file, for example compressed or encrypted files.
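A quick way to see the compression problem, using a shared prefix/suffix as a crude stand-in for a real copy/insert delta encoder:

```python
import zlib

def naive_delta(old: bytes, new: bytes) -> int:
    """Bytes of `new` not covered by a shared prefix/suffix with `old` --
    a crude lower bound on what a copy/insert delta would have to ship."""
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1
    s = 0
    while s < min(len(old), len(new)) - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return len(new) - p - s

old = b"".join(b"frame %06d: stable payload bytes\n" % i for i in range(5000))
new = old[:1000] + b"one tiny edit" + old[1000:]

print(naive_delta(old, new))                                # 13 bytes of new data
print(naive_delta(zlib.compress(old), zlib.compress(new)))  # nearly the whole compressed stream
```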
The only somewhat valid complaint is that if someone once committed a large file and it was later deleted (maybe minutes later, maybe years later), it is in the repo and in everyone's clones forever. That applies equally to small and to large files, but large ones have more impact.
That's the whole point of a version control system. To preserve the history, allowing earlier versions to be recreated.
The better solution would be to have better review of changes pushed to the master repo, including having unreviewed changes in separate, potentially sacrificial, repos until approved.