← Back to context

Comment by jiggawatts

1 day ago

What I would love to see in an SCM that properly supports large binary blobs is storing the contents using Prolly trees instead of a simple SHA hash.

Prolly trees are very similar to Merkle trees or the rsync algorithm, but they support mutation and version history retention with some nice properties. For example: you always obtain exactly the same tree (with the same root hash) irrespective of the order of incremental edit operations used to get to the same state.

In other words, two users could edit a subset of a 1 TB file, both could merge their edits, and both will then agree on the root hash without having to re-hash or even download the entire file!

Another major advantage on modern many-core CPUs is that Prolly trees can be constructed in parallel instead of having to be streamed sequentially on one thread.

Then the really big brained move is to store the entire SCM repo as a single Prolly tree for efficient incremental downloads, merges, or whatever. I.e.: a repo fork could share storage with the original not just up to the point-in-time of the fork, but all future changes too.

Git has had a good run. Maybe it’s time for a new system built by someone who learned about DX early in their career, instead of via their own bug database.

If there’s a new algorithm out there that warrants a look…

Can you list some realistic workflows where people would be touching the same huge file but only changing much smaller parts of it?

And yes you can represent a whole repo as a giant tar file, but because the boundaries between hash segments won't line up with your file boundaries you get an efficiency hit with very little benefit. Unless you make it file-aware in which case it ends up even closer to what git already does.

Git knows how to store deltas between files. Making that mechanism more reliable is probably able to achieve more with less.

  • Most Microsoft office documents.

    One of our projects has a UI editor with a 60MB file for nearly everything except images, and people work on different UI flows at the same time.

    • So for office, you're looking at files that are archive formats already. Maybe you could improve that a bit, but because of the compression you wouldn't be able to diff text edits better, just save storage. So it would perform about the same as git already does. And you could make it smarter so the prolly tree works better, but you could also make git smarter in the same way, it's not a prolly tree specific optimization.

      For your UI editor I'd need to understand the format more.

  • Binary database files containing “master data”.

    Merging would require support from the DB engine, however.