Comment by jiggawatts
1 day ago
What I would love to see in an SCM that properly supports large binary blobs is storing the contents using Prolly trees instead of a simple SHA hash.
Prolly trees are very similar to Merkle trees or the rsync algorithm, but they support mutation and version history retention with some nice properties. For example: you always obtain exactly the same tree (with the same root hash) irrespective of the order of incremental edit operations used to get to the same state.
In other words, two users could edit a subset of a 1 TB file, both could merge their edits, and both will then agree on the root hash without having to re-hash or even download the entire file!
Another major advantage on modern many-core CPUs is that Prolly trees can be constructed in parallel instead of having to be streamed sequentially on one thread.
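As a rough illustration of the mechanism (not taken from any particular prolly-tree implementation): split the file with content-defined chunking, then hash the chunks into a tree. The Python below is a simplified one-level sketch with made-up parameter choices; a real prolly tree uses a proper rolling hash and applies the same content-defined splitting recursively to the layers of chunk hashes.

    import hashlib

    MASK = (1 << 12) - 1  # a boundary roughly every 4 KiB on average (illustrative choice)

    def chunks(data: bytes, window: int = 16):
        # Content-defined chunking: declare a boundary wherever the hash of a small
        # sliding window matches a fixed bit pattern, so boundaries depend only on
        # the bytes themselves, not on the order of edits that produced them.
        out, start = [], 0
        for i in range(window, len(data) + 1):
            h = int.from_bytes(hashlib.blake2b(data[i - window:i], digest_size=8).digest(), "big")
            if h & MASK == 0:
                out.append(data[start:i])
                start = i
        if start < len(data) or not out:
            out.append(data[start:])
        return out

    def root_hash(data: bytes) -> str:
        # One-level "tree": hash each chunk, then hash the list of chunk hashes.
        # Two users who end up with identical bytes get identical boundaries and
        # leaf hashes, hence the same root; an incremental implementation only
        # needs to rehash the chunks around the regions that were actually edited.
        leaves = [hashlib.sha256(c).digest() for c in chunks(data)]
        return hashlib.sha256(b"".join(leaves)).hexdigest()

Because boundaries are decided purely by local content, an insertion in one place only reshapes the chunk or two around the edit; every other leaf hash, and therefore most of the tree, stays the same regardless of what sequence of edits produced the final bytes.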
Then the really big brained move is to store the entire SCM repo as a single Prolly tree for efficient incremental downloads, merges, or whatever. I.e.: a repo fork could share storage with the original not just up to the point-in-time of the fork, but all future changes too.
Git has had a good run. Maybe it’s time for a new system built by someone who learned about DX early in their career, instead of via their own bug database.
If there’s a new algorithm out there that warrants a look…
Jujutsu unfortunately doesn't have any story for large files yet (as far as I can tell), but maybe soon ...
That's correct. It's only on our roadmap so far (https://jj-vcs.github.io/jj/latest/roadmap/#better-support-f...).
We have also talked about doing something similar for tree objects in order to better support very large directories (to reduce the amount of data we need to transfer for them) and very deep directories (to reduce the number of roundtrips to the server). I think we have only talked about that on Discord so far (https://discord.com/channels/968932220549103686/969291218347...). It would not be compatible with Git repos, so it would only really be useful to teams outside Google once there's an external jj-native forge that decides to support it (if our rough design is even realistic).
Can you list some realistic workflows where people would be touching the same huge file but only changing much smaller parts of it?
And yes, you can represent a whole repo as a giant tar file, but because the boundaries between hash segments won't line up with your file boundaries, you take an efficiency hit for very little benefit. Unless you make it file-aware, in which case it ends up even closer to what git already does.
Git knows how to store deltas between files. Making that mechanism more reliable would probably achieve more with less.
Most Microsoft Office documents.
One of our projects has a UI editor with a single 60 MB file holding nearly everything except images, and people work on different UI flows at the same time.
So for Office, you're looking at files that are already archive formats. Maybe you could improve on that a bit, but because of the compression you wouldn't be able to diff text edits any better, just save some storage. So it would perform about the same as git already does. And you could make it smarter so the prolly tree works better, but you could also make git smarter in the same way; it's not a prolly-tree-specific optimization.
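To make the compression point concrete (illustrative Python, using zlib as a stand-in for the deflate streams inside an Office zip container): a one-byte edit near the start of the uncompressed data changes nearly all of the compressed bytes that follow, so content-defined chunking over the compressed archive finds very little to share between the two versions.

    import zlib

    base = b"cell,value\n" * 10000
    edited = base.replace(b"value", b"VALUE", 1)  # one tiny edit near the start

    a, b = zlib.compress(base), zlib.compress(edited)
    # Length of the common prefix shared by the two compressed streams.
    prefix = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), min(len(a), len(b)))
    print(len(a), len(b), prefix)  # the streams diverge early, near the edit

Chunking the decompressed archive entries instead would dedupe far better, but that's exactly the kind of format awareness being discussed above, and it isn't specific to prolly trees.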
For your UI editor I'd need to understand the format more.
Binary database files containing “master data”.
Merging would require support from the DB engine, however.