Comment by gavinhoward
8 hours ago
As someone who has thought a lot about VCS design [1] [2], the chunking approach is the wrong one and will still waste space.
[1]: https://gavinhoward.com/uploads/designs/yore.md
[2]: My WIP VCS has been named Yore for at least two years; I did not copy Lore's name.
This is a very long document that says nothing about chunking at first skim. If chunking is actually wrong, then just explain why, here. Wasting space is not actually a problem if it’s optimized for other purposes instead.
When it comes to large assets, wasting large chunks of space is a problem. If your chunks are 64 kib average (from the Lore document), but changes only average 1 kib (which could be a high estimate), then you will still run out of space 64 times faster and need to read 64 times more data off of the disk for certain operations.
It also makes diffing hard, as well as diff viewing.
Seems like if Lore wants to reduce space usage, they could apply something like Git's delta compression (as used in packfiles) to the chunks.
Suppose you make a 1 kB change in a 50 MB file. That causes a 64 kB chunk to be created and stored. Disk space is wasted.
But since the 50 MB file was already stored as a sequence of 64 kB chunks, there is an existing 64 kB chunk that is very similar to your new 64 kB chunk. You can store your new chunk as a delta to that, so only ~1 kB of disk space is used.
Admittedly, it's complicated and inelegant. But it allows both deduplication between files (one of the reasons Lore chose chunks, apparently) and efficient space usage for small changes.
What do you do instead of chunking your snapshots? Storing diffs is usually the other approach.
3 replies →
[flagged]