Comment by nonbirithm
5 years ago
A question I always have is if CDRTs solve some problem with collaborative editing, then can git's merge algorithm be rewritten to use CDRTs and benefit from it somehow?
Somehow I think the answer is no. There is a reason we still have to manually drop down to a diff editor to resolve certain kinds of conflicts after many decades.
I think a better question is “what if merges were more well behaved,” where “well behaved” means they have nice properties like associativity and having the minimal amount of conflict without auto-resolving any cases that should actually be a conflict.
The problem with using a CRDT is the CR part: there are generally merge conflicts in version control for a reason. If your data type isn’t “state of the repo with no conflicts” or “history of the repo and current state with no conflicts” but something like “history of the repo and current state including conflicts from unresolved merges” then maybe that would work but it feels pretty complicated to explain and not very different from regular git. Also note that you need history to correctly merge (if you do a 3-way merge of a history of a file of “add line foo; delete line foo” with a history of “add line foo; delete line foo; add line foo” and common ancestor “add line foo”, you should end with a history equal to the second one I described. But if you only look at the files you will probably end up deleting foo)
See also: darcs and pijul.
It's perfectly CR to encode a merge conflict as another type of data in the CRDT lattice.
For documents, you might represent this as a merged document, with merge conflict markup inside the merged document - similar to using version control and getting a merge conflict.
Also similar to version control, the merge conflict can be a form of data where a valid CRDT operation is for a user to resolve the conflict. When that resolution is merged with other users, unless there's a conflict in the conflict-resolution, everyone sees the merge conflict go away.
Another valid CRDT operation is when a user modifies their version of the document in the region of the merge conflict prior to seeing the conflict, and broadcasts their edit. Then the merge conflict itself would update to absorb the valid change. In some cases, it might even disappear.
In principle, you can build a whole workflow stack on top of this concept, with sign-offs and reviews, just as with version control. I have no idea how well it would behave in practice.
The answer is no, but unlike git, crdts make a choice for you, and all nodes get convergent consistency. The problem heretofore with crdts is that those choices have not been sane. I think there are a recent crop of crdts that are "95% sane" and honestly that's probably good enough. There is an argument that optimal human choices will never be reconciliable with commutativity, which I totally buy, but I think there is also an argument for "let not the perfect be the enemy of the awesome". And having made a choice, even if it's not optimal, is a much firmer ground to build upon than blocking on leaving a merge conflict undecided.
Git mostly treats merging as a line oriented diff problem. Even though you can specify language aware diffing in theory it doesn't seem to buy you much in practice (based on my experience with the C# language-aware diff).
It wouldn't make much sense to me to just plug a text CRDT in place of a standard text diff. CRDTs like automerge are capable of representing more complex tree structures however and if you squint you can sort of imagine a world where merging source code edits was done at something more like the AST level rather than as lines of text.
I've had some ugly merge conflicts that were a mix of actual code changes and formatting changes which git diffs tend not to be much help with. A system that really understood the semantic structure of the code should in theory be able to handle those a lot better.
IDEs have powerful refactoring support these days like renaming class members but source control is ignorant of those things. One can imagine a more integrated system that could understand a rename as a distinct operation and have no trouble merging a rename with an actual code change that touched some code that referenced the renamed thing in many situations. Manual review would probably still be necessary but the automated merge could get it right a much higher percentage of the time.
For specific use cases where the data format is not plain-text and or is formatted as json (e.g. Jupyter notebooks, rich text editing like ProseMirror), I can see CRDTs being used to automatically merge files and retain consistency. Two things though:
1. this doesn't require modifying git's merge algorithm itself; just having custom merge drivers built on top.
2. Is using more space (edit history of CRDTs vs a regular json document) worth it for automatic merging?