Comment by fwip
5 years ago
CRDTs don't guarantee coherence, but instead guarantee consistency.
The result will often be coherent at the sentence level if the edits are normal human edits, but it frequently won't be at the whole-document level.
For a simplistic example: if one person changes a frequently used term throughout the document, and another person uses the old term in a bunch of places while writing new content, the merged document will be semantically inconsistent, even though each user made semantically consistent changes and both are now seeing the same eventually-consistent document.
For a contrived example of local inconsistency, consider the phrase "James had a bass on his wall." Alice rewrites this to "James had a bass on his wall, a trophy from his fishing trip last summer," and Brianna separately chooses "James, being musically inclined, had hung his favorite bass on his wall." The CRDT dutifully applies both edits, and resolves this as: "James, being musically inclined, had hung his favorite bass on his wall, a trophy from his fishing trip last summer."
In nearly any system, the semantics of the data are not completely captured by the available data model. So any automatic conflict-resolution scheme, no matter how smart, can produce semantically nonsensical merges.
CRDTs are very very cool. Too often, though, people think that they can substitute for manual review and conflict resolution.
Right. The problem CRDTs solve is the one behind three-way merge conflicts in git: the "correct" merge is underspecified by the formalism, and so implementation-dependent.
If two different git clients each implemented some automated form of merge-conflict resolution, and each then tried to resolve the same conflicting merge, they might resolve it in different, implementation-dependent ways, producing differing commits. (This is already what happens even without automation: the "implementation" being depended upon is the set of manual, case-by-case choices made by each human.)
CRDTs are data structures that explicitly specify, in the definition of what a conforming implementation would look like, how "merge conflicts" for the data should be resolved. (Really, they specify their way around the data ever coming into conflict — thus "conflict-free" — but it's easier to talk about them resolving conflicts.)
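To make that concrete, here's a minimal sketch of a state-based CRDT, a grow-only counter, with the merge rule baked into the type itself. This is just the textbook construction written out in Python, not any particular library:

    # Minimal state-based CRDT: a grow-only counter (G-Counter).
    # The merge rule is part of the data structure's definition, so any
    # two conforming replicas resolve "conflicts" identically.
    from dataclasses import dataclass, field

    @dataclass
    class GCounter:
        # One monotonically increasing slot per replica id.
        counts: dict[str, int] = field(default_factory=dict)

        def increment(self, replica_id: str, amount: int = 1) -> None:
            self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

        def value(self) -> int:
            return sum(self.counts.values())

        def merge(self, other: "GCounter") -> "GCounter":
            # Element-wise max is commutative, associative, and idempotent,
            # so merge order (and repetition) never changes the result.
            keys = self.counts.keys() | other.counts.keys()
            return GCounter({k: max(self.counts.get(k, 0),
                                    other.counts.get(k, 0)) for k in keys})

    a, b = GCounter(), GCounter()
    a.increment("alice", 3)
    b.increment("brianna", 2)
    assert a.merge(b).value() == b.merge(a).value() == 5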
In the git analogy, you could think of a CRDT as a pair of "data-format aware" algorithms: a merge algorithm, and a pre-commit validation algorithm. The git client would, upon commit, run the pre-commit validation algorithm specific to the file's type, and only actually accept the commit if the modified file remained "mergeable." The client would then, upon merge, hand two of these files to a file-type-specific merge algorithm, which would be guaranteed to succeed assuming both inputs are "mergeable." Which they are, because we only let "mergeable" files into commits.
Such a framework, by itself, doesn't guarantee that anything good or useful will come out the other end of the process. Garbage In, Garbage Out. What it does guarantee is that clients doing the same merge will deterministically generate the same resulting commit. It's up to the designer of each CRDT data structure to specify a useful merge algorithm for it, and it's up to the developer to define their data in terms of a CRDT data structure that has the right semantics.
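And a rough sketch of that validate/merge pair, for a made-up file type that stores a last-writer-wins map as JSON. The file format and function names are invented purely for illustration; the point is only that the merge rule is total and deterministic, so either side doing the merge produces the identical commit:

    # Pre-commit validator plus merge function for a hypothetical file type
    # storing a last-writer-wins map as JSON: {"key": [timestamp, value], ...}.
    import json

    def validate(raw: str) -> bool:
        """Pre-commit check: only 'mergeable' files get into a commit."""
        try:
            data = json.loads(raw)
        except ValueError:
            return False
        return isinstance(data, dict) and all(
            isinstance(v, list) and len(v) == 2 and isinstance(v[0], (int, float))
            for v in data.values()
        )

    def merge(raw_a: str, raw_b: str) -> str:
        """Per key, keep the entry with the higher timestamp; ties are broken
        by the serialized value, so the rule is total and deterministic."""
        a, b = json.loads(raw_a), json.loads(raw_b)
        out = {}
        for key in a.keys() | b.keys():
            candidates = [v for v in (a.get(key), b.get(key)) if v is not None]
            out[key] = max(candidates, key=lambda v: (v[0], json.dumps(v[1])))
        return json.dumps(out, sort_keys=True)

    ours = json.dumps({"title": [1, "Draft"], "owner": [5, "alice"]})
    theirs = json.dumps({"title": [3, "Final"], "owner": [2, "brianna"]})
    assert validate(ours) and validate(theirs)
    # Same resulting commit no matter which client performs the merge.
    assert merge(ours, theirs) == merge(theirs, ours)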
That just sparked a thought.
For a codebase, unit tests could be the pre-commit validation algorithm. Then, as authors continue to edit the code, they both add unit tests and merge their changes. When a merge happens, the tests could be the deciding factor in what emerges.
Of course, unless you have conflicts in the tests themselves.
So CRDTs could be applied to a document plus an edit/change log to guarantee the consistency of the log and its entries, but not necessarily of the document itself?
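Something like this is what I mean, very roughly (all names and details invented): the log itself is the CRDT, a grow-only set of operations replayed in a deterministic order, so every replica converges on the same log and therefore the same text, even though nothing promises the text reads well:

    # The log is a grow-only set of operations, merged by union and replayed
    # in a deterministic order. Replicas converge on the same log, hence the
    # same text; whether the text makes sense is a separate question.
    def merge_logs(log_a: set, log_b: set) -> set:
        # Set union is commutative, associative, and idempotent.
        return log_a | log_b

    def replay(log: set) -> str:
        # Deterministic order: (logical timestamp, author) breaks ties
        # between concurrent edits.
        return " ".join(text for _ts, _author, text in sorted(log))

    base = {(1, "orig", "James had a bass on his wall.")}
    alice = base | {(2, "alice", "It was a trophy from his fishing trip.")}
    brianna = base | {(2, "brianna", "He was musically inclined.")}

    merged = merge_logs(alice, brianna)
    assert replay(merged) == replay(merge_logs(brianna, alice))
    print(replay(merged))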