
Comment by archagon

5 years ago

In the case of a text document, concurrent edits form branches of a tree in many string CRDTs: http://archagon.net/blog/2018/03/24/data-laced-with-history/

So yes, hundreds of people can edit a string and produce a coherent result at the end. Contiguous runs of characters will stick together and interleave with concurrent edits.
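
To make the tree structure concrete, here is a minimal toy sketch in the spirit of the causal-tree approach described in the linked article (the names `Doc`, `insert_after`, and `render` are invented for illustration; a real implementation also handles deletion, tombstones, and far more metadata):

```python
# Toy sequence CRDT: every character gets a globally unique ID and points at
# the character it was typed after, so concurrent edits form branches of a tree.
# Merging is a set union of characters; the visible string is a deterministic
# traversal of the tree.

ROOT = (0, "")  # sentinel ID for "start of document"

class Doc:
    def __init__(self, replica):
        self.replica = replica      # replica/author name, part of every ID
        self.counter = 0
        self.chars = {}             # id -> (parent_id, char)

    def insert_after(self, parent_id, text):
        """Type `text` immediately after the character `parent_id`."""
        last = parent_id
        for ch in text:
            self.counter += 1
            cid = (self.counter, self.replica)
            self.chars[cid] = (last, ch)
            last = cid
        return last                 # ID of the final character typed

    def merge(self, other):
        """State-based merge: union of the character sets (commutative, idempotent)."""
        self.chars.update(other.chars)
        self.counter = max(self.counter, other.counter)

    def render(self):
        """Depth-first walk; siblings ordered by ID so every replica agrees."""
        children = {}
        for cid, (parent, _) in self.chars.items():
            children.setdefault(parent, []).append(cid)
        out = []
        def walk(node):
            for cid in sorted(children.get(node, []), reverse=True):
                out.append(self.chars[cid][1])
                walk(cid)
        walk(ROOT)
        return "".join(out)

a, b = Doc("alice"), Doc("bob")
end = a.insert_after(ROOT, "Hello")
b.merge(a)                          # both replicas start from "Hello"
a.insert_after(end, " world")       # Alice appends after the "o"
b.insert_after(end, ", Bob")        # Bob concurrently appends at the same spot
a.merge(b); b.merge(a)
assert a.render() == b.render()     # converged
print(a.render())                   # "Hello, Bob world": both runs intact
```

Each new character chains off the previous one, so contiguous runs stay together, and concurrent runs interleave only at the point where they diverged.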

CRDTs don't guarantee coherence, but instead guarantee consistency.

The result will often be coherent at the sentence level if the edits are normal human edits, but it often won't be coherent at the whole-document level.

For a simplistic example, if one person changes a frequently-used term throughout the document, and another person uses the old term in a bunch of places when writing new content, the document will be semantically inconsistent, even though all users made semantically consistent changes and are now seeing the same eventually-consistent document.

For a contrived example of local inconsistency, consider the phrase "James had a bass on his wall." Alice rewrites this to "James had a bass on his wall, a trophy from his fishing trip last summer," and Brianna separately chooses "James, being musically inclined, had hung his favorite bass on his wall." The CRDT dutifully applies both edits, and resolves this as: "James, being musically inclined, had hung his favorite bass on his wall, a trophy from his fishing trip last summer."

In nearly any system, the semantics of the data are not completely captured by the available data model. Any automatic conflict-resolution strategy, no matter how smart, can therefore produce semantically nonsensical merges.

CRDTs are very very cool. Too often, though, people think that they can substitute for manual review and conflict resolution.

  • Right. The problem CRDTs solve is the one behind the three-way merge conflict in git: the "correct" merge is underspecified by the formalism, and so ends up being implementation-dependent.

    If two different git clients each implemented some automated form of merge-conflict resolution, and each then tried to resolve the same conflicting merge, the clients might resolve the conflict in different, implementation-dependent ways, resulting in differing commits. (This is already what happens even without automation: the "implementation" being depended upon is the set of manual, case-by-case choices made by each human.)

    CRDTs are data structures that explicitly specify, in the definition of what a conforming implementation would look like, how "merge conflicts" for the data should be resolved. (Really, they specify their way around the data ever coming into conflict — thus "conflict-free" — but it's easier to talk about them resolving conflicts.)

    In the git analogy, you could think of a CRDT as a pair of "data-format aware" algorithms: a merge algorithm, and a pre-commit validation algorithm. The git client would, upon commit, run the pre-commit validation algorithm specific to the file's type, and only actually accept the commit if the modified file remained "mergeable." The client would then, upon merge, hand two of these files to a file-type-specific merge algorithm, which would be guaranteed to succeed assuming both inputs are "mergeable." Which they are, because we only let "mergeable" files into commits.

    Such a framework, by itself, doesn't guarantee that anything good or useful will come out the other end of the process. Garbage In, Garbage Out. What it does guarantee is that clients performing the same merge will deterministically generate the same resulting commit. It's up to the designer of each CRDT data structure to specify a useful merge algorithm for it, and it's up to the developer to define their data in terms of a CRDT data structure that has the right semantics.
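
    To make "deterministic merge" concrete, here's a tiny hypothetical state-based example, a last-writer-wins map (the names and shape are invented for illustration, not taken from any particular library):

    ```python
    def merge(a, b):
        """Merge two replica states; each maps key -> (timestamp, value)."""
        out = dict(a)
        for key, (ts, val) in b.items():
            if key not in out or (ts, val) > out[key]:   # deterministic tie-break
                out[key] = (ts, val)
        return out

    r1 = {"title": (1, "Draft"), "author": (2, "Alice")}
    r2 = {"title": (3, "Final"), "editor": (1, "Bob")}

    # Commutative and idempotent: merge order and repetition don't matter,
    # so every client computes the same result from the same inputs.
    assert merge(r1, r2) == merge(r2, r1) == merge(merge(r1, r2), r2)
    ```

    Whether that merged state is useful is a separate question; the only promise is that everyone computes the same one.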

    • That just sparked a thought.

      For a codebase, unit tests could be the pre-commit validation algorithm. Then, as authors continue to edit the code, they both add unit tests and merge. In the face of a merge, the tests could be the deciding factor in what emerges (see the sketch below).

      Of course, unless you have conflicts in the tests themselves.
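
      Something like this toy sketch (all the names here are invented for illustration): a merge candidate only "wins" if it passes the union of both sides' tests.

      ```python
      def merge_with_tests(candidates, tests):
          """Return the label of the first candidate that passes every test."""
          for label, impl in candidates.items():
              if all(test(impl) for test in tests):
                  return label
          return None  # genuine conflict: nothing satisfies both sides' expectations

      candidates = {
          "ours":   lambda x: x * 2,   # one author's edit
          "theirs": lambda x: x + x,   # a concurrent, equivalent rewrite
      }
      tests = [lambda f: f(2) == 4, lambda f: f(0) == 0]  # tests added by both authors

      print(merge_with_tests(candidates, tests))  # -> "ours"
      ```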

  • So the CRDTs could be applied to a document and an edit/change log to guarantee the consistency of the log and its entries, not necessarily the document itself?

Upvoted for the link alone. This article (data-laced-with-history) is the best source if you are starting your journey into CRDTs.

What if the document starts empty and syncing doesn't happen until everyone presses submit? Will the CRDTs produce a valid document? Yes. Will it make any sense? Who knows. I think that's what OP is getting at.

  • I read it as a question regarding OT vs. CRDTs, which I believe would produce similar results even under heavy concurrency. In terms of larger edits or refactors, you’d probably need to do something else, e.g. lock the document or section, unshare the document, use some sort of higher-level CRDT that ships your changes atomically and forces a manual merge on concurrent edits, etc. None of these necessarily require a central server, though they may require an active session between participants.

    I should also note that even if you use regular merge, and the end state of a text document is a complete mess after a refactor + concurrent edits, there’s enough data in the tree to simply pull out any concurrent contributions. They could then be reapplied manually if needed. Perhaps the app could even notice this automatically and provide an optional UI for this process. Similarly, it would be possible for the concurrent editors to remove the refactor edits and thus “fork” their document.
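
    For instance, since every character in the tree carries its author and a unique ID, pulling one contributor's edits out of the merged state is essentially a filter (a toy sketch with invented names; a real implementation would also have to re-anchor edits built on top of the removed characters):

    ```python
    doc = [  # flattened view of the merged document: (char, author, op_id)
        ("r", "alice", 1), ("a", "alice", 2), ("i", "alice", 3), ("n", "alice", 4),
        ("c", "bob", 5), ("o", "bob", 6), ("a", "bob", 7), ("t", "bob", 8),
    ]

    def contributions(doc, author):
        """Pull one author's characters (with their IDs) out of the merged state."""
        return [(ch, op_id) for ch, who, op_id in doc if who == author]

    def without(doc, author):
        """'Fork': render the document as if that author's edits never happened."""
        return "".join(ch for ch, who, _ in doc if who != author)

    print(contributions(doc, "bob"))   # -> [('c', 5), ('o', 6), ('a', 7), ('t', 8)]
    print(without(doc, "bob"))         # -> 'rain'
    ```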

    • My question was not meant to be OT versus CRDT. Rather, I am questioning expectations for that shared-editing use case.

      Comparing to git (as others have done) is interesting. The expectation there is that any merge is manually tested by the user, so it is not just the git operations at play but all of the supporting activity. That is, the workflow assumes every intermediate state is touched and verified by a person. Where this is skipped, the risk of breakage goes up. (This is why git bisect often fails on projects that don't build at every commit.)

      Same for games. Some machine, pretty much always, gets to set the record straight as to what actually happened. The faster the path to that authority for every edit, the higher the chance of coherence.

      With hundreds of authorities, machine or not, this feels intractable.
