Comment by klabb3

14 days ago

> The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech

I will go against the grain and say CRDTs have been a distraction, and the overfocus on them has been delaying real progress. They are immature and highly complex, and thus hard to debug and understand, and they have extremely limited cross-language support in practice, let alone any indexing or storage engine support.

Yes, they are fascinating and yes, they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it’s necessary like Google Docs or Figma I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke and domain-specific data structures that are inspired by CRDTs.)
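To make "last write wins" concrete, here is a minimal sketch in Python. The class name `LWWRegister` and the use of a `(timestamp, writer_id)` pair as a tie-breaker are assumptions made for illustration, not something taken from a particular library.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class LWWRegister:
    """Hypothetical last-write-wins register.

    Keeps the value whose (timestamp, writer_id) stamp is the largest seen.
    The writer_id only breaks ties between equal timestamps.
    """
    value: Any = None
    stamp: Tuple[int, str] = (0, "")

    def write(self, value: Any, timestamp: int, writer_id: str) -> bool:
        """Apply a write; return True if it replaced the stored value."""
        candidate = (timestamp, writer_id)
        if candidate > self.stamp:
            self.value = value
            self.stamp = candidate
            return True
        # Older (or equal) writes are simply dropped: no merge logic needed.
        return False
```

Writes can arrive in any order and every replica that sees the same set of writes ends up with the same value, which is often all the conflict resolution an application needs.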

Instead, what I believe we need is end-to-end bidirectional stream-based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically, adding async reactivity on the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but feasible, and it delivers lasting tangible value.
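As a rough illustration of what "simple patch/replace data structures" could look like, here is a sketch in Python. The message shapes (`{"kind": "replace", ...}` and `{"kind": "patch", ...}`) and the `Follower` class are hypothetical, chosen only to show the idea of a read replica that reacts to a stream of updates.

```python
from typing import Any, Callable, Dict, List

# Assumed update shapes (illustrative only):
#   {"kind": "replace", "state": {...}}                 full snapshot
#   {"kind": "patch", "set": {...}, "delete": [...]}    incremental change
Update = Dict[str, Any]
State = Dict[str, Any]


def apply_update(state: State, update: Update) -> State:
    """Return a new state with the update applied; the input is not mutated."""
    if update["kind"] == "replace":
        return dict(update["state"])
    if update["kind"] == "patch":
        new_state = dict(state)
        new_state.update(update.get("set", {}))
        for key in update.get("delete", []):
            new_state.pop(key, None)
        return new_state
    raise ValueError(f"unknown update kind: {update['kind']!r}")


class Follower:
    """A read-only replica that follows the owner's update stream."""

    def __init__(self) -> None:
        self.state: State = {}
        self._listeners: List[Callable[[State], None]] = []

    def subscribe(self, listener: Callable[[State], None]) -> None:
        self._listeners.append(listener)

    def receive(self, update: Update) -> None:
        """Apply one update from the stream and notify subscribers."""
        self.state = apply_update(self.state, update)
        for listener in self._listeners:
            listener(self.state)
```

A follower never writes, so there is nothing to merge: a `replace` resets it and a `patch` moves it forward.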

Indeed, the simple approach of "send your operations to the server and it will apply them in the order it receives them" gives you good-enough conflict resolution in many cases.

It is still tempting to turn to CRDTs to solve the next problem: how to apply server-side changes to a client when the client has its own pending local operations. But this can be solved in a fully general way using server reconciliation, which doesn't restrict your operations or data structures like a CRDT does. I wrote about it here: https://mattweidner.com/2024/06/04/server-architectures.html...
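A minimal sketch of that idea in Python, assuming operations are pure functions tagged with an id. The `Server`, `Client`, and `increment` names are hypothetical and this is not the implementation from the linked post; it only shows the shape of "server applies operations in arrival order, client replays its pending operations on top of the latest server state".

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Op = Tuple[str, Callable[[State], State]]  # (op_id, pure state transition)


class Server:
    """Applies operations strictly in the order it receives them."""

    def __init__(self) -> None:
        self.state: State = {}
        self.applied_ids: List[str] = []

    def submit(self, op: Op) -> None:
        op_id, fn = op
        self.state = fn(self.state)
        self.applied_ids.append(op_id)


class Client:
    """Keeps the last known server state plus its own unacknowledged ops."""

    def __init__(self) -> None:
        self.server_state: State = {}
        self.pending: List[Op] = []

    def local_op(self, op: Op) -> None:
        self.pending.append(op)

    def on_server_update(self, server_state: State, applied_ids: List[str]) -> None:
        # Adopt the server's state and drop ops the server has already applied.
        self.server_state = dict(server_state)
        acked = set(applied_ids)
        self.pending = [op for op in self.pending if op[0] not in acked]

    @property
    def view(self) -> State:
        # What the user sees: server state with pending ops replayed on top.
        state = dict(self.server_state)
        for _, fn in self.pending:
            state = fn(state)
        return state


def increment(key: str, by: int = 1) -> Callable[[State], State]:
    """Example operation: add `by` to a counter stored under `key`."""
    def fn(state: State) -> State:
        new_state = dict(state)
        new_state[key] = new_state.get(key, 0) + by
        return new_state
    return fn
```

Nothing here constrains what an operation may do to the state; the only requirement in this sketch is that it can be re-run against a newer server state.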

  • Just got to reading this.

    > how to apply server-side changes to a client when the client has its own pending local operations

    I liked the option of restore and replay on top of the updated server state. I’m wondering when this causes perf issues. The first local changes should propagate fast after, e.g., a network partition, even if the person has queued up a lot of them (say, during a flight).

    Anyway, my thinking is that you can avoid many consensus problems by just partitioning data ownership. The like example is interesting in this way. A like count is an aggregate based on multiple data owners, and everyone else just passively follows with read replication. So thinking in terms of shared write access is the wrong problem description, imo, when in reality each node doing the liking exclusively owns its own “liked posts” data (subject to a limit of one like per post). A server aggregate could exist but is owned by the server, so no shared write access is needed.

    Similarly, say you have a messaging service. Each participant owns their own messages and others follow. No conflict resolution is needed. However, you can still break the protocol (say, by liking twice). Those updates can be considered malformed and, e.g., ignored. In some cases, you can copy someone else’s data and make it your own, for instance to protect against impersonation: say that you can change your own nickname, and others follow. This can be exploited to impersonate, but you can keep a local copy of the last seen nickname and then display a “changed name” warning.

    Anyway, I’m just a layman who wants things to be simple. It feels like CRDTs have been the ultimate nerd-snipe, and when I did my own evaluations I was disappointed with how heavyweight and opaque they were a few years ago (and probably still).
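The ownership-partitioning idea for likes can be sketched in a few lines of Python. `LikeStore` and its method names are hypothetical; the point is only that each user writes to their own set, duplicates are ignored as malformed, and the count is a derived aggregate that nobody writes to directly.

```python
from collections import Counter
from typing import Dict, Set


class LikeStore:
    """Each user exclusively owns their own set of liked posts."""

    def __init__(self) -> None:
        self.liked_by_user: Dict[str, Set[str]] = {}

    def like(self, writer: str, owner: str, post_id: str) -> bool:
        """Record a like; return False if the update is rejected."""
        if writer != owner:
            return False  # only the owner may write to their own data
        posts = self.liked_by_user.setdefault(owner, set())
        if post_id in posts:
            return False  # liking twice breaks the protocol, so ignore it
        posts.add(post_id)
        return True

    def like_counts(self) -> Counter:
        """Aggregate owned by the server, derived from the per-user sets."""
        counts: Counter = Counter()
        for posts in self.liked_by_user.values():
            counts.update(posts)
        return counts
```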

> Yes, they are fascinating and yes, they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it’s necessary like Google Docs or Figma I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke and domain-specific data structures that are inspired by CRDTs.)

I agree with this. CRDTs are cool tech, but I think in practice most folks would be surprised by the high percentage of use cases that can be solved with much simpler conflict resolution mechanisms (perhaps combined with server reconciliation, as Matt mentioned). I also agree that collaborative document editing is a niche where CRDTs are indeed very useful.

> I believe we need is end-to-end bidirectional stream based data communication

I suspect the generalized solution is much harder to achieve, and looks more like batch-based reconciliation of full snapshots than streaming or event-driven.

The challenge arises when you aim to sync data sources whose managing parties are not incentivized to provide robust sync. Consider Dropbox or similar, where a single party manages the data set and all the software (server and clients); or ecosystems like Salesforce and Mulesoft, which have this as a stated business goal; or ecosystems like blockchains, where independent parties are still highly incentivized to coordinate and have technically robust mechanisms to accomplish it, like Merkle trees and similar. You can achieve sync in those scenarios because independent parties are incentivized to coordinate (or there is only one party).

But if you have two or more independent systems, all of which provide some kind of API or import/export mechanism, you can never guarantee those systems will stay in sync using a streaming or event-driven approach. Worse, those systems will inevitably drift out of sync or, even worse, will propagate incorrect data across multiple systems, which can then only be reconciled by batch-like point-in-time snapshots. That raises the question of why use streaming if you ultimately need batch to make it work reliably.

Put another way, people say batch is a special case of streaming, so just use streaming. But you could also say streaming is a fragile form of sync, so just use sync. But sync is a special case of batch, so just use batch.
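For what it's worth, the batch-style reconciliation of point-in-time snapshots described above can be sketched like this in Python. It assumes each snapshot is a plain mapping from record id to record value and that one side is treated as the source of truth; `diff_snapshots` and `reconcile` are hypothetical names for illustration.

```python
from typing import Any, Dict, List, Tuple

Snapshot = Dict[str, Any]  # record id -> record value


def diff_snapshots(
    source: Snapshot, target: Snapshot
) -> Tuple[Snapshot, Snapshot, List[str]]:
    """Compute what must change in `target` to match `source`."""
    to_create = {k: v for k, v in source.items() if k not in target}
    to_update = {
        k: v for k, v in source.items() if k in target and target[k] != v
    }
    to_delete = [k for k in target if k not in source]
    return to_create, to_update, to_delete


def reconcile(source: Snapshot, target: Snapshot) -> Snapshot:
    """Return a copy of `target` brought in line with `source`."""
    to_create, to_update, to_delete = diff_snapshots(source, target)
    result = dict(target)
    result.update(to_create)
    result.update(to_update)
    for key in to_delete:
        del result[key]
    return result
```

Because it compares full state rather than trusting that every event arrived, this repairs drift regardless of how the two sides diverged, at the cost of reading both snapshots in full.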

> In the rare cases it’s necessary like Google Docs or Figma I would be very surprised if they use off-the-shelf CRDT libs

Or CRDTs at all. Google Docs is based on operational transforms and Figma on what they call multiplayer technology.