Comment by codeulike

14 days ago

I've been thinking about this a lot - nearly every problem these days is a synchronisation problem. You're regularly downloading something from an API? That's a sync. You've got a distributed database? Sync problem. Cache invalidation? Basically a sync problem. You want online and offline functionality? Sync problem. Collaborative editing? Sync problem.

And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).

The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech which essentially restrict your data, and the rules for dealing with conflicts, to situations that are known to be resolvable, and then package it all up into an object.

> The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech

I will go against the grain and say CRDTs have been a distraction, and the overfocus on them has been delaying real progress. They are immature and highly complex, and thus hard to debug and understand, and they have extremely limited cross-language support in practice - let alone any indexing or storage-engine support.

Yes, they are fascinating, and yes, they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last-write-wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)
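
To make the last-write-wins option concrete, here is a minimal sketch (made-up names, not anything the comment prescribes): a timestamped register whose merge just keeps the newest write.

    // Last-write-wins register: keep the value with the newest
    // (timestamp, nodeId) pair; nodeId breaks timestamp ties deterministically.
    type Stamped<V> = { value: V; ts: number; nodeId: string };

    function lwwMerge<V>(a: Stamped<V>, b: Stamped<V>): Stamped<V> {
      if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
      return a.nodeId > b.nodeId ? a : b; // deterministic tie-break
    }

    // Merging is commutative, associative, and idempotent, so replicas can
    // apply updates in any order and still converge on the same value.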

Instead, what I believe we need is end-to-end bidirectional, stream-based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically, adding async reactivity on the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but it is feasible and delivers lasting tangible value.
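
For a sense of what those patch/replace structures could look like, here is a sketch (a hypothetical wire format, not a real protocol): a subscription delivers a full snapshot first, then incremental patches.

    // Hypothetical reactive-read protocol: "replace" carries a full snapshot
    // (first load or resync), "patch" carries incremental row changes.
    type Update<Row> =
      | { kind: "replace"; rows: Row[] }
      | { kind: "patch"; upserts: Row[]; deletedIds: string[] };

    function applyUpdate<Row extends { id: string }>(
      state: Map<string, Row>,
      update: Update<Row>,
    ): void {
      if (update.kind === "replace") {
        state.clear();
        for (const row of update.rows) state.set(row.id, row);
      } else {
        for (const row of update.upserts) state.set(row.id, row);
        for (const id of update.deletedIds) state.delete(id);
      }
    }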

  • Indeed, the simple approach of "send your operations to the server and it will apply them in the order it receives them" gives you good-enough conflict resolution in many cases.

    It is still tempting to turn to CRDTs to solve the next problem: how to apply server-side changes to a client when the client has its own pending local operations. But this can be solved in a fully general way using server reconciliation, which doesn't restrict your operations or data structures like a CRDT does. I wrote about it here: https://mattweidner.com/2024/06/04/server-architectures.html...
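
    A rough sketch of server reconciliation as I understand the linked post (my paraphrase, with made-up names): the client keeps the last server-confirmed state plus a queue of unacknowledged local ops, and on every server message it rewinds and replays.

        // Client-side server reconciliation (schematic).
        type State = Map<string, unknown>;
        type Op = (s: State) => State;

        let serverState: State = new Map(); // last state confirmed by server
        let pending: Op[] = [];             // local ops awaiting acknowledgement
        let view: State = serverState;      // what the UI renders

        function applyLocal(op: Op): void {
          pending.push(op);
          view = op(view); // optimistic update
        }

        function onServerOp(op: Op, isAckOfOurOldestOp: boolean): void {
          serverState = op(serverState);
          if (isAckOfOurOldestOp) pending.shift(); // confirmed; stop replaying it
          // Rewind to server state, then replay remaining local ops in order.
          view = pending.reduce((s, p) => p(s), serverState);
        }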

    • Just got to reading this.

      > how to apply server-side changes to a client when the client has its own pending local operations

      I liked the option of restore-and-replay on top of the updated server state. I'm wondering when this causes perf issues? For one, local changes should propagate fast after e.g. a network partition, even if the person has queued up a lot of them (say, during a flight).

      Anyway, my thinking is that you can avoid many consensus problems by just partitioning data ownership. The 'like' example is interesting in this way. A like count is an aggregate based on multiple data owners, and everyone else just passively follows with read replication. So thinking in terms of shared write access is the wrong problem description, imo, when in reality 'liked posts' is data exclusively owned by all the different nodes doing the liking (subject to a limit of one like per post). A server-side aggregate could exist, but it is owned by the server, so no shared write access is needed.
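
      A sketch of that ownership split (hypothetical shapes, my own illustration): each node exclusively owns its set of liked post ids, and the count is a derived, server-owned aggregate that readers merely follow.

          // Each user exclusively owns their liked-post set; no shared writes.
          type UserLikes = { userId: string; likedPostIds: Set<string> };

          // The server owns the derived aggregate; readers passively follow it.
          function likeCount(postId: string, replicas: UserLikes[]): number {
            let count = 0;
            for (const u of replicas) if (u.likedPostIds.has(postId)) count++;
            return count; // a Set enforces "at most one like per post" per user
          }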

      Similarly, say you have a messaging service. Each participant owns their own messages, and others follow. No conflict resolution is needed. However, you can still break the protocol (say, by liking twice). Such updates can be considered malformed and e.g. ignored. In some cases, you can copy someone else's data and make it your own, for instance to protect against impersonation: say that you can change your own nickname, and others follow. This can be exploited to impersonate, but you can keep a local copy of the last-seen nickname and then display a 'changed name' warning.

      Anyway, I'm just a layman who wants things to be simple. It feels like CRDTs have been the ultimate nerd-snipe; when I did my own evaluations a few years ago, I was disappointed with how heavyweight and opaque they were (and they probably still are).

  • > Yes, they are fascinating, and yes, they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last-write-wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)

    I agree with this. CRDTs are cool tech, but I think in practice most folks would be surprised by the high percentage of use cases that can be solved with much simpler conflict-resolution mechanisms (perhaps combined with server reconciliation, as Matt mentioned). I also agree that collaborative document editing is a niche where CRDTs are indeed very useful.

  • > what I believe we need is end-to-end bidirectional, stream-based data communication

    I suspect the generalized solution is much harder to achieve, and looks more like batch-based reconciliation of full snapshots than streaming or event-driven updates.

    The challenge comes when you aim to sync data sources whose managing parties are not incentivized to provide robust sync. Consider Dropbox or similar, where a single party manages the data set and all software (server and clients); or ecosystems like Salesforce and Mulesoft, which have this as a stated business goal; or blockchains, where independent parties are still highly incentivized to coordinate and have technically robust mechanisms to accomplish it, like Merkle trees. You can achieve sync in those scenarios because independent parties are incentivized to coordinate (or there is only one party).

    But if you have two or more independent systems, each providing some kind of API or import/export mechanism, you can never guarantee those systems will stay in sync using a streaming or event-driven approach. Worse, those systems will inevitably drift out of sync; worse still, they will propagate incorrect data across multiple systems, which can then only be reconciled by batch-like point-in-time snapshots. That raises the question of why you would use streaming if you ultimately need batch to make it work reliably.

    Put another way, people say batch is a special case of streaming, so just use streaming. But you could also say streaming is a fragile form of sync, so just use sync. But sync is a special case of batch, so just use batch.
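
    To illustrate what that batch reconciliation might look like, here is a toy sketch (my own, not any particular product): diff two point-in-time snapshots keyed by id and emit the corrections.

        // Point-in-time reconciliation: compute the writes needed to make
        // `target` match `source`, using JSON serialization as a crude
        // deep-equality check.
        function reconcile<V>(
          source: Map<string, V>,
          target: Map<string, V>,
        ): { upsert: Array<[string, V]>; remove: string[] } {
          const upsert: Array<[string, V]> = [];
          const remove: string[] = [];
          for (const [key, value] of source) {
            if (JSON.stringify(target.get(key)) !== JSON.stringify(value)) {
              upsert.push([key, value]);
            }
          }
          for (const key of target.keys()) {
            if (!source.has(key)) remove.push(key);
          }
          return { upsert, remove };
        }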

  • > In the rare cases it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs

    Or CRDTs at all. Google Docs is based on operational transforms and Figma on what they call multiplayer technology.

I agree! Lots more things are sync problems. Also: the state of my source files -> my compiler (in watch mode), and about 20 different kernel APIs - from keyboard state to filesystem watching to process monitoring to connected USB devices.

Also, HTTP caching is sort of a special case of sync - the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there's no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Fan-out invalidation would be way more efficient than a keep-alive header if we had a way to do it!
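
Concretely, the best the cache can do today is poll with conditional requests (standard HTTP, sketched below); there's no standard channel for the origin to push "this resource changed":

    // Polling revalidation with a conditional GET: a 304 means the cached
    // copy is still fresh; a 200 means it changed and we refetch the body.
    async function revalidate(url: string, etag?: string) {
      const res = await fetch(url, {
        headers: etag ? { "If-None-Match": etag } : {},
      });
      if (res.status === 304) return { changed: false, etag };
      return {
        changed: true,
        etag: res.headers.get("ETag") ?? undefined,
        body: await res.text(),
      };
    }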

CRDTs are cool tech. (I would know - I've been playing with them for years.) But I think it's worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (e.g. the database, the kernel, the web server) and other devices live downstream of that owner. Shared data needs more complex systems - e.g. everyone in the network has a copy of the data and can make changes, and then it's all eventually consistent. Or Raft/Paxos. Think git, or a distributed database. And they can be combined - e.g. the app server is downstream of a distributed database; GitHub Actions is downstream of a git repo.

I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.

  • And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.

    In most cases, the easiest approach there is just "slap a blockchain on it", as a good and modern (think Ethereum, not Bitcoin) blockchain essentially "abstracts away" the decentralization and mostly acts like a centralized computer to higher layers.

    That is certainly not the only viable approach, and I wish we looked at others more. For example, a decentralized DNS-like system, without an attached cryptocurrency, but with global consensus on what a given name points to, would be extremely useful. I'm not convinced that such a thing is possible - you need some way of preventing one bad actor from grabbing all the names, and monetary cost seems like the easiest deterrent - but we should be looking in this direction a lot more.

    • > And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.

      In my mind, this is just the second category again. It’s just a shared data system, except with data validation & Byzantine fault tolerance requirements.

      It’s a surprisingly common and thorny problem. For example, I could change my local git client to generate invalid / wrong hashes for my commits. When I push my changes, other peers should - in some way - reject them. PVH (of Ink&Switch) has a rule when thinking about systems like this. He says you’re free to deface your own copy of the US constitution. But I don’t have to pull your changes.

      Access control makes the BFT problem much worse. The classic problem is that if two admins concurrently remove each other, it's not clear what happens. In a CRDT (or git), peers are free to backdate their changes to any arbitrary point in the past. If you try to implement user roles on top of a CRDT, it's a nightmare. I think CRDTs are just the wrong tool for thinking about access control.

  • I can't wait to read that blog post. I know you're an expert in this and respect your views.

    One thing I think is missing in the discussion about shared data (and maybe you can correct me) is that there are two ways of looking at the problem:

    * The "math/engineering" way, where once state is identical, you are done!

    * The "product manager" way, where you have reasonable-sounding requests like "I was typing in the middle of a paragraph, then someone deleted that paragraph, and my text was gone! It should be its own new paragraph in the same place."

    Literally having identical state (or even identical state that adheres to a schema) is hard enough, but I'm not aware of techniques that ensure 1) identical state, 2) adherence to a schema, and 3) behavior that anyone on the team can easily modify in response to "PM-like" demands without being a sync expert.

> And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates, etc.).

I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.

CRDTs work well for linear data structures, but there are known issues with hierarchical ones. For instance, if you have a tree, two clients could concurrently send transactions that, taken together, would make a node an ancestor of itself.

That said, work has been done towards fixing some of those issues.

Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:

https://martin.kleppmann.com/papers/move-op.pdf
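
The core rule in Kleppmann's paper, roughly (a simplified sketch of the idea, omitting the paper's undo/redo machinery for applying moves out of order): agree on a total order of move operations, and treat any move that would make a node an ancestor of itself as a no-op.

    // Simplified move-op rule: skip a move if the destination parent is the
    // moved node itself or one of its descendants (which would form a cycle).
    type Tree = Map<string, string>; // childId -> parentId

    function isDescendant(tree: Tree, node: string, maybeDescendant: string): boolean {
      for (let p = tree.get(maybeDescendant); p !== undefined; p = tree.get(p)) {
        if (p === node) return true;
      }
      return false;
    }

    function applyMove(tree: Tree, node: string, newParent: string): void {
      if (node === newParent || isDescendant(tree, node, newParent)) return; // no-op
      tree.set(node, newParent);
    }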

  • Martin Kleppmann, in one of his recent talks about the future of local-first, mentions the need for a generic sync service for what he calls the 'local-first end-game' [0]. Standardization is needed. Right now, everyone and their mother is doing sync differently, building production platforms around their own protocols and mechanisms.

    [0] https://www.youtube.com/watch?v=NMq0vncHJvU&t=1016s

    • The problem is that the requirements can be vastly different. A collaborative editor is very different from, say, syncing encrypted blobs. Perhaps there is a one-size-fits-all solution, but I doubt it.

      I've been working on sync for the latter use case for a while and CRDTs would definitely be overkill.

  • Automatic conflict resolution will always be limited. For example, who seriously believes that we'll ever be able to fully automate the handling of merge conflicts in version control? (Even if we recorded every single edit operation at the syntax-tree level.) And in regular documents the situation is worse, because you don't have formal parsers and type checkers and unit tests for them. Even for schematized structured data, there are similar issues at the semantic level that a mere "it conforms to the schema" doesn't solve.

  • As long as all clients agree on the order of CRDT operations, cycles are no problem. A cycle-creating update is just an invalid transaction that can be dropped. Invalid or contradictory updates can always happen (regardless of sync mechanism), and the resolution is a UX issue. In some cases you might want to inform the user; in others, the user can choose how to resolve the conflict; in still others, quiet failure is fine.

    • Unfortunately, a hard constraint of (state-based) CRDTs is that merging causally concurrent changes must be commutative - i.e. it is possible that clients will not be able to agree on the order of CRDT operations, and they must still arrive at the same state after applying them in any order.
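
      The standard textbook illustration of that constraint (my example, not from the thread): a grow-only counter whose merge is an element-wise max - commutative, associative, and idempotent, so replicas converge whatever the merge order.

          // G-Counter: one slot per replica, merged by element-wise max.
          type GCounter = Record<string, number>; // replicaId -> local count

          function mergeCounters(a: GCounter, b: GCounter): GCounter {
            const out: GCounter = { ...a };
            for (const [id, n] of Object.entries(b)) {
              out[id] = Math.max(out[id] ?? 0, n);
            }
            return out;
          }

          // The total is the sum over replicas; any merge order yields it.
          const total = (c: GCounter) =>
            Object.values(c).reduce((sum, n) => sum + n, 0);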

I've looked at CRDTs, and the concept really appeals to me in the general case, but in specific cases my design always ends up being "keep all the facts" about a particular item. That just defers the problem to 'which facts can I throw away?'. It's like inventing a domain-specific GC.

I'd love to hear about any success cases people have had with CRDTs.

  • There was an article on this website not so long ago about using CRDTs for collaborative editing, and it had a silly example showing how leaky this abstraction can be. What if you have the word "color", one user replaces it with "colour", and another deletes the word - what does the CRDT do in this case? Well, it merges these two edits into "u". This sort of thing makes me skeptical of using CRDTs for user-facing applications.

    • There isn’t a monolithic “CRDT” in the way you’re describing. CRDTs are, broadly, a kind of data structure that allows clients to eventually agree on a final state without coordination. An integer `max` function is a simple example of a CRDT.

      The behavior the article found is peculiar to the particular CRDT algorithms they looked at. But they’re probably right that it’s impossible for all conflicting edits to “just work” (in general, not just with CRDTs). That doesn’t mean CRDTs are pointless; you could imagine an algorithm that attempts to detect such semantic conflicts so the application can present some sort of resolution UI.

      Here’s the article, if interested (it’s very good): https://www.moment.dev/blog/lies-i-was-told-pt-1
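
      Spelling out that `max` example (trivial, but it genuinely satisfies the convergence requirements):

          // A max-register: merge two replicas by taking the larger value.
          // max is commutative, associative, and idempotent, so replicas
          // converge on the largest value ever written, in any merge order.
          const mergeMax = (a: number, b: number): number => Math.max(a, b);

          // e.g. mergeMax(mergeMax(3, 7), 5) === mergeMax(5, mergeMax(7, 3)) === 7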

  • It's still early, but we have a checkpointing system that works very well for us. And once you have checkpoints, you can start dropping inconsequential transactions between checkpoints, which - you're right - can be considered GC. However, checkpointing is desirable anyway; otherwise new users have to replay the transaction log from T=0 when they join, and that's impractical.
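
    A sketch of that shape (hypothetical names, my own illustration): persist a snapshot at some transaction sequence number; new clients load the snapshot and replay only the log suffix, and anything at or before the checkpoint becomes eligible for GC.

        // Checkpointing: start from the latest snapshot instead of replaying
        // the transaction log from T=0.
        type State = Record<string, unknown>;
        type Tx = { seq: number; apply: (s: State) => State };
        type Checkpoint = { seq: number; state: State };

        function loadState(cp: Checkpoint, log: Tx[]): State {
          return log
            .filter((tx) => tx.seq > cp.seq)
            .reduce((s, tx) => tx.apply(s), cp.state);
        }

        // Once `cp` is durably stored, log entries with seq <= cp.seq can be
        // dropped - the domain-specific GC mentioned above.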

    • I've also had success with this method. "domain-specific GC" is a fitting term.

  • For me the main issue with CRDTs is that they have a fixed merge algorithm baked in - if you want to change how conflicts get resolved, you have to change the whole data structure.

    • I feel like the state of the art here is slowly starting to change. For too many years, CRDT work treated "conflict-free" as manifest destiny rather than a hope and a prayer, assuming the right fixed merge algorithm would be found for every situation. I came to CRDTs from the perspective of source control, with a strong inkling that "data is always messy" and "conflicts are human" (conflicts are kind of inevitable in any structure trying to encode data made by people).

      I've been thinking for a while that it is probably about time the industry renamed that first C to something other than "conflict-free". There is no freedom from conflicts. There's conflict resistance, sure, and CRDTs' various data structures can provide a lot of it. But at the end of the day, if the data structure is meant to encode an application for humans, it needs every merge tool, review tool, and audit tool it can offer to deal with those conflicts.

      I think we're finally starting to see some light at the end of the tunnel in the major CRDT efforts; we're leaving the detour of "no, it must be conflict-free, we named it that so it must be true". I don't think any one library delivers it at a good high level yet, but I have a feeling that "one of the next libraries" may start getting the ergonomics of conflict handling right.

    • I've been running into this with automated regex edits. Our product (Relay [0]) makes Obsidian real-time collaborative using yjs, but I've been fighting with the automated process that rewrites markdown links within notes.

      The issue happens when a file is renamed by one client, and then all other clients pick up the rename and make the change to the local files on disk. Since every edit is broken down into delete/keep/insert runs, the automated process runs rapidly in all clients and can break the links.

      I could limit the edits to just one client, but that feels clunky. Another thought I've had is to use ytext annotations, or to also store a ymap of the link metadata and only apply updates if they pass some kind of check (kind of like schema validation for objects).

      If anyone has a good mental model for modeling automated operations (especially find/replace) in ytext please let me know! (email in bio).

      [0] https://system3.md/relay
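
      Not a full mental model, but one way to shrink the blast radius of automated rewrites (a sketch - `Y.Doc#transact` and `Y.Text#insert`/`delete` are real yjs APIs, the diffing logic is my own): trim the common prefix and suffix so each rewrite becomes a single small delete+insert inside one transaction.

          import * as Y from "yjs";

          // Replace ytext's content with `next` via the smallest possible edit:
          // skip the shared prefix/suffix, then one delete + one insert.
          function minimalReplace(doc: Y.Doc, ytext: Y.Text, next: string): void {
            const prev = ytext.toString();
            let start = 0;
            while (start < prev.length && start < next.length && prev[start] === next[start]) {
              start++;
            }
            let endPrev = prev.length;
            let endNext = next.length;
            while (endPrev > start && endNext > start && prev[endPrev - 1] === next[endNext - 1]) {
              endPrev--;
              endNext--;
            }
            doc.transact(() => {
              if (endPrev > start) ytext.delete(start, endPrev - start);
              if (endNext > start) ytext.insert(start, next.slice(start, endNext));
            });
          }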

Absolutely. My current product relies heavily on a handful of partner systems, adds an opinionated layer on top of them, and propagates data to CRM, DW, and other analytical systems.

One early insight was that we needed a representation of partner data in our database (and the downstream systems need a representation of our opinionated view as well). This is clearly an (eventually consistent) synchronization problem.

We also realized that we often fail to sync (due to bugs, timing, or whatever) and need a regular process to resync data.

We've ended up with a homegrown framework that does both things, such that the same business logic gets used in both cases. This also makes it easy to backfill data if a chosen representation changes.

We're now on the third or fourth iteration of this system and I'm pretty happy with it.

  • Once you add a periodic resync, you have moved the true synchronization away from the online (eventually consistent) sync and into the batch resync. At that point, the online synchronization is just a performance optimization on top of the batch resync.

    I've been in that situation a lot, and I'd always carefully consider whether you even need the online synchronization at that point. It's pretty rarely required.

    • In our case it absolutely is. There are user-facing flows that require data from partner systems to complete, and waiting for the next sync cycle isn't a good UX.

UI is also a sync problem if you squint a bit. React-like systems are, in a sense, an attempt at a sync engine between model and view.

Multiplayer games too.