Comment by josephg

14 days ago

I agree! Lots more things are sync. Also: the state of my source files -> my compiler (in watch mode), and about 20 different APIs in the kernel - from keyboard state to filesystem watching to process monitoring to connected USB devices.

Also, HTTP caching is sort of a special case of sync - where the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there’s no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep-alive header if we had a way to do it!
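To make the stale-read and wasted-poll trade-off concrete, here is a minimal sketch in Python. It is not how nginx works; `Origin`, `PollingCache` and `PushCache` are hypothetical names, and the `subscribe` callback stands in for the notification channel that HTTP does not provide.

```python
from typing import Callable, Dict, List, Tuple


class Origin:
    """Hypothetical backend that owns the resources and can notify subscribers."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._subscribers: List[Callable[[str, str], None]] = []
        self.fetches = 0  # number of times a cache had to ask the origin

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self._subscribers.append(callback)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        for notify in self._subscribers:
            notify(key, value)  # fan-out: push the change downstream

    def get(self, key: str) -> str:
        self.fetches += 1
        return self._data[key]


class PollingCache:
    """TTL-style cache: serves its copy until it is at least `ttl` old."""

    def __init__(self, origin: Origin, ttl: int) -> None:
        self._origin = origin
        self._ttl = ttl
        self._entries: Dict[str, Tuple[str, int]] = {}  # key -> (value, fetched_at)

    def get(self, key: str, now: int) -> str:
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self._ttl:
            entry = (self._origin.get(key), now)  # refetch, changed or not
            self._entries[key] = entry
        return entry[0]


class PushCache:
    """Cache kept in sync by notifications (assumes the origin can send them)."""

    def __init__(self, origin: Origin) -> None:
        self._entries: Dict[str, str] = {}
        origin.subscribe(self._on_change)

    def _on_change(self, key: str, value: str) -> None:
        self._entries[key] = value

    def get(self, key: str) -> str:
        return self._entries[key]
```

With a TTL of 10, a write at time 1 stays invisible to `PollingCache` until time 10 (a stale read), and every later expiry costs a fetch even when nothing changed. `PushCache` sees the write immediately and never calls `Origin.get`.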

CRDTs are cool tech. (I would know - I’ve been playing with them for years). But I think it’s worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (eg the database, the kernel, the web server) and other devices live downstream of that owner. Shared data sources have more complex systems - eg everyone in the network has a copy of the data and can make changes, then it’s all eventually consistent. Or Raft / Paxos. Think git, or a distributed database. And they can be combined - eg, the app server is downstream of a distributed database. GitHub Actions is downstream of a git repo.
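A rough sketch of the two categories, with hypothetical names throughout. For owned data, a single `OwnedLog` accepts writes and a `Follower` replays whatever it has not seen yet. For shared data, every replica accepts writes and replicas merge with each other; the example uses a grow-only set, one of the simplest CRDTs, purely as an illustration.

```python
from typing import Dict, List, Set, Tuple


class OwnedLog:
    """Owned data: a single writer appends ops; everyone else is downstream."""

    def __init__(self) -> None:
        self._ops: List[Tuple[str, str]] = []

    @property
    def version(self) -> int:
        return len(self._ops)

    def set(self, key: str, value: str) -> None:
        self._ops.append((key, value))

    def changes_since(self, version: int) -> List[Tuple[str, str]]:
        return self._ops[version:]


class Follower:
    """Downstream copy: never writes, only replays the owner's ops in order."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.version = 0

    def catch_up(self, owner: OwnedLog) -> None:
        for key, value in owner.changes_since(self.version):
            self.state[key] = value
            self.version += 1


class GrowOnlySet:
    """Shared data: any replica may add; merging is set union."""

    def __init__(self) -> None:
        self.items: Set[str] = set()

    def add(self, item: str) -> None:
        self.items.add(item)

    def merge(self, other: "GrowOnlySet") -> None:
        # Union is commutative, associative and idempotent, so replicas
        # converge no matter how often or in what order they merge.
        self.items |= other.items
```

The follower only needs to remember one number (the last version it applied). The shared case has no such ordering to lean on, which is why the merge function has to do all the work.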

I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.

And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.

In most cases, the easiest approach there is just "slap a blockchain on it", as a good and modern (think Ethereum, not Bitcoin) blockchain essentially "abstracts away" the decentralization and mostly acts like a centralized computer to higher layers.

That is certainly not the only viable approach, and I wish we looked at others more. For example, a decentralized DNS-like system, without an attached cryptocurrency, but with global consensus on what a given name points to, would be extremely useful. I'm not convinced that such a thing is possible: you need some way of preventing one bad actor from grabbing all the names, and monetary compensation seems like the easiest one. Still, we should be looking in this direction a lot more.

  • > And then there's the third super-special category of shared data with no central server, and where only certain users should be allowed to perform certain operations. This comes up most often in p2p networks, censorship resistance etc.

    In my mind, this is just the second category again. It’s just a shared data system, except with data validation & Byzantine fault tolerance requirements.

    It’s a surprisingly common and thorny problem. For example, I could change my local git client to generate invalid / wrong hashes for my commits. When I push my changes, other peers should - in some way - reject them. PVH (of Ink&Switch) has a rule when thinking about systems like this. He says you’re free to deface your own copy of the US constitution. But I don’t have to pull your changes.

    Access control makes the BFT problem much worse. The classic problem is that if two admins concurrently remove each other, it’s not clear what happens. In a CRDT (or git), peers are free to backdate their changes to any arbitrary point in the past. If you try and implement user roles on top of a CRDT, it’s a nightmare. I think CRDTs are just the wrong tool for thinking about access control.
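The "two admins concurrently remove each other" case can be shown in a few lines. This is a toy model, not any real CRDT library: `apply_removals` and the `(actor, target)` op shape are assumptions made for illustration. The rule is the obvious one: a removal only counts if the actor is still an admin when it is applied.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical op shape: (actor, target) means "actor removes target as admin".
Removal = Tuple[str, str]


def apply_removals(admins: Iterable[str], ops: Iterable[Removal]) -> FrozenSet[str]:
    """Apply removals in the order given; only current admins may remove."""
    current = set(admins)
    for actor, target in ops:
        if actor in current:
            current.discard(target)
    return frozenset(current)


initial = {"alice", "bob"}
alice_removes_bob = ("alice", "bob")
bob_removes_alice = ("bob", "alice")

# The two ops were made concurrently, so peers may receive them in either order.
replica_1 = apply_removals(initial, [alice_removes_bob, bob_removes_alice])
replica_2 = apply_removals(initial, [bob_removes_alice, alice_removes_bob])

print(sorted(replica_1))  # ['alice']
print(sorted(replica_2))  # ['bob']
```

Both replicas saw the same two operations and ended up with different admins, so the rule does not converge on its own. Some extra mechanism (a deterministic tie-break, or real consensus) has to decide which removal happened "first", and a peer that can backdate its changes can try to game whatever ordering is chosen.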

I can't wait to read that blog post. I know you're an expert in this and respect your views.

One thing I think that is missing in the discussion about shared data (and maybe you can correct me) is that there are two ways of looking at the problem:

* The "math/engineering" way, where once state is identical you are done!
* The "product manager" way where you have reasonable-sounding requests like "I was typing in the middle of a paragraph, then someone deleted that paragraph, and my text was gone! It should be its own new paragraph in the same place."

Literally having identical state (or even identical state that adheres to a schema) is hard enough, but I'm not aware of techniques that ensure 1) identical state, 2) adherence to a schema, and 3) behavior that anyone on the team can easily modify in response to "PM-like" demands without being a sync expert.
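The gap between the two views can be sketched with the deleted-paragraph example. Both merge functions below are deterministic, so every replica that uses the same one ends up with identical state; they differ only in whether the typist's text survives. Everything here (`Paragraph`, `Insert`, the `-rescued` id suffix) is a hypothetical toy model, not a real sync engine.

```python
from typing import List, NamedTuple


class Paragraph(NamedTuple):
    pid: str
    text: str


class Insert(NamedTuple):
    pid: str     # paragraph being typed into
    offset: int  # character offset within that paragraph
    text: str


def _apply_insert(p: Paragraph, ins: Insert) -> Paragraph:
    return Paragraph(p.pid, p.text[:ins.offset] + ins.text + p.text[ins.offset:])


def merge_delete_wins(doc: List[Paragraph], ins: Insert, deleted: str) -> List[Paragraph]:
    """Converges, but text typed into the deleted paragraph is lost."""
    result = []
    for p in doc:
        if p.pid == deleted:
            continue
        result.append(_apply_insert(p, ins) if p.pid == ins.pid else p)
    return result


def merge_keep_typed_text(doc: List[Paragraph], ins: Insert, deleted: str) -> List[Paragraph]:
    """Also converges, but keeps the typed text as a new paragraph in place."""
    result = []
    for p in doc:
        if p.pid == deleted:
            if ins.pid == deleted:
                result.append(Paragraph(p.pid + "-rescued", ins.text))
            continue
        result.append(_apply_insert(p, ins) if p.pid == ins.pid else p)
    return result
```

Given a three-paragraph document where one user types into `p2` while another deletes `p2`, the first function returns only `p1` and `p3`, and the second returns `p1`, a new paragraph holding just the typed text, then `p3`. Both satisfy the "math/engineering" requirement; only the second matches the PM's request, and it had to be written by hand for this one case.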