Comment by jb_s

5 years ago

This immediately struck me when I was reading this article.

To be honest, this whole paradigm seems absurdly fucking efficient for the developers. But I wonder about stuff like:

* What happens if the data model needs to change? If you need to move something from db["some/path"]?

* How is it coordinated at a larger scale, how does everyone know what is running and how it interacts with everything else - can you figure out what depends on an object? What if the data used by your Price(Security) object changes and breaks it?

> What happens if the data model needs to change?

You write conversions and register them in a registry, where the unpickler picks them up. If necessary, you can also customize the logic that determines which version is used to deserialize a given pickled datum. There aren't many guardrails when you're writing that stuff, but the infrastructure does its best to support you.
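To make the registry idea concrete, here's a rough sketch in plain Python. This is not the real infrastructure's API: `CONVERSIONS`, `CURRENT`, `conversion`, `current`, `dump` and `load` are all names I made up, and the v1 layout of `Security` is an assumption for the sake of the example. Each stored datum carries its type name and schema version, and the loader chains one-step conversions until the state matches the current class.

```python
import pickle
from dataclasses import dataclass, asdict

# Hypothetical registry: (type name, from_version) -> function that upgrades
# a state dict by exactly one schema version.
CONVERSIONS = {}
# Hypothetical table of the current class for each stored type name.
CURRENT = {}

def conversion(type_name, from_version):
    """Register a one-step upgrade for `type_name` starting at `from_version`."""
    def register(fn):
        CONVERSIONS[(type_name, from_version)] = fn
        return fn
    return register

def current(version):
    """Mark a class as the current schema for its name."""
    def register(cls):
        cls.VERSION = version
        CURRENT[cls.__name__] = cls
        return cls
    return register

@current(version=2)
@dataclass
class Security:
    ticker: str
    exchange: str

@conversion("Security", from_version=1)
def security_v1_to_v2(state):
    # Assumed v1 layout: a single "symbol" field such as "VOD.L".
    ticker, _, exchange = state["symbol"].partition(".")
    return {"ticker": ticker, "exchange": exchange}

def dump(obj):
    """Pickle an envelope of (type name, schema version, state dict)."""
    cls = type(obj)
    return pickle.dumps((cls.__name__, cls.VERSION, asdict(obj)))

def load(blob):
    """Unpickle the envelope and apply conversions up to the current version."""
    type_name, version, state = pickle.loads(blob)
    cls = CURRENT[type_name]
    while version < cls.VERSION:
        state = CONVERSIONS[(type_name, version)](state)
        version += 1
    return cls(**state)

# A datum written back when Security was still at version 1.
old_blob = pickle.dumps(("Security", 1, {"symbol": "VOD.L"}))
upgraded = load(old_blob)
```

The lack of guardrails shows up here too: if a conversion for some intermediate version is missing, `load` just fails with a `KeyError` at read time.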

> If you need to move something from db["some/path"]?

There's support for both symlinks (db["some/path"] -> db["other/path"]) and a kind of hardlink, where both paths point to the same inode-like id. You can usually find a way to do what you need to.
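A toy version of the two link types, assuming a store where paths map to ids and ids map to values. The `Store` and `Symlink` classes and the `symlink`, `hardlink` and `move` methods are hypothetical names for illustration, not the actual database interface.

```python
class Symlink:
    """Path entry that redirects to another path."""
    def __init__(self, target):
        self.target = target

class Store:
    """Toy path store: paths map to object ids, ids map to values."""
    def __init__(self):
        self._objects = {}   # object id -> value
        self._paths = {}     # path -> object id (int) or Symlink
        self._next_id = 0

    def _resolve(self, path, max_hops=16):
        # Follow symlinks until we reach an object id; cap hops to catch cycles.
        for _ in range(max_hops):
            entry = self._paths[path]
            if not isinstance(entry, Symlink):
                return entry
            path = entry.target
        raise RuntimeError("too many levels of symlinks")

    def __getitem__(self, path):
        return self._objects[self._resolve(path)]

    def __setitem__(self, path, value):
        if path in self._paths:
            # Existing path: write through to the object it resolves to.
            self._objects[self._resolve(path)] = value
        else:
            self._next_id += 1
            self._paths[path] = self._next_id
            self._objects[self._next_id] = value

    def symlink(self, link_path, target_path):
        self._paths[link_path] = Symlink(target_path)

    def hardlink(self, new_path, existing_path):
        # Both paths now point at the same object id.
        self._paths[new_path] = self._resolve(existing_path)

    def move(self, old_path, new_path):
        # Relocate the entry and leave a symlink behind for old readers.
        self._paths[new_path] = self._paths[old_path]
        self._paths[old_path] = Symlink(new_path)

db = Store()
db["some/path"] = 101.5
db.move("some/path", "other/path")         # old path keeps working via symlink
db.hardlink("alias/path", "other/path")    # second name for the same object
db["alias/path"] = 99.0                    # write is visible through every path
```

The difference matters when moving things: the symlink breaks if "other/path" is later removed, while the hardlink keeps the object reachable as long as any path still holds its id.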

> How is it coordinated at a larger scale, how does everyone know what is running and how it interacts with everything else - can you figure out what depends on an object? What if the data used by your Price(Security) object changes and breaks it?

There's a common model for the things that are shared, and that has a versioning and release/deprecation cycle. Otherwise every type has an owner, and you probably had to request their permission to read their data, so you should have a channel of communication with them. But yeah, people do rely on the fundamental business entities not changing too quickly, and things do break when changes are made.

  • There's also a graph debugger that allows you to step through the dependency graph node-by-node across the various globally distributed databases.

    • True but not really helpful for this problem, because it can only tell you about the job you're debugging, whereas what you want to know is what code might ever depend on that data.