
Comment by dpflan

7 months ago

Can you elaborate on how you have “adapted…message causality topologies to cope with consuming mechanisms” in relation to the example of a bank account? The causality topology being what here: couldn’t one say MoneyIn should come before, else there can be no true MoneyOut?

Right on, great question. Some examples:

Example Option 1

You give up on the guarantees across partition keys (bank accounts), and you accept that balances will not reflect a causally consistent state of the past.

E.g., Bob deposits 100, Bob sends 50 to Alice.

Balances:
Bob 0,   Alice 50  # the source system was never in this state
Bob 100, Alice 50  # the source system was never in this state
Bob 50,  Alice 50  # eventually consistent final state
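To make that interleaving concrete, here is a minimal sketch (plain Python, no Kafka; the account and event names are hypothetical) of two per-account partitions being consumed at independent speeds:

```python
from collections import defaultdict

# Hypothetical per-account partitions: events are ordered only *within* a key.
partitions = {
    "bob":   [("MoneyIn", 100), ("MoneyOut", 50)],  # deposit, then transfer out
    "alice": [("MoneyIn", 50)],                      # transfer in
}

balances = defaultdict(int)

def apply(account, event):
    kind, amount = event
    balances[account] += amount if kind == "MoneyIn" else -amount
    print(dict(balances))

# One possible interleaving: Alice's partition is consumed first.
# This prints states the source system was never in (e.g. Alice 50 while Bob is still 0).
apply("alice", partitions["alice"][0])
apply("bob", partitions["bob"][0])
apply("bob", partitions["bob"][1])
```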

Example Option 2

You give up on parallelism and consume in total order (i.e., a single partition / unit of parallelism; e.g., in Kafka, set a partitioner that always hashes to the same value).
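A minimal sketch of Option 2, assuming the kafka-python client; the broker address, topic name, and the constant partitioner are hypothetical, and any client that lets you pin the partition works the same way:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python (assumed client)

# Partitioner that ignores the key and always picks the same partition,
# giving one totally ordered stream (and no parallelism).
def single_partitioner(key_bytes, all_partitions, available_partitions):
    return all_partitions[0]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    partitioner=single_partitioner,
    value_serializer=lambda v: json.dumps(v).encode(),
)

# Every event, regardless of account, lands on the one partition.
producer.send("bank-events", key=b"bob",   value={"type": "MoneyIn",  "amount": 100})
producer.send("bank-events", key=b"bob",   value={"type": "MoneyOut", "amount": 50})
producer.send("bank-events", key=b"alice", value={"type": "MoneyIn",  "amount": 50})
producer.flush()
```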

Example Option 3

In the consumer you "wait" whenever you get a message that violates causal order.

E.g., Bob deposits 100; Bob sends 50 to Alice (Bob-MoneyOut 50 -> Alice-MoneyIn 50).

If we attempt to consume Alice-MoneyIn before Bob-MoneyOut, we exponentially back off from the partition containing Alice-MoneyIn.

(Option 3 is terrible because of O(n^2) processing time in the worst case and the possibility of deadlocks: two partitions waiting for one another.)
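For illustration, a minimal sketch of the Option 3 waiting logic, assuming each message carries the id of the event it causally depends on (field and helper names are hypothetical, and a real consumer would keep serving other partitions while it waits):

```python
import time

seen = set()  # ids of events already applied

def try_consume(message, max_retries=8):
    """Apply `message` only once its causal dependency has been seen.

    Otherwise back off exponentially from the partition; this loop is where
    the O(n^2) worst case and the cross-partition deadlock risk come from.
    """
    delay = 0.01
    for _ in range(max_retries):
        dep = message.get("depends_on")
        if dep is None or dep in seen:
            seen.add(message["id"])
            return True          # safe to apply
        time.sleep(delay)        # back off from this partition
        delay *= 2
    return False                 # give up / park the message

# Alice-MoneyIn depends on Bob-MoneyOut; consuming it first forces a wait.
try_consume({"id": "alice-in-1", "depends_on": "bob-out-1"})
try_consume({"id": "bob-out-1"})
```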

  • Thanks. With these examples of how messages appear in time and in physical location in Kafka, how have you adapted your consumers? Which scenario / architectural decision (one of the examples?) did you move forward with, and how did you build support for your desired causality handling?

    • Option 1, but after so many years banging our heads against the wall reasoning about this, we hoped someone would eventually give us a queue that supports arbitrary causal dependency graphs.

      We thought about building it ourselves, because we know the data structures, high level algorithms, and disk optimizations required. BUT we pivoted our company, so we've postponed this for the foreseeable future. After all, theory is relatively easy, but a true production grade implementation takes years.

  • Hmmm... could this be solved by "vector clocks"? If producers emit something that depends on a previous event, they send the id of the previous event. (So, like capabilities, you need proof of "data access".)

    Or is the problem that this is again O(n^2)? (Because the consumers would then need to buffer [potentially] n key streams, and then search them every time, so "n" times.)
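A minimal sketch of the dependency-id buffering described in the comment above, assuming each message carries the id of the event it depends on (all field names hypothetical): messages with an unseen dependency are parked per dependency id and replayed once it arrives, which is exactly the per-key buffers and lookups that make it look O(n^2).

```python
from collections import defaultdict

seen = set()
parked = defaultdict(list)  # dependency id -> messages waiting on it

def deliver(msg, apply):
    dep = msg.get("depends_on")
    if dep is not None and dep not in seen:
        parked[dep].append(msg)       # buffer until the dependency shows up
        return
    apply(msg)
    seen.add(msg["id"])
    for waiting in parked.pop(msg["id"], []):  # release anything parked on us
        deliver(waiting, apply)

# Out-of-order arrival: Alice's MoneyIn first, then Bob's MoneyOut.
deliver({"id": "alice-in-1", "depends_on": "bob-out-1"}, print)
deliver({"id": "bob-out-1"}, print)
```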