
Comment by wvh

8 months ago

The problem is that you don't know who's listening. You don't want all possible interested parties to hammer the database. Hence the events in between. Arguably, I'd not use Kafka to store actual data, just to notify in-flight.

But you do know who's listening, because you were the one who installed all the listeners. ("you" can be plural.)

This reminds me of the OOP vs DOD debate again. OOP adherents say they don't know all the types of data their code operates on; DOD adherents say they actually do, since their program contains a finite number of classes, and a finite subset of those classes can be the ones called in this particular virtual function call.

What you mean is that your system is structured in such a way that it's as if you don't know who's listening. Which is okay, but you should be explicit that it's a design choice, not a law of physics, so when that design choice no longer serves you well, you have the right to change it.

(Sometimes you really don't know, because your code is a library or has a plugin system. In such cases, this doesn't apply.)
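The "you do know your listeners" point can be made concrete with a plain observer pattern: an explicit, in-process registry where the full set of subscribers is visible in the code. A minimal sketch (the `OrderEvents` class and event shape are made up for illustration):

```python
from typing import Callable

# A hypothetical event: some order was placed elsewhere in the system.
OrderPlaced = dict

class OrderEvents:
    """An explicit, in-process listener registry (observer pattern).

    Unlike a broker, the complete set of listeners is visible right
    here in the code: "not knowing who's listening" is a design
    choice, not a law of physics.
    """

    def __init__(self) -> None:
        self._listeners: list[Callable[[OrderPlaced], None]] = []

    def subscribe(self, listener: Callable[[OrderPlaced], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: OrderPlaced) -> None:
        # Every interested party is called directly; no broker in between.
        for listener in self._listeners:
            listener(event)

events = OrderEvents()
seen = []
events.subscribe(lambda e: seen.append(("billing", e["id"])))
events.subscribe(lambda e: seen.append(("shipping", e["id"])))
events.publish({"id": 42})
print(seen)  # [('billing', 42), ('shipping', 42)]
```

Swapping this registry for a broker later is exactly the kind of design change the comment argues you retain the right to make.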

> Arguably, I'd not use Kafka to store actual data, just to notify in-flight.

I believe people did this initially and then discovered the non-Kafka copy of the data was redundant, so they got rid of it or relegated it to the status of a cache. This style of design is called Event Sourcing.
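The Event Sourcing idea in one sketch: the append-only event log is the source of truth, and any "current state" is just a fold over it that can be discarded and recomputed. (Event names here are invented for illustration.)

```python
from dataclasses import dataclass

# Hypothetical account events; the append-only log is the source of truth.
@dataclass
class Deposited:
    amount: int

@dataclass
class Withdrew:
    amount: int

def replay(events) -> int:
    """Rebuild the current balance purely from the event log.

    Any materialized balance elsewhere is just a cache of this fold
    and can be thrown away and recomputed at any time.
    """
    balance = 0
    for e in events:
        if isinstance(e, Deposited):
            balance += e.amount
        elif isinstance(e, Withdrew):
            balance -= e.amount
    return balance

log = [Deposited(100), Withdrew(30), Deposited(5)]
print(replay(log))  # 75
```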

  • I have worked for financial institutions where random departments have random interests in certain data transactions. You (as in a dev team in one such department) have no say in who touches the data, from where, or how it's used. Kafka is used as a corporate message bus to let e.g. the accounting department know something happened in another department. Those "listening" departments don't have devs and are not involved in development; they operate more on the MS BI level of PowerPoint.

    So yes, in large companies, your development team is just a small cog, you don't set policy for what happens to the data you gather. And in some sectors, like finances, you are an especially small cog with little power, which might sound strange if you only ever worked for a software startup.

In some databases that's not a problem. Oracle has a built-in, horizontally scalable message queue engine that's transactional with the rest of the database. You can register a series of SELECT queries and be notified when the results have (probably) changed, either via direct TCP server push or via a queued message for pickup later. It's not polling-based: the transaction engine knows which query predicates to keep an eye out for.

Disclosure: I work part time for Oracle Labs and know about these features because I'm using them in a project at the moment.
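The mechanism being described is predicate-based change notification. A toy in-process model of the idea (this is deliberately not Oracle's actual API; it just illustrates "the engine checks writes against registered predicates and pushes, instead of clients polling"):

```python
from typing import Callable

class WatchedTable:
    """Toy model of predicate-based change notification.

    NOT Oracle's real interface; it only illustrates the principle:
    writes are checked against registered query predicates, and
    matching subscribers are pushed to, so nobody polls.
    """

    def __init__(self) -> None:
        self._rows: list[dict] = []
        # Each registration pairs a predicate with a callback.
        self._watchers: list[
            tuple[Callable[[dict], bool], Callable[[dict], None]]
        ] = []

    def register(self, predicate, callback) -> None:
        # Analogous to registering a SELECT and asking for server push.
        self._watchers.append((predicate, callback))

    def insert(self, row: dict) -> None:
        self._rows.append(row)
        # The "transaction engine" checks the write against registered
        # predicates and notifies matching subscribers directly.
        for predicate, callback in self._watchers:
            if predicate(row):
                callback(row)

t = WatchedTable()
hits = []
t.register(lambda r: r["dept"] == "accounting", hits.append)
t.insert({"dept": "engineering", "amount": 1})
t.insert({"dept": "accounting", "amount": 2})
print(hits)  # [{'dept': 'accounting', 'amount': 2}]
```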

  • I know as an Oracle employee you don't want to hear this, but part of the problem is that you are no longer database-agnostic if you do this.

    The messaging tech being separate from the database tech means the architects can swap out the database if needed in the future without needing to rewrite the producers and consumers.

    • I don't work on the database itself, so it's neither here nor there to me. Still, the benefits of targeting the lowest common denominator must be weighed against the costs. Not having scalable transactions imposes a huge drain on the engineering org that sucks up time and imposes opportunity costs.

For read-only queries, hammer away: I can scale reads nigh infinitely horizontally. There's no secret sauce that makes Kafka the only system that can do this.

  • It's not about scaling reads, but about coordinating consumers so that no more than one consumer processes the same messages. That requires some kind of locking, and locking means scaling issues.
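For what it's worth, Kafka-style consumer groups avoid per-message locking by assigning whole partitions to consumers: every message lands in exactly one partition, and every partition has exactly one owner. A sketch of that idea (names and the static assignment are illustrative, not Kafka's API; real Kafka rebalances dynamically as consumers join and leave):

```python
import zlib

NUM_PARTITIONS = 4
CONSUMERS = ["consumer-a", "consumer-b"]

def partition_for(key: str) -> int:
    # Messages with the same key deterministically land in the same
    # partition, so they are always handled by the same consumer.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def owner_of(partition: int) -> str:
    # Static round-robin assignment; exactly one consumer owns each
    # partition, so no lock is taken per message.
    return CONSUMERS[partition % len(CONSUMERS)]

messages = ["order-1", "order-2", "order-1", "order-3"]
processed_by = {m: owner_of(partition_for(m)) for m in messages}
```

The coordination cost is paid once per rebalance instead of once per message, which is why this scales where per-message locking doesn't.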