Comment by nitwit005
8 months ago
> When producing a record to a topic and then using that record for materializing some derived data view on some downstream data store, there’s no way for the producer to know when it will be able to "see" that downstream update. For certain use cases it would be helpful to be able to guarantee that derived data views have been updated when a produce request gets acknowledged, allowing Kafka to act as a log for a true database with strong read-your-own-writes semantics.
Just don't use Kafka.
Write to the downstream datastore directly. Then you know your data is committed and you have a database to query.
The problem is that you don't know who's listening. You don't want all possible interested parties to hammer the database. Hence the events in between. Arguably, I'd not use Kafka to store actual data, just to notify in-flight.
But you do know who's listening, because you were the one who installed all the listeners. ("you" can be plural.)
This reminds me of the OOP vs DOD (data-oriented design) debate again. OOP adherents say they don't know all the types of data their code operates on; DOD adherents say they actually do, since their program contains a finite number of classes, and only a finite subset of those classes can be the ones reached by any particular virtual function call.
What you mean is that your system is structured in such a way that it's as if you don't know who's listening. Which is okay, but you should be explicit that it's a design choice, not a law of physics, so when that design choice no longer serves you well, you have the right to change it.
(Sometimes you really don't know, because your code is a library or has a plugin system. In such cases, this doesn't apply.)
> Arguably, I'd not use Kafka to store actual data, just to notify in-flight.
I believe people did this initially and then discovered the non-Kafka copy of the data is redundant, so got rid of it, or relegated it to the status of a cache. This type of design is called Event Sourcing.
I have worked for financial institutions where random departments have random interests in certain data transactions. You (as in a dev team in one such department) have no say in who touches the data, from where, or how it's used. Kafka is used as a corporate message bus to let e.g. the accounting department know that something happened in another department. Those "listening" departments don't have devs and aren't involved in development; they operate more at the MS BI / PowerPoint level.
So yes, in large companies, your development team is just a small cog, you don't set policy for what happens to the data you gather. And in some sectors, like finances, you are an especially small cog with little power, which might sound strange if you only ever worked for a software startup.
In some databases that's not a problem. Oracle has a built-in, horizontally scalable message queue engine that's transactional with the rest of the database. You can register a series of SELECT queries and be notified when the results have (probably) changed, either via direct TCP server push or via a queued message for pickup later. It's not polling-based; the transaction engine knows which query predicates to keep an eye out for.
Disclosure: I work part time for Oracle Labs and know about these features because I'm using them in a project at the moment.
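A minimal sketch of what registering for that kind of query change notification can look like from Python with the cx_Oracle driver; the credentials, DSN, table, and query below are made up for illustration, and the connecting account would need the CHANGE NOTIFICATION privilege:

```python
import time
import cx_Oracle

# Called by the driver (server push, no polling) when the registered
# query's result set has probably changed.
def on_change(message):
    for query in message.queries or []:
        for table in query.tables or []:
            print("change detected on", table.name)

# events=True enables the client-side notification listener.
conn = cx_Oracle.connect("app_user", "app_pw", "dbhost/orclpdb1", events=True)

subscr = conn.subscribe(
    callback=on_change,
    operations=cx_Oracle.OPCODE_INSERT | cx_Oracle.OPCODE_UPDATE,
    qos=cx_Oracle.SUBSCR_QOS_QUERY,  # query-level, not just table-level, notification
)

# Register the predicate the transaction engine should watch.
subscr.registerquery("select order_id from orders where status = 'NEW'")

# Keep the process alive so the callback thread can receive notifications.
time.sleep(3600)
```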
I know as an Oracle employee you don't want to hear this, but part of the problem is that you are no longer database-agnostic if you do this.
The messaging tech being separate from the database tech means the architects can swap out the database if needed in the future without needing to rewrite the producers and consumers.
For read-only queries, hammer away; I can scale reads nigh-infinitely horizontally. There's no secret sauce that makes it so that only Kafka can do this.
It's not about scaling reads, but about coordinating consumers so that no more than one consumer processes the same messages. That means some kind of locking, and that means scaling issues.
Alternatively, your write doesn't have to be fire-and-forget: downstream datastores can also write to Kafka (this time fire-and-forget), and the initial client can wait for that event to acknowledge the initial write.
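A rough sketch of that pattern with the confluent-kafka Python client; the broker address, topic names, and the assumption that the downstream materializer echoes the correlation-id key onto an acknowledgment topic are all made up for illustration:

```python
import uuid
from confluent_kafka import Producer, Consumer

BROKERS = "localhost:9092"          # placeholder
WRITE_TOPIC = "orders"              # hypothetical topic the client writes to
ACK_TOPIC = "orders-materialized"   # hypothetical topic the downstream store writes to

correlation_id = str(uuid.uuid4())

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": f"ack-waiter-{correlation_id}",  # throwaway group per request
    "auto.offset.reset": "latest",
})
consumer.subscribe([ACK_TOPIC])
consumer.poll(1.0)  # join the group and get an assignment before producing

producer = Producer({"bootstrap.servers": BROKERS})
# The write itself, keyed by a correlation id the downstream materializer
# is assumed to echo back onto ACK_TOPIC once its derived view is updated.
producer.produce(WRITE_TOPIC, key=correlation_id, value=b'{"item": "x"}')
producer.flush()

# Block until the downstream store confirms it has applied this write.
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    if msg.key() and msg.key().decode() == correlation_id:
        break  # read-your-own-writes point reached

consumer.close()
```

Of course the client now blocks on the end-to-end latency of the whole pipeline, which is exactly the guarantee the quoted passage wants Kafka to provide natively.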
Writing directly to the datastore ignores the need for queuing the writes. How do you solve for that need?
Why do you need to queue the writes?
A proficient coder can write a program to accomplish a task in the singular.
In the plural, accomplishing that task in a performant way at enterprise scale seems to involve turning every function call into an asynchronous, queued service of some sort.
Which then begets additional deployment and monitoring services.
A queued problem requires a cute solution, bringing acute pain.
Some writes might fail, you may need to retry, the data store may be temporarily unavailable, etc.
There may be many things that go wrong and how you handle this depends on your data guarantees and consistency requirements.
If you're not queuing, what are you doing when a write fails, throwing away the data?
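For what it's worth, "queuing" doesn't have to mean a broker; even the minimal in-process version looks something like this sketch, where write_to_datastore and DatastoreUnavailable are stand-ins for whatever client call and transient error your store actually has:

```python
import random
import time
from collections import deque

class DatastoreUnavailable(Exception):
    """Stand-in for whatever transient error your datastore client raises."""

def write_to_datastore(record):
    """Stand-in for the real client call; fails randomly to exercise retries."""
    if random.random() < 0.5:
        raise DatastoreUnavailable()

pending = deque()  # local fallback queue for writes that keep failing

def write_with_retry(record, attempts=5, base_delay=0.5):
    # Retry a direct write with exponential backoff; if the store stays
    # down, park the record locally instead of throwing the data away.
    for attempt in range(attempts):
        try:
            write_to_datastore(record)
            return True
        except DatastoreUnavailable:
            time.sleep(base_delay * 2 ** attempt)
    pending.append(record)  # to be retried later, not dropped
    return False

write_with_retry({"order_id": 42, "status": "NEW"})
```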
Yeah... Not happening when you have scores of clients running down your database.
The reason message queue systems exist is scale. Good luck sending a notification at 9am to your 3 million users and keeping your database alive in the sudden influx of activity. You need to queue that load.
Kafka is itself a database. Sending a message requires what is essentially a database insert. You're still doing a DB commit either way.
It's more of a commit log/write-ahead log/replication stream than a DBMS - consider that DBMSs typically include these in addition to their primary storage.
Of course, if you don't have separate downstream and upstream datastores, you don't have anything to do in the first place.