Comment by davidelettieri
3 months ago
It's worth mentioning Debezium https://debezium.io/
It allows publishing all changes from the DB to Kafka.
Perhaps the situation has gotten better since I looked a few years ago, but my experience is the Debezium project doesn’t really guarantee exactly-once delivery. Meaning that if row A is replaced by row B, you might see (A, -1), (A, -1), (B, +1), if for example Debezium was restarted at precisely the wrong time. Then if you’re using this stream to try to keep track of what’s in the database, you will think you have negatively many copies of A.
It sounds silly, but it caused enormous headaches and problems for the project I was working on (Materialize), one of whose main use cases is creating incrementally maintained live materialized views on top of replicated Postgres (or MySQL) data.
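One common consumer-side mitigation for such replays is to deduplicate on the source position (for Postgres, the LSN) that each Debezium change event carries. A minimal sketch, assuming events arrive as dicts with a `source.lsn` field (field names follow Debezium's Postgres event envelope; the rest is illustrative):

```python
def dedupe_by_lsn(events):
    """Drop change events already applied, keyed on the source LSN.

    At-least-once delivery can replay events after an unclean
    restart; replays carry the same source position, so filtering
    on a monotonically advancing LSN makes the consumer idempotent.
    """
    last_lsn = -1  # highest LSN applied so far (persist this in practice)
    for event in events:
        lsn = event["source"]["lsn"]
        if lsn <= last_lsn:
            continue  # duplicate from a replayed batch; skip it
        last_lsn = lsn
        yield event

# The replayed (A, -1) event is filtered out:
events = [
    {"source": {"lsn": 100}, "op": "d", "row": "A"},
    {"source": {"lsn": 100}, "op": "d", "row": "A"},  # duplicate
    {"source": {"lsn": 101}, "op": "c", "row": "B"},
]
assert [e["source"]["lsn"] for e in dedupe_by_lsn(events)] == [100, 101]
```

In practice `last_lsn` has to be persisted atomically with the applied changes (e.g. in the same transaction as the materialized state), otherwise the dedup state itself can go stale across restarts.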
Debezium published this doc on exactly-once delivery with their most recent 3.3.0 version. They don't support it natively, but say it can be achieved via Kafka Connect:
https://debezium.io/documentation/reference/stable/configura...
You could probably achieve something similar with the NATS JetStream sink as well, which has similar capabilities - though I think it doesn't have quite the same guarantees.
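For context, the Kafka Connect mechanism being referred to is the KIP-618 exactly-once support for source connectors, which is enabled with settings along these lines (property names are from Kafka Connect; this is a sketch, not a complete deployment config):

```properties
# Connect worker (distributed mode) configuration:
# all workers in the cluster must have this enabled
exactly.once.source.support=enabled

# Source connector configuration:
# fail the connector if exactly-once cannot be provided
exactly.once.support=required
```

With this in place, Connect wraps the connector's produced records and offset commits in Kafka transactions, so a restarted source task doesn't re-emit records that were already committed.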
I switched to using Debezium a few months ago, after a Golang alternative to Debezium + Kafka Connect - ConduitIO - went defunct. I should have been using Debezium all along, as it is clearly the most mature, most stable option in the space, with the best prospects for long-term survival. Highly recommended, even if it is JVM (though they're currently doing some interesting stuff with Quarkus and GraalVM that might lead to a jvm-free binary at some point)
Debezium generally produces each change event exactly once if there are no unclean connector shut-downs. If that's not the case, I'd consider this a bug which ought to be fixed.
(Disclaimer: I used to lead the Debezium project)
The problem is that unclean connector shutdowns are a thing that can happen in real life.
Does it handle the things the post mentions about the ever-growing WAL, and the fact that some listeners can go offline and need to get back old messages (e.g. if Kafka crashes)?
Robustness is a key design goal of Debezium. It supports heartbeating to address WAL growth issues (wrote about that issue at [1]). If Kafka crashes (or Debezium itself), it will resume consuming the replication slot from where it left off before, applying at-least-once semantics, i.e. there can be duplicates in case of an unclean shut-down.
Naturally, if the consumer is down, WAL retained for that replication slot continues to grow until it comes back up again, hence monitoring is key (or, if the slot gets invalidated at a configured threshold, the connector restarts with a new initial snapshot).
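The heartbeating mentioned above is configured on the connector; a minimal sketch using the Debezium Postgres connector's heartbeat properties (the interval value and heartbeat table name are illustrative):

```properties
# Emit a heartbeat event every 10 s, so the connector keeps
# confirming processed LSNs back to Postgres even when the
# captured tables see no traffic
heartbeat.interval.ms=10000

# Optionally also write to a heartbeat table on each heartbeat,
# so the slot's restart_lsn advances on otherwise idle databases
heartbeat.action.query=INSERT INTO debezium_heartbeat (ts) VALUES (now())
```

Without this, a replication slot on a low-traffic database can pin WAL indefinitely, since Postgres only releases WAL segments once the slot's confirmed position moves past them.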
Disclaimer: I used to lead the Debezium project
[1] https://www.morling.dev/blog/mastering-postgres-replication-...