Comment by oulipo

5 days ago

Seems interesting, but I'm not sure what duplication means in this context. Is Kafka sending the same row several times, and for what reasons?

Could you give practical examples where duplication happens?

My use case is IoT: devices connect over MQTT and send batches of data, and each time we ingest a batch we stream all of its rows into the database. Since we only ingest each batch once, I don't think there can really be duplicates, so I don't think I'm the target of your solution,

but I'm still curious in which cases such things happen, and why Kafka or ClickHouse couldn't dedup themselves using some primary key or something?

Thanks for asking those questions. Duplicates often come from how systems interact with Kafka, not from Kafka itself. For example, if a service retries sending a message after a timeout, or if you collect similar data from multiple sources (like CRMs and web apps), you can end up with the same event multiple times. Kafka guarantees at-least-once delivery, so it doesn't remove duplicates by itself.
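To illustrate the retry scenario, here's a minimal Python sketch assuming the confluent-kafka client; the topic and event fields are made up for the example. If the application doesn't see the delivery confirmed in time and sends again, the topic can end up holding the same event twice:

```python
import json
from confluent_kafka import Producer  # assumption: confluent-kafka client

producer = Producer({"bootstrap.servers": "localhost:9092"})

def send_with_retry(topic, event, attempts=3):
    """Naive application-level retry: if delivery isn't confirmed in time,
    send again. If the first copy did reach the broker, the topic now
    contains the same event twice (at-least-once delivery)."""
    for _ in range(attempts):
        producer.produce(topic, value=json.dumps(event).encode("utf-8"))
        still_queued = producer.flush(timeout=5)  # wait up to 5s for acks
        if still_queued == 0:
            return  # confirmed from the client's point of view
        # Timed out: the broker may already have the message, but we retry anyway.
    raise RuntimeError("delivery not confirmed after retries")

send_with_retry("orders", {"event_id": "order-123", "status": "paid"})
```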

ClickHouse doesn't enforce primary keys; it stores whatever you send. ReplacingMergeTree and FINAL exist in ClickHouse, but they are not ideal for real-time streams: deduplication only happens during background merges, so until those merges finish, queries either return duplicates or need the expensive FINAL modifier to get correct results.
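To make that concrete, here is a small sketch of the ReplacingMergeTree/FINAL behavior, assuming the clickhouse-connect Python client; the table and columns are invented for the example:

```python
import clickhouse_connect  # assumption: clickhouse-connect client

client = clickhouse_connect.get_client(host="localhost")

# ReplacingMergeTree only collapses rows sharing the same ORDER BY key
# during background merges; there is no primary-key constraint at insert time.
client.command("""
    CREATE TABLE IF NOT EXISTS events (
        event_id String,
        payload  String,
        ts       DateTime
    )
    ENGINE = ReplacingMergeTree(ts)
    ORDER BY event_id
""")

# Until a merge has run, a plain SELECT can still count both copies of a row.
plain = client.query("SELECT count() FROM events WHERE event_id = 'order-123'")

# FINAL forces deduplication at query time, at a noticeable cost on large
# real-time tables, which is the limitation mentioned above.
deduped = client.query("SELECT count() FROM events FINAL WHERE event_id = 'order-123'")
```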

With GlassFlow, you clean the data streams before they hit ClickHouse, which keeps query results correct and reduces the load on ClickHouse.
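This isn't GlassFlow's actual implementation, just a rough Python sketch of the general pattern of dropping duplicates before they reach ClickHouse. It assumes every event carries a unique event_id; a real pipeline would use a persistent keyed state store rather than an in-memory set:

```python
seen_ids = set()  # stand-in for durable, keyed deduplication state

def deduplicate(events):
    """Yield only the first occurrence of each event_id."""
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # drop the duplicate before it is inserted downstream
        seen_ids.add(event["event_id"])
        yield event
```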

In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.

  • Thanks, interesting! In my case each batch has a unique "batch id", and it's ingested into Postgres/Timescale, so it will dedup on that key
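For completeness, deduplicating on ingest with a unique key in Postgres/Timescale typically looks like the sketch below (psycopg2, with a made-up table and key columns); rows that collide with an existing key are simply skipped:

```python
import psycopg2  # assumption: plain psycopg2 against Postgres/TimescaleDB

conn = psycopg2.connect("dbname=iot user=ingest")
with conn, conn.cursor() as cur:
    # A unique constraint on (batch_id, device_id, ts) is what makes the dedup work;
    # re-sending an already ingested batch then becomes a no-op.
    cur.execute(
        """
        INSERT INTO readings (batch_id, device_id, ts, value)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (batch_id, device_id, ts) DO NOTHING
        """,
        ("batch-42", "sensor-7", "2024-01-01T00:00:00Z", 21.5),
    )
```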