Comment by ashishbagri
5 days ago
- We used our custom ClickHouse sink, which inserts records in batches using the ClickHouse native protocol (as recommended by ClickHouse). Each insert is done in a single transaction, so if an insertion fails, partial records do not end up in ClickHouse.
- The way the system is architected, this cannot happen. If the deduplication server crashes, the entire pipeline is stopped and nothing is inserted. Currently, when we read data successfully from Kafka into our internal NATS JS, we acknowledge the new offset to Kafka; deduplication and insertion happen after that. The current limitation is that if our system crashes before inserting into ClickHouse (but after the ack to Kafka), we would never process that data. We are already working towards a solution for this.
Right. I think this is fundamental though. You can minimize the chance of duplicates but not avoid them completely under failure, given ClickHouse's guarantees. Also note that transactions have certain limitations as well (re: partitions).
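To make the window concrete, here's a rough sketch of the ordering described in the quoted comment, assuming segmentio/kafka-go and clickhouse-go v2 (the NATS JS hop and the dedup step are elided, and the topic/table/column names are hypothetical, not the actual sink):

```go
// Sketch of "ack to Kafka first, insert into ClickHouse after" -- not the real sink.
package main

import (
	"context"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "ingest",
		Topic:   "events", // hypothetical topic
	})
	defer reader.Close()

	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"localhost:9000"}})
	if err != nil {
		log.Fatal(err)
	}

	msg, err := reader.FetchMessage(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// The offset is committed to Kafka here, before the ClickHouse insert...
	if err := reader.CommitMessages(ctx, msg); err != nil {
		log.Fatal(err)
	}

	// ...so a crash anywhere in this window loses the record (Kafka won't redeliver it),
	// while blindly retrying an insert whose outcome is unknown can produce a duplicate.
	batch, err := conn.PrepareBatch(ctx, "INSERT INTO events_raw (key, value)") // hypothetical table
	if err != nil {
		log.Fatal(err)
	}
	if err := batch.Append(string(msg.Key), string(msg.Value)); err != nil {
		log.Fatal(err)
	}
	if err := batch.Send(); err != nil {
		log.Fatal(err)
	}
}
```

The usual way to close that window is to flip the ordering: commit the Kafka offset only after the ClickHouse batch has been acknowledged, which trades possible data loss for possible duplicates on redelivery.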
I'm curious who your customers are. I work for a large tech company, and we use Kafka and ClickHouse in our stack, but we would generally build things like this in-house.
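FWIW, one way to blunt the duplicate side of that trade-off is ClickHouse's insert_deduplication_token setting: retrying an identical block with the same token gets dropped by the server, subject to the table's deduplication window (non_replicated_deduplication_window for plain MergeTree). A rough sketch, again assuming clickhouse-go v2, with a made-up table and token scheme:

```go
// Sketch only: the same batch retried with the same insert_deduplication_token
// is deduplicated server-side instead of being inserted twice.
package main

import (
	"context"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

func insertBatch(ctx context.Context, conn driver.Conn, token string, rows [][]any) error {
	// Attach the per-batch token as a query-level setting.
	ctx = clickhouse.Context(ctx, clickhouse.WithSettings(clickhouse.Settings{
		"insert_deduplication_token": token,
	}))

	batch, err := conn.PrepareBatch(ctx, "INSERT INTO events_raw (key, value)") // hypothetical table
	if err != nil {
		return err
	}
	for _, row := range rows {
		if err := batch.Append(row...); err != nil {
			return err
		}
	}
	return batch.Send()
}

func main() {
	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"localhost:9000"}})
	if err != nil {
		log.Fatal(err)
	}
	rows := [][]any{{"k1", "v1"}, {"k2", "v2"}}

	// First attempt and a blind retry with the same (made-up) token: the second
	// insert of the identical block is dropped by the server.
	if err := insertBatch(context.Background(), conn, "batch-000042", rows); err != nil {
		log.Println("first attempt:", err)
	}
	if err := insertBatch(context.Background(), conn, "batch-000042", rows); err != nil {
		log.Println("retry:", err)
	}
}
```

It only helps for retries of the exact same block with the same token, so it narrows the duplicate window rather than eliminating it, which is consistent with the "minimize but not avoid" point above.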