Comment by ram_rar
5 days ago
What use cases would this be effective for compared to using a ReplacingMergeTree (RMT) in ClickHouse, which eventually (usually within a short period of time) can handle dups itself? We had issues with dups that we solved using RMT and query-time filtering.
Great question! RMT can work well when eventual consistency is acceptable and real-time accuracy isn't critical. But in use cases where results need to be correct immediately (dashboards, alerts, monitoring, etc.), waiting on background merges doesn't work.
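For context, here's a minimal sketch of the RMT-plus-query-time-filtering setup ram_rar describes (table and column names are made up for illustration):

```sql
-- ReplacingMergeTree keeps only the latest row per sorting key,
-- but only after a background merge has run
CREATE TABLE events
(
    event_id String,
    payload  String,
    ts       DateTime
)
ENGINE = ReplacingMergeTree(ts)
ORDER BY event_id;

-- Until a merge happens, duplicates are still visible:
SELECT count() FROM events;        -- may overcount

-- FINAL forces deduplication at query time, at extra cost:
SELECT count() FROM events FINAL;  -- correct, but slower
```

The FINAL modifier is what makes query-time filtering correct before merges run, and that extra per-query cost is exactly the trade-off at issue here.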
Here are two more detailed examples:
Real-time fraud detection in logistics: say you're streaming events from multiple sources (payments, GPS devices, user actions) into a dashboard that should trigger alerts when anomalies happen. Retries, partial system failures, etc. produce duplicates. Relying on RMT means incorrect counts until merges happen, which can lead to missed fraud, late interventions, and so on.
Event collection from multiple systems like CRM + e-commerce + tracking: similar user or transaction data can come from multiple systems (e.g., CRM, Shopify, internal event logs). The same action might appear in slightly different formats across streams, causing duplicates in Kafka. ClickHouse can store all of these, but it doesn't enforce primary keys, so you end up with misleading results until RMT resolves them.
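A common workaround in the meantime (a sketch, assuming an `event_id` key and a `ts` version column as above) is to deduplicate per query with GROUP BY / argMax instead of FINAL:

```sql
-- Pick the latest version of each event at query time,
-- without waiting for ReplacingMergeTree merges:
SELECT
    event_id,
    argMax(payload, ts) AS payload
FROM events
GROUP BY event_id;
```

This gives correct results immediately, but every query pays the aggregation cost, which is the pain the original approach is trying to remove.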