
Comment by hodgesrm

5 days ago

How is this better than using ReplacingMergeTree in ClickHouse?

RMT dedups automatically, albeit with a potential cost at read time and extra work to design the schema for performance. The latter requires knowledge of the application to do correctly: you need to ensure that keys always land in the same partition, or dedup becomes incredibly expensive for large tables. These are real issues, to be sure, but they have the advantage that the behavior is relatively easy to understand.

Edit: clarity
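
To make the schema-design point above concrete, here is a minimal sketch of a ReplacingMergeTree table, assuming the clickhouse-connect Python client; the table, columns, and partition scheme are hypothetical:

```python
# Sketch only: a ReplacingMergeTree table illustrating why the dedup key
# must always land in the same partition. Table and column names are
# made up for the example.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS events
    (
        event_id   String,
        event_time DateTime,
        payload    String
    )
    ENGINE = ReplacingMergeTree(event_time)
    -- Dedup only happens when parts inside the SAME partition are merged.
    -- If duplicates of one event_id arrive with timestamps in different
    -- months, they land in different partitions and are never collapsed.
    PARTITION BY toYYYYMM(event_time)
    ORDER BY event_id  -- rows sharing the same event_id collapse on merge
""")
```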

Good question! RMT does deduplication, but its dependence on background merges that you can't control can lead to incorrect query results until the merge is complete. We wanted something that removes duplicates in real time. GlassFlow moves deduplication upstream, before data hits ClickHouse. If you think of it from a pipeline perspective, we believe it is easier to understand, since it is simply a block that sits before ClickHouse.
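
To illustrate the "block before ClickHouse" idea, here is a minimal, hypothetical sketch of windowed deduplication upstream of the database; it is not GlassFlow's implementation, just the general shape of dropping repeated keys before rows are inserted:

```python
import time


class WindowedDeduplicator:
    """Drop events whose key was already seen within a bounded time window."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window = window_seconds
        self.last_seen: dict[str, float] = {}  # event_id -> last-seen time

    def is_duplicate(self, event_id: str) -> bool:
        now = time.monotonic()
        # Evict expired entries so memory stays bounded by the window.
        self.last_seen = {
            k: t for k, t in self.last_seen.items() if now - t < self.window
        }
        if event_id in self.last_seen:
            return True
        self.last_seen[event_id] = now
        return False


# Filter a batch before it is inserted into ClickHouse.
dedup = WindowedDeduplicator(window_seconds=600)
batch = [{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}]
unique_rows = [row for row in batch if not dedup.is_duplicate(row["event_id"])]
# unique_rows -> one row each for "a" and "b"
```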

  • RMT does not depend on background merges completing to give correct results, as long as you use FINAL to force a merge on read (see the sketch below). The tradeoff is that performance suffers.

    I'm a fan of what you are trying to do but there are some hard tradeoffs in dedup solutions. It would be helpful if your site defined exactly what you mean by deduplication and what tradeoffs you have made to solve it. This includes addressing failures in clustered Kafka / ClickHouse, which is where it becomes very hard to ensure consistency.
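
As a rough sketch of the FINAL tradeoff described in this thread, again assuming clickhouse-connect and the hypothetical events table from the earlier example:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Without FINAL: fast, but may count duplicate rows for an event_id
# whose parts have not been merged yet.
raw = client.query("SELECT count() FROM events")

# With FINAL: ClickHouse collapses rows sharing the ORDER BY key at read
# time, so the result is already deduplicated -- at the cost of doing
# merge work inside the query.
deduped = client.query("SELECT count() FROM events FINAL")

print(raw.result_rows[0][0], deduped.result_rows[0][0])
```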