Comment by saisrirampur

5 days ago

Neat project! Quick question: will this work only if the entire row is a duplicate? Or also if just a set of columns (e.g. the primary key) conflicts and you guarantee that only the latest version of the conflicting row is present? I'm assuming the former, because you are deduping before data is ingested into ClickHouse. I could be missing something, so I wanted to confirm.

- Sai from ClickHouse

Thanks, Sai! Great question. The deduplication works based on the user-defined key, not the entire row. You can specify which field (e.g. a primary key like event_id) to use as the deduplication key. Within a defined time window, GlassFlow guarantees that only the first event with a given key will be forwarded to ClickHouse. Subsequent duplicates are rejected. Our idea was to keep ClickHouse as clean as possible.

  • Got it. Thanks for the clarification. That might not work if the ingested row represents an UPDATE. We do this in Postgres CDC by replicating an UPDATE as a new version of the row, and it is that latest version you want to retain. For most customers, using FINAL (with the correct ORDER BY key as needed) works well for deduplication, and query performance is still great. In cases where it isn't, customers typically resort to tuning faster merges with ReplacingMergeTree, or to Materialized Views (either aggregating or refreshable), to manage deduplication.

    Anyway, great work so far! I like how well you articulated the problem. Best wishes.
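The distinction raised in the thread is between first-wins dedup (keep the first event per key, drop the rest) and last-version-wins dedup, which is roughly what ReplacingMergeTree provides at merge time. A simplified sketch of the latter, with hypothetical field names (`id`, `version`); note that real ClickHouse merges are asynchronous, so queries may still see multiple versions until a merge or FINAL collapses them:

```python
def collapse_versions(rows, key_field="id", version_field="version"):
    """Keep only the highest-version row per key.

    Roughly mimics what ReplacingMergeTree does when it merges parts:
    rows sharing the same sorting key collapse to the one with the
    largest version column. Simplified single-pass, in-memory sketch.
    """
    latest = {}
    for row in rows:
        k = row[key_field]
        if k not in latest or row[version_field] > latest[k][version_field]:
            latest[k] = row
    return list(latest.values())
```

With this semantics, an UPDATE replicated as a new, higher-version row replaces the old one; a first-wins pre-ingestion dedup on the key alone would instead drop the update, which is the concern raised above.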