
Comment by super_ar

5 months ago

Thanks, Sai! Great question. The deduplication is based on a user-defined key, not the entire row. You can specify which field (e.g. a primary key like event_id) to use as the deduplication key. Within a defined time window, GlassFlow guarantees that only the first event with a given key will be forwarded to ClickHouse. Subsequent duplicates are rejected. Our idea was to keep ClickHouse as clean as possible.
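
To make the "first event per key within a time window" behavior concrete, here is a minimal Python sketch. It is not GlassFlow's implementation: the class name `WindowedDeduplicator`, the in-memory dictionary, and the assumption that timestamps arrive in non-decreasing order are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Hashable


class WindowedDeduplicator:
    """Hypothetical sketch: forward only the first event per key within a window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # Maps each key to the timestamp of the event that was last forwarded.
        self._first_seen: Dict[Hashable, float] = {}

    def accept(self, key: Hashable, timestamp: float) -> bool:
        """Return True if the event should be forwarded, False if it is a duplicate."""
        first = self._first_seen.get(key)
        if first is not None and timestamp - first < self.window_seconds:
            return False  # same key seen inside the window: reject
        # New key, or the window for this key has expired: forward and restart the window.
        self._first_seen[key] = timestamp
        return True


dedup = WindowedDeduplicator(window_seconds=60)
# (event_id, timestamp in seconds); assumed to arrive in time order.
events = [("e1", 0.0), ("e2", 5.0), ("e1", 10.0), ("e1", 70.0)]
forwarded = [(k, t) for k, t in events if dedup.accept(k, t)]
```

Here the second `e1` (at 10 s) is dropped, while the third (at 70 s) is forwarded because the 60-second window has passed. A real system would also need to evict expired keys to bound memory, which this sketch omits.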

Got it. Thanks for the clarification. That might not work if the ingested row represents an UPDATE. In Postgres CDC, we replicate an UPDATE as a new version of the row, and that new version is what you want to retain. For most customers, using FINAL (with the correct ORDER BY key as needed) works well for deduplication, and query performance is still great. In cases where it isn't, customers typically resort to tuning ReplacingMergeTree for faster merges, or to Materialized Views (either aggregating or refreshable), to manage deduplication.
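
For contrast with first-event-wins deduplication, here is a small Python sketch of "latest version wins", roughly the result you get from ReplacingMergeTree with a version column once rows are merged or read with FINAL. It only illustrates the semantics, not ClickHouse internals; the `Row` layout and the `keep_latest` helper are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (key, version, payload).
Row = Tuple[str, int, str]


def keep_latest(rows: Iterable[Row]) -> List[Row]:
    """Keep, for each key, the row with the highest version (later row wins ties)."""
    latest: Dict[str, Row] = {}
    for row in rows:
        key, version, _ = row
        current = latest.get(key)
        if current is None or version >= current[1]:
            latest[key] = row
    return sorted(latest.values())


# An INSERT for u1, an INSERT for u2, then an UPDATE to u1 replicated as version 2.
rows = [("u1", 1, "alice@old"), ("u2", 1, "bob"), ("u1", 2, "alice@new")]
result = keep_latest(rows)
```

With first-event-wins, the `u1` update would be discarded as a duplicate; here it replaces the original row, which is the behavior the CDC case needs.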

Anyway, great work so far! I like how well you articulated the problem. Best wishes.