
Comment by caust1c

5 days ago

How does the deduplication itself work? The blog didn't have many details.

I'm curious because it's no small feat to do scalable deduplication in any system. You have to worry about network latencies if your deduplication mechanism is not on localhost, the partitioning/sharding of data in the source streams, and handling failures when writing to the destination, any of which can cripple throughput.

I helped maintain the Segmentio deduplication pipeline so I tend to be somewhat skeptical of dedupe systems that are light on details.

https://www.glassflow.dev/blog/Part-5-How-GlassFlow-will-sol...

https://segment.com/blog/exactly-once-delivery/

Thanks for your question. In GlassFlow, we use NATS JetStream to power deduplication (and its KV store for joins as well). I see from your blog post that Segment used RocksDB to power their deduplication pipeline. We actually considered RocksDB but chose NATS JetStream because of the added complexity of scaling RocksDB (since RocksDB is embedded in the worker process). There is indeed a small network latency in our deduplication pipeline, but our measured end-to-end latency is under 50ms.