
Comment by the_arun

5 days ago

Congratulations!!

Questions:

1. Why only to ClickHouse, can’t we make it generic for any DB? Or is it a reference implementation for ClickHouse?

2. Similarly, why only from Kafka?

3. Any default load testing done?

Thanks for taking a look!

1. The current implementation targets only ClickHouse because we started with the segment of users building real-time analytics with ClickHouse in their stack. However, we learned along the way that streaming deduplication is a challenge for other destination databases as well. The architecture of our tool is designed so that we can extend the sinks and add additional destinations; we would just have to write a sink component specific to that database (a rough, hypothetical sketch of what such a sink could look like follows answer 3 below). Do you have a specific DB in mind that you would like to use?

2. Again, we started with Kafka because of our early target users, but the architecture inherently supports adding multiple sources. We already have experience building source and sink connectors (from our previous project), so adding additional sources would not be too challenging. Which source do you have in mind?

3. Yes. Running the tool locally in Docker on a MacBook Pro (M2), it was able to handle 15k requests per second. We have built load testing infrastructure and are happy to share the code if you are interested (a minimal sketch of what such a driver can look like follows right below).
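
If it helps as a starting point, a minimal sketch of a throughput driver along these lines, assuming a plain Kafka producer via the confluent-kafka package (broker address, topic name, and message count are placeholders, not our actual harness):

    import time
    from confluent_kafka import Producer  # assumes the confluent-kafka package is installed

    # Hypothetical minimal load driver: produce small JSON payloads as fast as
    # possible and report the achieved rate. Broker and topic are placeholders.
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    topic = "dedup-load-test"
    total = 1_000_000

    start = time.time()
    for i in range(total):
        payload = f'{{"event_id": {i}, "ts": {time.time()}}}'.encode()
        try:
            producer.produce(topic, value=payload)
        except BufferError:
            producer.poll(0.5)   # local queue full: let it drain, then retry once
            producer.produce(topic, value=payload)
        producer.poll(0)         # serve delivery callbacks
    producer.flush()

    elapsed = time.time() - start
    print(f"produced {total} messages at {total / elapsed:.0f} msg/s")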
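And to make answer 1 a bit more concrete, a rough, hypothetical sketch of what a pluggable sink could look like. This is illustrative only, not our actual interface; the class and method names are made up, and the ClickHouse example assumes a clickhouse-connect client:

    from abc import ABC, abstractmethod

    # Hypothetical sink interface -- just to show what "write a sink component
    # for that database" could mean, not the project's real API.
    class Sink(ABC):
        @abstractmethod
        def write_batch(self, rows: list[dict]) -> None:
            """Write one deduplicated batch to the destination."""

        @abstractmethod
        def close(self) -> None:
            """Flush pending data and release connections."""

    class ClickHouseSink(Sink):
        def __init__(self, client, table: str = "events"):
            self.client = client   # e.g. a clickhouse-connect Client (assumption)
            self.table = table

        def write_batch(self, rows: list[dict]) -> None:
            if not rows:
                return
            columns = list(rows[0].keys())
            data = [[row[c] for c in columns] for row in rows]
            self.client.insert(self.table, data, column_names=columns)

        def close(self) -> None:
            self.client.close()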

AFAICT, there are native connector implementations for ClickHouse and Kafka, so it's plug and play with them specifically.

OTOH, for deduplication you mostly need timestamps and a good hash (like SHA-512); you don't need to store the actual messages, so a naive approach should work with basically any event source: look up the hash, compare the timestamps, and skip the message if the hashes match. But you need to write your own ingestion and output logic, maybe emulating whatever protocol you're using, if you want the whole thing to be a drop-in node in your pipeline.
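
For what it's worth, a minimal sketch of that naive approach in Python (the consume/emit hooks at the end are hypothetical placeholders for your own ingestion and output logic):

    import hashlib
    import time

    # Naive in-memory deduplicator along the lines described above: keep only a
    # SHA-512 digest and a last-seen timestamp per message, never the payload.
    class Deduplicator:
        def __init__(self, window_seconds: float = 3600.0):
            self.window = window_seconds
            self.seen: dict[str, float] = {}   # digest -> last-seen timestamp

        def is_duplicate(self, payload: bytes) -> bool:
            digest = hashlib.sha512(payload).hexdigest()
            now = time.time()
            last = self.seen.get(digest)
            self.seen[digest] = now
            # A duplicate is a message whose hash was already seen within the window.
            return last is not None and (now - last) <= self.window

        def evict_expired(self) -> None:
            # Call periodically to bound memory: drop digests older than the window.
            cutoff = time.time() - self.window
            self.seen = {d: ts for d, ts in self.seen.items() if ts >= cutoff}

    # Ingestion/output are up to you, e.g.:
    # dedup = Deduplicator(window_seconds=3600)
    # for msg in consume_from_source():       # your own ingestion logic
    #     if not dedup.is_duplicate(msg):
    #         emit_to_sink(msg)                # your own output logic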

  • Yes, it's true that if you just want to send data from Kafka to ClickHouse and do not worry about duplicates, there are several ways to do it. We even covered them in a blog post -> https://www.glassflow.dev/blog/part-1-kafka-to-clickhouse-da...

    However, the reason we started building this is that duplication is a sad reality in streaming pipelines, and the methods to clean up duplicates in ClickHouse are not good enough (again covered extensively on our blog, with references to the ClickHouse docs).

    The deduplication approach you mention is 100% accurate. The goal in building this tool is to provide a drop-in node for your pipeline (just as you said), with optimised source and sink connectors for reliability and durability.