Comment by YZF

5 days ago

Afaik this is always best effort, e.g.: https://clickhouse.com/docs/operations/settings/settings#ins...

"For the replicated tables by default the only 100 of the most recent blocks for each partition are deduplicated"

This doesn't work under failure conditions either (again afaik), e.g. if the ClickHouse server fails.

100 is the default, and can be changed at runtime.
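
For illustration, here is roughly how that setting can be changed at runtime. This is only a sketch using the clickhouse-driver Python client; the table name is made up and assumed to use a Replicated*MergeTree engine.

  # Rough sketch: adjusting the deduplication window at runtime.
  # The table `events` is hypothetical; replicated_deduplication_window is a
  # MergeTree-level setting (100 by default, 0 disables deduplication).
  from clickhouse_driver import Client

  client = Client(host="localhost")
  client.execute(
      "ALTER TABLE events MODIFY SETTING replicated_deduplication_window = 1000"
  )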

The deduplication works regardless of server restarts, and it does not matter if a retried request goes to another replica, because it is implemented with distributed consensus (Raft) via clickhouse-keeper.
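
To make the "deduplicated retry" concrete, here is a rough sketch of a client-side retry loop (again using clickhouse-driver; the table, schema, and error handling are simplified assumptions). The point is that resending the exact same batch produces the same block hash, so a replica that already committed the block treats the retry as a no-op.

  # Rough sketch of an insert retry that relies on block deduplication.
  from clickhouse_driver import Client
  from clickhouse_driver.errors import Error as ClickHouseError

  client = Client(host="localhost")
  rows = [(1, "a"), (2, "b"), (3, "c")]

  for attempt in range(5):
      try:
          # The block hash is derived from the inserted data, so the batch
          # must be identical across retries for deduplication to apply.
          client.execute("INSERT INTO events (id, value) VALUES", rows)
          break  # success, or a deduplicated no-op on a retry
      except (ClickHouseError, OSError):
          # After a timeout or connection error we do not know whether the
          # block was committed; retrying is safe because a committed
          # block's hash is still in the deduplication window.
          continue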

  • Ah, interesting. So some hash of the batch is recorded in the distributed log after the batch has been written? To disk? Isn't there still a race there?

    At least intuitively, it seems very hard to guarantee anything more than "at least once", but I might be missing something.

    • It is more complex.

      The batch is written to a temporary directory. Then we have to do an atomic commit into three different places: on disk, which can also be external storage (by renaming the temporary directory); in memory (into the data structure containing the snapshot); and in Keeper (which holds the metadata, including these hashes, and is the only source of truth).

      The metadata in Keeper is the only place that decides which data exists and when. The whole operation commits the data files first, then commits to Keeper, then commits to memory, and only then responds to the client.

      Now you have to analyze what happens, and what has to be resolved, if the server is killed between any of these steps.

      If it is killed after writing the data files and before writing to Keeper, the data is not considered written, but there is garbage left in the storage to collect.

      If it is killed after writing the data files and after writing to Keeper, but before committing to memory or answering the client, the data is written, but the client does not know it. This is the situation where the client has to retry, and the retry will be deduplicated.

      If the data is written to the storage, but the transaction to Keeper (the only source of truth) does not go through due to a network error even after retries, the server immediately attempts to roll back the changes in the storage, and the client receives an exception. If the client does not receive that exception because of yet another network error, the client will have to retry.

      It is quite easy if you look at it this way: Keeper is the central place, the only place that does distributed transactions, and everything else is arranged around it. A rough sketch of the whole sequence is below.
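
      Purely as illustration (not actual ClickHouse code, and all class and method names here are made up), the commit sequence and its failure cases could be sketched like this:

        # Illustrative pseudocode of the commit order described above;
        # none of these objects or methods exist in ClickHouse itself.
        def commit_insert(block, storage, keeper, memory, client):
            tmp_dir = storage.write_to_temporary_directory(block)

            # 1. Commit the data files (rename the temporary directory).
            #    Dying here leaves garbage to collect, because Keeper never
            #    recorded the part, so the data is not considered written.
            part_dir = storage.rename_to_final(tmp_dir)

            # 2. Commit metadata, including the block hash, to Keeper.
            #    Keeper is the single source of truth: only after this step
            #    does the data officially exist.
            try:
                keeper.commit_metadata(part_dir, block.hash())
            except ConnectionError:
                # The transaction did not go through even after retries:
                # roll back the storage changes and report an error.
                storage.remove(part_dir)
                raise

            # 3. Commit to memory (the data structure holding the snapshot).
            #    Dying between step 2 and this point means the data is written
            #    but the client never hears about it; its retry is then
            #    deduplicated by the hash stored in Keeper.
            memory.add_part(part_dir)

            # 4. Only now acknowledge the client.
            client.acknowledge()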