Comment by oconnor663
23 days ago
As far as I know, most CDC schemes requires a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.
You can always use a divide and conquer strategy to compute the chunks. Chunk both halves of the file independently. Once that’s done, you redo the chunking around the midpoint of the file forward, until it starts to match the chunks obtained previously.