Comment by nh2

16 hours ago

Is it certain that this is the reason?

rsync's man page says "pipelining of file transfers to minimize latency costs" and https://rsync.samba.org/how-rsync-works.html says "Rsync is heavily pipelined".

If pipelining is really in rsync, there should be no "dead time between transfers".


dekhn 14 hours ago

The simple model for scp and rsync (it's likely more complex in rsync): for loop over all files. for each file, determine its metadata with fstat, then fopen and copy bytes in chunks until done. Proceed to next iteration.
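A minimal sketch of that simple model, written in Python purely for illustration. It is not how scp or rsync are actually implemented; the function name `copy_tree_sequentially` and the chunk size are assumptions made up for this example.

```python
import os

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def copy_tree_sequentially(src_dir, dst_dir):
    """Copy every regular file under src_dir to dst_dir, one file at a time.

    Returns the number of files copied. Hypothetical helper, not scp/rsync code.
    """
    copied = 0
    for root, _dirs, files in os.walk(src_dir):
        target_root = os.path.join(dst_dir, os.path.relpath(root, src_dir))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            info = os.stat(src)  # per-file metadata lookup
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                while True:
                    chunk = fin.read(CHUNK_SIZE)  # copy bytes in chunks
                    if not chunk:
                        break
                    fout.write(chunk)
            os.chmod(dst, info.st_mode & 0o777)  # carry over permission bits
            copied += 1  # only now proceed to the next file
    return copied
```

Every file pays for its own stat, open, and close before the next one starts, which is where the per-file overhead in this model comes from.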

I don't know what rsync does on top of that (pipelining could mean many different things), but in my experience copying one 1 TB file is far faster than copying 1 billion 1 KB files (both sum to ~1 TB). Load balancing, partitioning, or parallelizing the tool when copying large numbers of small files also leads to significant speedups, likely because the parallelism hides the per-file overhead (and because it keeps things moving when individual copies stall due to TCP or whatever else).

I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel, which I do not think it does, while tools like rclone, kopia, and aws sync all take advantage of parallelism (multiple ongoing file lookups and copies).
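To make "multiple ongoing file lookups and copies" concrete, here is a rough sketch using a thread pool. It only illustrates the general idea; it is not taken from rclone, kopia, or aws sync, and the helper names and the default of 8 workers are assumptions for this example.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def list_files(src_dir):
    """Return the paths of all regular files under src_dir, relative to it."""
    rel_paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            rel_paths.append(os.path.relpath(os.path.join(root, name), src_dir))
    return rel_paths


def copy_one(src_dir, dst_dir, rel_path):
    """Copy a single file, creating its parent directory if needed."""
    dst = os.path.join(dst_dir, rel_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copyfile(os.path.join(src_dir, rel_path), dst)
    return rel_path


def copy_tree_parallel(src_dir, dst_dir, workers=8):
    """Copy all files with several copies in flight at once.

    Returns the number of files copied. The worker count is an arbitrary
    assumption; a good value depends on the storage and the network.
    """
    rel_paths = list_files(src_dir)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(lambda p: copy_one(src_dir, dst_dir, p), rel_paths))
    return len(done)
```

While one worker waits on a slow open or a stalled write, the others keep going, so the per-file latency overlaps instead of adding up.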

mschuster91 8 hours ago

> I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying 1 1 TB file is far faster than copying 1 billion 1k files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (in addition to dealing with individual copies stalling due to TCP or whatever else).
That's because of fast paths:
- For a large file, assuming the disk isn't fragmented to hell and beyond, there isn't much to do for rsync / the kernel: the source reads data and copies it to the network socket, the receiver copies data from the incoming network socket to the disk, the kernel just dumps it in sequence directly to the disk, that's it.
- The slightly less performant path is on a fragmented disk. The source and network still don't have much to do, but the kernel has a bit more work every now and then to find a contiguous block on the disk to write the data to. For spinning rust HDDs, the disk also has to do some seeking.
- Many small files? Now that's more nasty. First, the source side has to do a lot of stat(2) calls to get basic attributes of the file. For HDDs, that seeking can incur a sometimes significant latency penalty as well. Then, this information needs to be transferred to the destination, the destination has to do the same stat call again, and then the source needs to transfer the data, involving more seeking, and the destination has to write it.
- The utter worst case is when the files are plentiful and small, but still too large to fit into an inode as inline data [1]. That means two writes and thus seeks per small file. Utterly disastrous for performance.
And that's before stepping into stuff such as systems disabling write caches, soft-RAID (or the impact of RAID in general), journaling filesystems, filesystems with additional metadata...
[1] https://archive.kernel.org/oldwiki/ext4.wiki.kernel.org/inde...
nh2 14 hours ago
> I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel
No, that is not the question. Even Wikipedia explains that rsync is single-threaded. And even if it were multithreaded or "otherwise" used concurrent file I/O, that would be beside the point:
The question is whether rsync _transmission_ is pipelined or not, meaning: Does it wait for 1 file to be transferred and acknowledged before sending the data of the next?
Somebody has to go check that.
If yes: then parallel filesystem access won't matter, because a network roundtrip has brutally higher latency than reading data sequentially off an SSD.
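A back-of-the-envelope model of why that would matter. Everything here is an assumption picked for illustration (1 million files of 1000 bytes, a 20 ms roundtrip, a 1 Gbit/s link); it models the two cases in the abstract and says nothing about what rsync actually does.

```python
def unpipelined_seconds(n_files, rtt_s, total_bytes, bytes_per_s):
    """Transfer time if every file waits one roundtrip for its acknowledgement."""
    return n_files * rtt_s + total_bytes / bytes_per_s


def pipelined_seconds(n_files, rtt_s, total_bytes, bytes_per_s):
    """Transfer time if files are streamed back to back; latency is paid once."""
    return rtt_s + total_bytes / bytes_per_s


# Assumed numbers, not measurements.
N_FILES = 1_000_000
FILE_SIZE = 1000           # bytes per file
RTT = 0.020                # 20 ms roundtrip
BANDWIDTH = 125_000_000    # 1 Gbit/s expressed in bytes per second

slow = unpipelined_seconds(N_FILES, RTT, N_FILES * FILE_SIZE, BANDWIDTH)
fast = pipelined_seconds(N_FILES, RTT, N_FILES * FILE_SIZE, BANDWIDTH)
print(f"unpipelined: {slow:.2f} s, pipelined: {fast:.2f} s")
```

Under these assumptions the unpipelined case spends about 20,000 seconds just waiting on roundtrips, while the payload itself needs about 8 seconds on the wire.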
- dekhn 14 hours ago
  
  Note that rsync on many small files is slow even within the same machine (across two physical devices), suggesting that the network roundtrip latency is not the major contributor.
  
- Dylan16807 9 hours ago
  
  The filesystem access and general threading is the question because transmission is pipelined and not a thing "somebody has to go check". You just quoted the documentation for it.
  The dead time isn't waiting for network trips between files, it's parts of the program that sometimes can't keep up with the network.
  

spockz 15 hours ago

I’m not sure why, but just like with scp, I’ve achieved significant speedups by tarring the directory first (optionally compressing it), transferring it, and then decompressing. Maybe because the tar-and-send step and the receive-and-untar/uncompress step then happen on different threads?

poke646 10 hours ago
One of my "goto" tools is copying files over a "tar pipe". This avoids the temporary tar file. Something like:
tar cf - *.txt | ssh user@host tar xf - -C /some/dir/
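For anyone who wants to see the same idea without the shell, here is a rough Python sketch of a "tar pipe" between two local directories. It is an illustration only: the ssh hop is left out, the function name `tar_pipe_copy` is made up, and the archive is streamed through an OS pipe so no temporary tar file is ever written.

```python
import os
import tarfile
import threading


def tar_pipe_copy(src_dir, dst_dir):
    """Stream src_dir into dst_dir as a tar archive through an OS pipe."""
    read_fd, write_fd = os.pipe()

    def producer():
        # Writer side, roughly the role of "tar cf - ..." in the shell version.
        with os.fdopen(write_fd, "wb") as sink:
            with tarfile.open(fileobj=sink, mode="w|") as archive:
                archive.add(src_dir, arcname=".")

    writer = threading.Thread(target=producer)
    writer.start()

    # Reader side, roughly the role of "tar xf - -C dst_dir".
    with os.fdopen(read_fd, "rb") as source:
        with tarfile.open(fileobj=source, mode="r|") as archive:
            archive.extractall(dst_dir)
        # Drain trailing padding so the writer never hits a closed pipe.
        while source.read(65536):
            pass
    writer.join()
```

The `w|` and `r|` modes tell `tarfile` to treat the file object as a non-seekable stream, which is what makes the pipe work.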
dnmc 5 hours ago

I've never verified this, but it feels like scp starts a new TCP connection per file. If that's the case, then scp-ing a tarred directory would be faster because you only hit the slow start once. https://www.rfc-editor.org/rfc/rfc5681#section-3.1
ndsipa_pomu 1 hour ago

Also handy to note that tar can handle sparse files, whereas scp doesn't.
lelandbatey 14 hours ago
It's typically a disk-latency thing, as just stat-ing the many files in a directory can have significant latency implications (especially on spinning HDDs) vs opening a single file (the tar) and read()-ing that one file into memory before writing to the network.
If copying a folder with many files is slower than tarring that folder and then moving the tar (not counting the untar), then disk latency is your bottleneck.
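A rough way to get a feel for the two costs on your own disk, as a sketch only. The helper names are made up for this example, and the numbers are only meaningful on a cold cache: once the metadata or the file contents are in the page cache, both timings mostly measure memory.

```python
import os
import time


def time_stat_walk(directory):
    """Return (file_count, seconds) for stat-ing every file under directory."""
    start = time.perf_counter()
    count = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            os.stat(os.path.join(root, name))  # one metadata lookup per file
            count += 1
    return count, time.perf_counter() - start


def time_sequential_read(path, chunk_size=1024 * 1024):
    """Return (bytes_read, seconds) for reading one file front to back."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Comparing `time_stat_walk` on the folder against `time_sequential_read` on a tar of the same folder gives a crude indication of how much of the time goes to per-file lookups rather than to moving bytes.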
- ahartmetz 13 hours ago
  
  Not useful very often, but fast and kind of cool: You can also just netcat the whole block device if you wanted a full filesystem copy anyway. Optionally, zero all empty space beforehand with a tool like zerofree, and use on-the-fly compression / decompression with lz4 or lzo. Of course, none of the block devices should be mounted, though you could probably get away with a source that's mounted read-only.
  dd is not a magic tool that can deal with block devices while others can't. You can just cp myLinuxInstallDisk.iso to /dev/myUsbDrive, too.
- spockz 12 hours ago
  
  Okay. In this case the whole operation is faster end to end. That includes the time it takes to tar and untar. Maybe those programs do something more efficient in disk access than scp and rsync?