Comment by nickcw

2 days ago

The fact that sftp is not the fastest protocol is well known to rclone users.

The main problem is that it packetizes the data and waits for responses, effectively re-implementing the TCP window inside a TCP stream. You can only have so many packets outstanding in the standard SFTP implementation (64 is the default) and the buffers are quite small (32k by default), which gives a total of 2 MB of outstanding data. The highest transfer rate you can achieve depends on the latency of the link. If you have 100 ms of latency then you can send at most 20 MB/s, which is about 200 Mbit/s - nowhere near filling a fast, wide pipe.
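
As a back-of-the-envelope check (a standalone sketch, not rclone code; the figures are just the defaults quoted above):

    package main

    import "fmt"

    func main() {
        const (
            requests = 64        // default number of outstanding SFTP requests
            packet   = 32 * 1024 // default SFTP packet payload in bytes
            rtt      = 0.1       // 100 ms round-trip latency, in seconds
        )
        inFlight := float64(requests * packet) // at most 2 MiB on the wire
        rate := inFlight / rtt                 // bytes per second
        fmt.Printf("in flight: %.1f MiB\n", inFlight/(1024*1024))
        fmt.Printf("max rate:  %.1f MiB/s (~%.0f Mbit/s)\n", rate/(1024*1024), rate*8/1e6)
    }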

You can tweak the buffer size (up to 256k I think) and the number of outstanding requests, but you hit limits in the popular servers quite quickly.

To mitigate this, rclone lets you do multipart concurrent uploads and downloads to sftp, so you can have multiple streams each operating at 200 Mbit/s, which helps.
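
Roughly the idea is sketched below (plain Go, not rclone's actual implementation): split the file into ranges and copy each range on its own stream, so several window-limited streams add up. Here a ReadAt on an in-memory buffer stands in for a ranged read over a separate connection.

    package main

    import (
        "bytes"
        "fmt"
        "sync"
    )

    // copyParts copies src into a buffer using several concurrent range
    // reads - the same shape as a multipart download, where each worker
    // is limited by its own window but the workers run in parallel.
    func copyParts(src *bytes.Reader, size, partSize int64, streams int) []byte {
        dst := make([]byte, size)
        offsets := make(chan int64)
        var wg sync.WaitGroup

        for i := 0; i < streams; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for off := range offsets {
                    end := off + partSize
                    if end > size {
                        end = size
                    }
                    // In rclone this would be a ranged read over its own
                    // connection; here it is just ReadAt on the source.
                    if _, err := src.ReadAt(dst[off:end], off); err != nil {
                        panic(err)
                    }
                }
            }()
        }
        for off := int64(0); off < size; off += partSize {
            offsets <- off
        }
        close(offsets)
        wg.Wait()
        return dst
    }

    func main() {
        data := bytes.Repeat([]byte("0123456789abcdef"), 1<<16) // ~1 MiB test data
        out := copyParts(bytes.NewReader(data), int64(len(data)), 64*1024, 4)
        fmt.Println("copied", len(out), "bytes, identical:", bytes.Equal(data, out))
    }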

The fastest protocols are the TLS/HTTP-based ones, which stream data. They open up the TCP window properly, and the kernel and networking stack are well optimized for this use. WebDAV is a good example.

"(sftp) packetizes the data and waits for responses, effectively re-implementing the TCP window inside a TCP stream."

why is it designed this way? what problems is it supposed to solve?

  • Here is some speculation:

    SFTP was designed as a remote file system access protocol rather than a way to transfer a single file like scp.

    I suspect that the root of the problem is that SFTP works over a single SSH channel. SSH connections can have multiple channels, but usually the server binds a single channel to a single executable, so it makes sense to use only one channel.

    Everything flows from that decision - packetisation becomes necessary, otherwise you would have to wait for all the files to transfer before you could do anything else (e.g. list a directory), and that is no good for remote filesystem access.

    Perhaps the packets could have been streamed, but the way it works is more like an RPC protocol with requests and responses. Each request has a serial number which is copied into the response. This means the client can have many requests in flight at once, as sketched at the end of this comment.

    There was a proposal for rclone to use scp for the data connections. So we'd use sftp for the day-to-day file listings, creating directories, etc., but do the actual file transfers with scp. Scp uses one SSH channel per file, so it doesn't suffer from the same problems as sftp. I think we abandoned that idea though, as many sftp servers don't have scp enabled as well. Also, modern versions of OpenSSH (since OpenSSH 9.0, released April 2022) use SFTP instead of scp anyway; as I understand it, this was done to fix various vulnerabilities in scp.
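
    To make the RPC point concrete, here is a toy sketch (my own illustration, not the real SFTP wire format): each request carries an id which the server copies into its reply, so the client can keep many requests in flight and match the replies up as they arrive, interleaving reads with directory listings.

      package main

      import "fmt"

      type request struct {
          id uint32 // copied into the matching response, like SFTP's request id
          op string // e.g. "READ" or "READDIR"
      }

      type response struct {
          id   uint32
          data string
      }

      func main() {
          reqs := make(chan request, 64) // up to 64 requests in flight
          resps := make(chan response, 64)

          // Toy server: answers each request; in real SFTP the replies
          // can come back in any order.
          go func() {
              for r := range reqs {
                  resps <- response{id: r.id, data: r.op + " done"}
              }
              close(resps)
          }()

          // Client: issue several requests without waiting for replies...
          pending := map[uint32]string{}
          for id := uint32(1); id <= 4; id++ {
              op := "READ"
              if id == 2 {
                  op = "READDIR" // a listing interleaved with reads
              }
              pending[id] = op
              reqs <- request{id: id, op: op}
          }
          close(reqs)

          // ...then match each reply to its request by id.
          for r := range resps {
              fmt.Printf("request %d (%s): %s\n", r.id, pending[r.id], r.data)
          }
      }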

  • Notably, the SFTP specification was never completed. We're working off of draft specs, and presumably these issues wouldn't have made it into a final version.

  • Because that is a poor characterization of the problem.

    It just has an in-flight message/queue limit, like basically every other communication protocol. You can only buffer so many messages and reserve so much space for responses before you run out of room. The problem is just that the default amount of buffering is very low and is not adaptive to the available space/bandwidth.

    • Yeah, it's an issue because there is also the per channel application layer flow control. So when you are using SFTP you have the TCP flow control, the SSH layer flow control, and then the SFTP flow control. The maximum receive buffer ends up being the minimum of all three. HPN-SSH (I'm the dev) normalizes the SSH layer flow control to the TCP receive buffer but we haven't done enough work on SFTP except to bump up the buffer size/outstanding requests. I need to determine if this is effective enough or if I need some dynamism in there as well.
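
      To put rough numbers on the stacking (the window sizes below are only illustrative, not measured defaults): whichever layer has the smallest window caps the transfer rate.

        package main

        import "fmt"

        func minInt(a, b int) int {
            if a < b {
                return a
            }
            return b
        }

        func main() {
            // Illustrative sizes only - the real values depend on the stack.
            tcpWindow := 4 * 1024 * 1024 // TCP receive window
            sshWindow := 2 * 1024 * 1024 // SSH channel window
            sftpWindow := 64 * 32 * 1024 // 64 requests x 32 KiB SFTP packets
            rtt := 0.1                   // 100 ms round trip

            effective := minInt(tcpWindow, minInt(sshWindow, sftpWindow))
            fmt.Printf("effective window: %d KiB\n", effective/1024)
            fmt.Printf("max throughput:   %.1f MiB/s\n", float64(effective)/rtt/(1024*1024))
        }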

When you are limited to using SSH as the transport, you can still do better than scp or sftp by using rsync with --rsh="ssh ...".

Besides being faster, rsync with the right command options can be trusted to make exact file copies, together with any file metadata, even between different operating systems and file systems.

I have not checked whether all the bugs of scp and sftp have been fixed in recent years, but some years ago there were cases where scp and sftp silently lost some file metadata, without any warnings (e.g. high-precision timestamps, which were truncated, or extended file attributes).

I use ssh every day, but it has been decades since I last used scp or sftp, except when I have to connect to a server that I cannot control and where rsync happens not to be installed. Even on such servers, if I can add an executable to my home directory, I first copy an rsync binary there with scp, then do any other copies with that rsync.

  • I have the opposite opinion and experience: a simple file copy is pretty trivial with scp, but with rsync - it's a goddamn lottery. Too many options, too many possible modes, so I am never sure the outcome will meet my expectations.

    • You only have to figure out the correct options once; after that you can use the same alias or script for copying forever.

      For instance, I always use this:

        alias ecp='/usr/bin/rsync --archive --xattrs --acls --hard-links --progress --rsh="ssh -p PORT -l USER"'
      

      The default options of scp, like those of cp or of any other UNIX copying program, are bad, as they do not make exact copies.

      In decades of working with computers, I have never wanted to make any kind of copy other than an exact copy, so I never use the default options of cp, scp, rsync, etc.; I always use the same aliases for them, with the options needed for exact copies.

> The fastest protocols are the TLS/HTTP based ones which stream data.

I think maybe you are referring to QUIC [0]? It'd be interesting to see some userspace clients/servers for QUIC that compete with Aspera's FASP [1] and operate on a point-to-point basis like scp. Both use UDP to decrease the overhead of TCP.

0. https://en.wikipedia.org/wiki/QUIC

1. https://en.wikipedia.org/wiki/Fast_and_Secure_Protocol

  • We've been looking at using QUIC as the transport layer in HPN-SSH. It's more of a pain than you might think because it breaks the SSH authentication paradigm and requires QUIC-layer encryption - so a naive implementation would end up encrypting the data twice. I don't want to do that. Mostly what we are thinking about doing is changing the channel multiplexing for bulk data transfers in order to avoid the overhead and buffer issues. If we can rely entirely on TCP for that then we should get even better performance.

    • Yeah, my naive-implementation thought experiment was oriented towards a side channel brokered by the ssh connection, using nginx and curl. Something like: the source opens nginx to share a file and tells the sink via ssh to curl the file from the source with a particular cert.

      However, I observed that curl [0] uses openssl's quic implementation (for one of its experimental HTTP/3 backends). Another backend for curl is Quiche [1], which already has client and server components, the userspace crypto, etc. It's a little confusing to me, but Cloudflare also has a project called quiche [2], which is a Rust crate with a CLI to share and consume files.

      0. https://curl.se/docs/http3.html

      1. https://github.com/google/quiche/tree/main/quiche/quic

      2. https://github.com/cloudflare/quiche

  • Available QUIC implementations are very slow. MsQUIC is one of the fastest and can only reach a meager ~7 Gb/s [1]. Most commercial implementations sit in the 2-4 Gb/s range.

    To be fair, that is not really a problem of the protocol, just the implementations. You can comfortably drive 10x that bandwidth with a reasonable design.

    [1] https://microsoft.github.io/msquic/

    • Thank you for linking to those benchmarks. It's interesting that in the OpenSSL upload case, Windows userspace is 16% faster than Linux.

  • Actually the fastest ones in my experience are the HTTP/1.x ones. HTTP/2 is generally slower in rclone, though I think that is the fault of the stdlib not opening more connections. I haven't really tried QUIC.

    I just think that for streaming lots of data quickly, HTTP/1.x plus TLS plus TCP has received many more engineering hours of optimization than any other combo.

    • Maybe this is one of those "Worse is Better" [0] situations, given that HTTP/1.x will always receive more time/attention/resources than something that might be theoretically superior but never got the resources to fulfill its promise. Cloudflare is probably one of the few organizations outside of Google with an internal economic case to support QUIC. For everyone else there is the option of paying IBM for Aspera, which uses FASP.

      0. https://en.wikipedia.org/wiki/Worse_is_better

Besides limiting the size and number of outstanding IO requests, SFTP also rides on top of SSH, which has its own limited window size.