Comment by dekhn

10 years ago

This also happens in completely normal operation, like if you're using a TCP-based MPI implementation and do an all-versus-all message send. The destination buffers fill quickly from all the senders, the receiver drops the packets, and TCP treats the loss as a timeout after 250ms and retransmits. In principle, using PAUSE frames allows the sender to get feedback to pace its sends.

Took me a long time to debug my MPI performance problems because of this.
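
To make the buffer-overflow effect concrete, here is a rough toy model, not a measurement of any real NIC or switch. It assumes a single fixed-size queue, one packet per sender per tick, and a fixed drain rate; the function name `simulate_incast` and all of its parameters are made up for illustration.

```python
def simulate_incast(senders, packets_per_sender, buffer_slots, drain_per_tick):
    """Toy incast model (hypothetical, for illustration only).

    Each tick, every sender that still has data sends one packet into the
    same queue. Packets arriving at a full queue are dropped. The queue then
    drains up to drain_per_tick packets. Returns (delivered, dropped).
    """
    remaining = [packets_per_sender] * senders
    queued = delivered = dropped = 0
    while any(remaining) or queued:
        for i in range(senders):
            if remaining[i]:
                remaining[i] -= 1
                if queued < buffer_slots:
                    queued += 1      # packet fits in the buffer
                else:
                    dropped += 1     # buffer full: packet is lost
        drained = min(queued, drain_per_tick)
        queued -= drained
        delivered += drained
    return delivered, dropped


if __name__ == "__main__":
    # One sender keeps pace with the drain rate; eight senders overflow the queue.
    print(simulate_incast(1, 10, 4, 1))
    print(simulate_incast(8, 10, 4, 1))
```

With one sender the queue never overflows in this model; with many senders hitting the same queue at once, most of the burst is dropped. The model has no feedback at all, and feedback is what PAUSE frames are meant to provide.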

Uh, no. Alltoall is a challenge for MPI, but not for the reason you describe. TCP windows mean that the receivers aren't the problem. It's all the switch queues in the middle.

  • welp, I measured the problem on my machines, and enabling pause frames on the switches fixed the problem...

  • TCP windows won't save you. TCP has no way to magically know when some buffer is full. Instead it notices packet loss and interprets it as congestion. Which is not what you want, because it can significantly reduce throughput.
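
As a rough illustration of why loss-as-congestion hurts throughput, here is a minimal sketch of additive-increase/multiplicative-decrease window behaviour. It is a simplification, not any real TCP stack: the name `aimd_window` and its parameters are hypothetical, the window is counted in abstract segments, and a real retransmission timeout typically shrinks the window more severely than the halving modelled here.

```python
def aimd_window(loss_events, start=10.0, increase=1.0, decrease=0.5):
    """Toy AIMD model (hypothetical, for illustration only).

    loss_events is a sequence of booleans, one per round trip: True means a
    loss was detected in that round. The window grows by `increase` on a
    clean round and is multiplied by `decrease` on a lossy one, never going
    below one segment. Returns the window size after each round.
    """
    window = start
    history = []
    for lost in loss_events:
        if lost:
            window = max(1.0, window * decrease)  # back off on loss
        else:
            window = window + increase            # probe for more bandwidth
        history.append(window)
    return history


if __name__ == "__main__":
    # A single loss undoes many rounds of slow additive growth.
    print(aimd_window([False, False, True, False]))
```

In this model the sender cannot tell a full buffer from a congested path: any drop cuts the window, and it takes many clean round trips to grow back.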