Comment by ChuckMcM

10 years ago

Also one of the only ways to negotiate your way out of a spanning tree broadcast storm. Generally the firmware on the MAC will reflect a pause frame to the source when its FIFO is full. That happens because the host is not pulling packets out of the FIFO fast enough, or the network has gone bonkers and is sending a gazillion packets per second.
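For anyone who hasn't looked at one, a PAUSE frame is just a tiny MAC control frame (IEEE 802.3x). Here is a rough sketch of what goes on the wire and how long a given pause value holds off the sender. The function names and the source MAC are made up for illustration; the destination address, EtherType, opcode, and the 512-bit quantum come from the standard.

```python
import struct

PAUSE_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved multicast address for MAC control
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
QUANTUM_BITS = 512        # one pause quantum is 512 bit times
MIN_FRAME_NO_FCS = 60     # minimum Ethernet frame size, excluding the 4-byte FCS


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build a PAUSE frame (without FCS). Hypothetical helper, not a driver API."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, quanta)
    # Zero-pad up to the minimum frame size.
    return (header + payload).ljust(MIN_FRAME_NO_FCS, b"\x00")


def pause_duration_seconds(quanta: int, link_bps: float) -> float:
    """How long the link partner is asked to stay quiet."""
    return quanta * QUANTUM_BITS / link_bps


if __name__ == "__main__":
    frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
    print(frame.hex())
    print(pause_duration_seconds(0xFFFF, 1e9))  # maximum pause on a 1 Gb/s link
```

The maximum pause value works out to roughly 33.6 ms on gigabit Ethernet, so a receiver that stays congested has to keep re-sending the frame. A quanta value of 0 tells the sender it may resume.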

The latter can happen when your misconfigured DHCP server gives out an address that other nodes on your network believe to be the broadcast address for the subnet. The device with that ill-fated address will get deluged after every packet it sends as people ack or nak or respond with queries. I saw that happen when a NetGear router had a netmask of 255.255.255.248 which the user copied from the WAN config to the LAN config, but the DHCP server was told the netmask was 255.255.255.0. Hilarity (not) ensued.
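To make the netmask mismatch concrete, here is a quick check with Python's `ipaddress` module. The 192.168.1.0 base address is an assumption for illustration (the actual subnet isn't given above), and the function name is made up.

```python
import ipaddress


def leasable_broadcast_addresses(network_base: str, router_mask: str, dhcp_mask: str):
    """Return the address (if any) that the DHCP server considers a normal host
    but that a device using router_mask treats as the subnet broadcast."""
    router_net = ipaddress.ip_network(f"{network_base}/{router_mask}")
    dhcp_net = ipaddress.ip_network(f"{network_base}/{dhcp_mask}")
    bcast = router_net.broadcast_address
    # hosts() excludes the DHCP network's own network and broadcast addresses.
    return [bcast] if bcast in dhcp_net.hosts() else []


if __name__ == "__main__":
    # Router LAN side thinks /29, DHCP server thinks /24 (base address is hypothetical).
    print(leasable_broadcast_addresses("192.168.1.0", "255.255.255.248", "255.255.255.0"))
```

With those masks, the .7 address is a perfectly ordinary host as far as the DHCP server is concerned, but anything using the /29 mask treats traffic to it as a broadcast.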

This also happens in completely normal operation, like if you're using a TCP-based MPI implementation and do an all-versus-all message send. The destination buffers will fill quickly from all the senders, the receiver drops the packets, and the sender's TCP sees that as a timeout after 250ms and retransmits. In principle, using PAUSE frames allows the sender to get feedback to pace its sends.

Took me a long time to debug my MPI performance problems because of this.

  • Uh, no. Alltoall is a challenge for MPI, but not for the reason you describe. TCP windows mean that the receivers aren't the problem. It's all the switch queues in the middle.

    • welp, I measured the problem on my machines, and enabling pause frames on the switches fixed the problem...

    • TCP windows won't save you. TCP has no way to magically know when some buffer is full. Instead it notices packet loss and interprets it as congestion. Which is not what you want, because it can significantly reduce throughput.
