
Comment by parasubvert

3 years ago

In my decades of experience in telco, capital markets, and core banking, unexplained latency spikes of hundreds of ms are usually analyzed to death as they can have ripple effects. I’ve had 36-hour severity 1 incidents with multiple VPs taking notes on 24/7 conference calls when a distributed system starts showing latency spikes in the 400ms range.

No, the system isn’t going haywire, but 200-400ms is concerning inside a datacenter for core apps.

But let’s forget IT apps, let’s talk about the network. In a network 200ms is catastrophic.

Presumably you know BGP is the very popular distributed system that converges Internet routes?

Inside a datacenter, Bidirectional Forwarding Detection (BFD) is used to drop BGP convergence times to sub-second if you’re using BGP as an IGP. BFD is also useful with other protocols, but anyway. It has heartbeats of 100-300ms. If the network drops heartbeats for 3x that interval, BFD will declare the link down and trigger a round of convergence. This is essential in core networks or telco 4G/5G transport networks.

Of course, flapping can be the consequence of setting too low an interval. Tradeoffs.
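
To put rough numbers on the BFD tradeoff: failure detection time is approximately the heartbeat interval times a detect multiplier. This is only a sketch, assuming the multiplier of 3 mentioned above; the function name is made up for illustration, and real BFD sessions negotiate their intervals between peers.

```python
def bfd_detection_time_ms(interval_ms: int, multiplier: int = 3) -> int:
    """Approximate time before a silent peer is declared down.

    Assumption: detection time = heartbeat interval * detect multiplier.
    Real BFD negotiates the interval per session; this ignores that.
    """
    if interval_ms <= 0 or multiplier <= 0:
        raise ValueError("interval and multiplier must be positive")
    return interval_ms * multiplier


# The 100-300ms heartbeat range from the comment, with a multiplier of 3.
for interval in (100, 300):
    print(interval, "ms heartbeat ->", bfd_detection_time_ms(interval), "ms to detect")
# 100 ms heartbeat -> 300 ms to detect
# 300 ms heartbeat -> 900 ms to detect
```

Both ends of that range stay sub-second, which is why a 200-400ms stall on the wire eats most (or all) of the detection budget and can flap the link.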

Back to the original point, I’ve contributed to the code of equity and bond trading apps, telco apps, core banking systems. And cloud/Kubernetes systems. All RPC distributed systems. Every. Single. One. That performed well… For 30 years! Has enabled TCP_NODELAY. Except when serving up large amounts of streaming data. And the reason fundamentally is that most of the time you have less control over client settings (delayed TCP ACKs), so it’s easier to control the server.
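
For anyone who hasn't set it before, here is roughly what enabling TCP_NODELAY looks like on the server side. A minimal sketch in Python using the standard `socket` module; the helper names `enable_nodelay` and `nodelay_enabled` are just illustrative, not from any of the systems mentioned above.

```python
import socket


def enable_nodelay(sock: socket.socket) -> None:
    """Disable Nagle's algorithm so small writes go out immediately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


# Demo on an unconnected TCP socket; no network traffic is needed.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_nodelay(s)
print(nodelay_enabled(s))
s.close()
```

In a real RPC server you would call `enable_nodelay(conn)` on each socket returned by `accept()`. Whether the option is inherited from the listening socket is platform-dependent, so setting it per connection is the safer assumption.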