Comment by shadowgovt

3 years ago

> The first lesson I learned about Distributed Systems Engineering is the network is never reliable

Yep, and it's a good rule. It's the one Google applies across datacenters.

... but within a datacenter (i.e. where most Go servers are speaking to each other, and speaking to the world-accessible endpoint routers, which are not written in Go), the fabric is assumed to be very clean. If the fabric is not clean, that's a hardware problem that SRE or HwOps needs to address; it's not generally something addressed by individual servers.

(In other words, were the kind of unreliability the article author describes here on their router to occur inside a Google datacenter, it might be detected by the instrumentation on the service made of Go servers, but the solution would be "If it's SRE-supported, SRE either redistributes load or files a ticket to have someone in the datacenter track down the offending faulty switch and smash it with a hammer.")