Comment by liquidgecka
6 years ago
I worked at Google way way back when. We had an emergency code red situation where dozens of engineers from all over the company had to sit in a room and figure out what was overloading our network. After a bit of debugging it became clear that Gmail services were talking to Calendar services with far more traffic than anybody would have expected. A little debugging later it became clear that restarting the Gmail servers fixed the issue. One global rolling restart later and all was well.
But then the real debugging started. Turns out the service discovery component would health check backend destinations once a second. That was fine on its own, since it made sure we would never try to call a server that was long gone. The bug was that it never stopped health checking a backend, even after service discovery had removed that host from the pool long ago. Gmail had stopped deploying while it got ready for Christmas, and Calendar was doing a ton of small stability-improvement deploys. We created the perfect storm for this specific bug.
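To make the failure mode concrete, here's a minimal sketch of that bug pattern in Go. This is my own illustration, not the actual Google RPC code, and all the names (Pool, Add, Remove, the calendar addresses) are made up: a client-side pool starts a once-per-second health-check loop for every backend it discovers, but never cancels that loop when the backend is removed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Pool is a hypothetical client-side backend pool.
type Pool struct {
	mu       sync.Mutex
	backends map[string]bool
}

func NewPool() *Pool {
	return &Pool{backends: make(map[string]bool)}
}

func (p *Pool) Add(addr string) {
	p.mu.Lock()
	p.backends[addr] = true
	p.mu.Unlock()

	// BUG: nothing ever stops this goroutine, so the once-a-second
	// health check outlives Remove and keeps hitting the old address.
	go func() {
		for range time.Tick(time.Second) {
			fmt.Println("health check ->", addr)
		}
	}()
}

func (p *Pool) Remove(addr string) {
	p.mu.Lock()
	delete(p.backends, addr) // gone from the pool, but its checker keeps running
	p.mu.Unlock()
}

func main() {
	p := NewPool()
	// Each server-side "deploy" swaps in a new backend address.
	for i := 0; i < 3; i++ {
		old := fmt.Sprintf("calendar-%d:443", i)
		p.Add(old)
		time.Sleep(2 * time.Second)
		p.Remove(old) // the old backend's check loop leaks
	}
	time.Sleep(5 * time.Second) // all three leaked loops are still firing
}
```

Under that assumed shape, every Calendar deploy adds fresh addresses and removes old ones, but a long-lived Gmail process accumulates one leaked checker per removed backend, so the health-check traffic grows with every deploy until the client itself is restarted.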
The most alarming part? This bug was in the shared code that did RPC calls and health checking for all services across Google, and it had been there for quite a long time. In the end though, Gmail almost took Google offline by not deploying. =)
Statistics being what they are, eventually you will have, in the same build, an unfixed bug that requires a restart to fix, and an unfixed bug where things only work until you restart. That is never a fun day.