Comment by AdamJacobMuller

13 hours ago

This is almost definitely an issue of equipment failure.

Cooling in datacenters is, like everything else, both over- and under-provisioned.

It's overprovisioned in the sense that the big heat exchange units are N+1 (or, in very critical and smaller-load facilities, 2N/3N). This is done because you need to regularly take these down for maintenance, and because they have a relatively high failure rate compared to traditional DC components and need mechanical repairs that require specialized labor and long lead times. In a bigger facility it's not uncommon for cooling to be N+3 or more as N grows, because you're effectively always servicing something, or have something down waiting for a blower assembly which literally needs to be made by a machinist with a lathe because that part doesn't exist anymore (and that's still cheaper than replacing the whole unit).

The systems are also under-provisioned in the sense that if all the compute capacity in the facility suddenly went from average power draw to 100% power draw, you would overload the cooling capacity; you would commonly overload things in the electrical and other paths too. Over provisioning is just the nature of the industry.

In general neither of these things poses a real problem because compute loads don't spike to 100% of capacity and when they do spike they don't spike for terribly long and nobody builds facilities on a knife-edge of cooling or power capacity.
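To put rough numbers on "both over and under provisioned", here is a minimal sketch. Every figure in it (unit size, average load, nameplate load, unit count) is a made-up assumption for illustration, not data from any real facility, and `units_required` / `spare_units` are hypothetical helper names.

```python
import math

def units_required(load_kw: float, unit_capacity_kw: float) -> int:
    """Number of cooling units needed to carry a given heat load (the 'N')."""
    return math.ceil(load_kw / unit_capacity_kw)

def spare_units(installed: int, down: int, load_kw: float, unit_capacity_kw: float) -> int:
    """Units left over beyond N; a negative result means the cooling is overloaded."""
    return (installed - down) - units_required(load_kw, unit_capacity_kw)

# Hypothetical facility (all values are assumptions).
UNIT_KW = 250.0        # capacity of one cooling unit
AVERAGE_KW = 1000.0    # typical heat load
NAMEPLATE_KW = 3000.0  # heat load if every server drew 100% power
INSTALLED = 8          # 2000 kW of cooling in total

at_average = spare_units(INSTALLED, 0, AVERAGE_KW, UNIT_KW)      # 4 spare units
one_unit_down = spare_units(INSTALLED, 1, AVERAGE_KW, UNIT_KW)   # 3 spare units
at_nameplate = spare_units(INSTALLED, 0, NAMEPLATE_KW, UNIT_KW)  # -4: overloaded
print(at_average, one_unit_down, at_nameplate)
```

At average load the facility looks comfortably over-provisioned, and against nameplate load the same facility is four units short. Both statements are true at once.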

The problem comes when you have the intersection of multiple events.

You designed your cooling system to handle 200% of average load which is great because you have lots of headroom for maintenance/outages.

Repair guy comes on Tuesday to do work on a unit and finds a bad bearing, has to get it from the next state over so he leaves the unit off overnight to not risk damaging the whole fan assembly (which would take weeks to fabricate).

The two adjacent cooling units are now working JUST A BIT harder to compensate and one of them also had a motor which was just slightly imbalanced or a fuse which was loose and warming up a bit and now with an increased duty cycle that thing which worked fine for years goes pop.

Now you're minus two units in an N+2 facility. Not really terrible, remember you designed for 200% of average load.

That 3rd unit on the other side of the first failed unit, now under way more load, also has a fault. You're now minus 3 in an N+2 facility.

Still, not catastrophic because really you designed for 200% of average load.

The thing is, it's now 4AM, the onsite ops guy can't fix these faults and needs to call the vendor who doesn't wake up till 7AM and won't be onsite till 9.

Your load starts ramping up.

Everything up above happens daily in some datacenter in the USA. It happens in every datacenter probably once a year.

What happens next is the confluence of events which puts you in the news.

One of your bigger customers decides now is a great time to start a huge batch processing job. Some fintech wants to run a huge model before market open or some oil firm wants to do some quick analysis of a new field.

They spin up 10000 new VMs.

Normally, this is fine, you have the spare capacity.

But, remember, you planned for 200% of AVERAGE cooling load, and these are not nodes which are busy but not terribly busy. These are nodes doing intense, optimized number-crunching work, which means they draw max power and thus expel max waste heat.

Not only has your load in terms of aggregate number of machines spiked but their waste heat impact is also greater on average.

Boom, cascading failure, your cooling is now N-4.

Server fans start ramping up faster which consumes more power.

Your cooling is now N-5.

Alarms are blaring all over the place.

Safeties on the cooling units start to trip as they exceed their load and refrigerant pressures rise.

Your cooling is now N-6.

Your cooling is now N-7.

Your cooling is now 0.
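The countdown above can be sketched as a toy simulation. This is a deliberately crude model under stated assumptions: heat is shared equally among running units, each unit has its own hypothetical trip limit (standing in for the loose fuse or imbalanced motor), and the weakest overloaded unit trips first. It ignores server fans drawing extra power, refrigerant pressures, and everything else real. `simulate_cascade` and all the numbers are invented for illustration.

```python
def simulate_cascade(trip_limits_kw, heat_load_kw):
    """Return (trip order, surviving units) for an equal-share cooling model.

    trip_limits_kw[i] is the heat (kW) unit i can carry before its safety trips.
    """
    running = dict(enumerate(trip_limits_kw))
    tripped = []
    while running:
        share = heat_load_kw / len(running)  # load carried by each running unit
        overloaded = [u for u, limit in running.items() if limit < share]
        if not overloaded:
            break  # stable: every running unit is within its limit
        weakest = min(overloaded, key=lambda u: running[u])
        del running[weakest]  # the weakest unit goes pop; the rest pick up its share
        tripped.append(weakest)
    return tripped, sorted(running)

# Six hypothetical units, nominally 250 kW each, with slightly different real limits.
limits = [250, 240, 260, 255, 245, 250]

print(simulate_cascade(limits, 1400))  # ([], [0, 1, 2, 3, 4, 5]): holds
print(simulate_cascade(limits, 1500))  # ([1, 4, 0, 5, 3, 2], []): total collapse
```

In this toy model the difference between "fine" and "cooling is now 0" is about 7% more heat, because once the first unit trips every survivor is pushed further past its own limit.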

This is a great writeup! thank you!!

Reminds me of when I did Noogler training back in the day: one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner, and briefly conducted

  • The cat incident at a facility I worked at.

    It's cold up here in the winter and, sadly, the residual heat from even totally passive components like switchgear is enough to warm things up enough to attract them. .001% of 1MW of power is still quite warm. (I have no idea how much switchgear leaks, but I know they are warm even in winter outdoors.)

    And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.

I'd expect someone like AWS to just throttle machines before overloading their cooling. Because they probably can do that, while e.g. a data center that just rents the space can't really throttle their customers nicely.

  • Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.

    But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is because they reduced the amount of heat being produced.

    • > But they did load-shed

      Right, exactly, I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity due to events, and I'm sure they even sometimes fake no capacity for similar reasons. No capacity means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".

      This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.

      spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.

  • It's harder and harder to throttle machines, with hardware segmentation capabilities effectively passing through hardware components "intact".

    A decade ago it was trivial to just tell the hypervisor to reduce the cpu fraction of all VMs by half and leave half unallocated. Now, it's much more complicated and definitely would be user visible.
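The escalation chain speculated about upthread (spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark) could be sketched as a staged policy like the one below. To be clear, this is a guess at the shape of such a policy, not a description of how AWS or anyone else actually does it; the stage names, the thresholds, and `shed_stage` are all hypothetical.

```python
from enum import IntEnum

class ShedStage(IntEnum):
    NORMAL = 0
    RAISE_SPOT_PRICES = 1
    STOP_NEW_INSTANCES = 2
    STOP_PREEMPTIBLE = 3
    STOP_NORMAL = 4

# Hypothetical thresholds on utilization = heat load / available cooling.
THRESHOLDS = [
    (0.80, ShedStage.RAISE_SPOT_PRICES),
    (0.90, ShedStage.STOP_NEW_INSTANCES),
    (0.97, ShedStage.STOP_PREEMPTIBLE),
    (1.00, ShedStage.STOP_NORMAL),
]

def shed_stage(heat_load_kw: float, available_cooling_kw: float) -> ShedStage:
    """Pick the most drastic stage whose threshold the current utilization has reached."""
    if available_cooling_kw <= 0:
        return ShedStage.STOP_NORMAL  # no cooling left at all
    utilization = heat_load_kw / available_cooling_kw
    stage = ShedStage.NORMAL
    for threshold, candidate in THRESHOLDS:
        if utilization >= threshold:
            stage = candidate
    return stage

print(shed_stage(850, 1000).name)  # RAISE_SPOT_PRICES
print(shed_stage(980, 1000).name)  # STOP_PREEMPTIBLE
```

Each stage is more visible to customers than the last, which fits the point above that only the final step makes the news.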

This is written beautifully. It's like a much more inconsequential variant of Chernobyl.

I would have thought with all the data centers being built the parts for cooling systems would be standardized with replacements available from Grainger immediately.

Shouldn't there be a feedback system here preventing the scheduling of loads when cooling is degraded?

  • With hyperscalers for sure.

    But this is the physical world, shit happens.

    The algorithm didn't know that fuse was loose and fine at 50% duty cycle but was high resistance and going to blow at 100%.
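A feedback check of the kind asked about might look roughly like this sketch. It assumes the scheduler knows how many cooling units report healthy and budgets each new job at its maximum heat output. The function name, the parameters, and the 0.8 derating factor are all hypothetical.

```python
def can_admit(job_max_heat_kw: float,
              current_heat_kw: float,
              healthy_units: int,
              unit_capacity_kw: float,
              derate: float = 0.8) -> bool:
    """Admit a job only if its worst-case heat fits within derated healthy cooling.

    `derate` is a blanket safety margin (an assumption); it is the only defense
    this check has against faults that no sensor is reporting.
    """
    usable_kw = healthy_units * unit_capacity_kw * derate
    return current_heat_kw + job_max_heat_kw <= usable_kw

# Six healthy 250 kW units: 1200 kW usable after derating.
print(can_admit(200.0, 900.0, healthy_units=6, unit_capacity_kw=250.0))  # True
# Two units down for repair: 800 kW usable, so the same job is refused.
print(can_admit(200.0, 900.0, healthy_units=4, unit_capacity_kw=250.0))  # False
```

The limitation is the one described above: a unit with a loose fuse still counts as healthy here right up until it blows, so the check can only be as good as the health data feeding it.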