Comment by dfcowell
5 years ago
All of the components in the supply chain will be rated for greater than max load. However, power generation at grid scale is a delicate balancing act.
I’m not an electrical engineer, so the details here may be fuzzy, but in broad strokes:
Grid operators constantly monitor power consumption across the grid. If more power is being drawn than generated, line frequency drops across the whole grid. Left uncorrected, this leads to brownouts and can damage grid equipment and end-user devices.
The main way to manage this is to bring more generating capacity online to pull the grid frequency back up. This is slow, since spinning up even “fast” generators like natural gas turbines can take several minutes.
Notably, this kind of scenario is the whole reason the Tesla battery in South Australia (the Hornsdale Power Reserve) exists. It can respond to spikes in demand (and consume surplus supply!) much faster than generators can.
The other option is load shedding, where you just disconnect parts of your grid to reduce demand.
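To put rough numbers on the frequency drop described above, here is a toy sketch using the textbook aggregated swing equation. Everything specific in it is an assumption for illustration: the function name, the 50 Hz nominal frequency, the 10,000 MVA system size, the 5 s inertia constant and the 100 MW deficit. It also assumes nothing responds at all (no governors, no batteries, no load shedding), which real grids never allow for long.

```python
def simulate_frequency(deficit_mw, system_mva, inertia_s,
                       nominal_hz=50.0, duration_s=5.0, dt=0.1):
    """Toy model: frequency over time when demand exceeds generation.

    Uses the aggregated swing equation with no corrective response:
        df/dt = -nominal_hz * deficit / (2 * inertia * system rating)
    All parameters are hypothetical illustration values.
    """
    # Rate of change of frequency in Hz per second (negative for a deficit).
    rate = -nominal_hz * deficit_mw / (2.0 * inertia_s * system_mva)
    steps = int(round(duration_s / dt))
    freq = nominal_hz
    trace = [freq]
    for _ in range(steps):
        freq += rate * dt
        trace.append(freq)
    return trace


# A sudden 100 MW shortfall on a 10,000 MVA system with 5 s of inertia.
trace = simulate_frequency(deficit_mw=100.0, system_mva=10_000.0, inertia_s=5.0)
print(f"after 5 s: {trace[-1]:.2f} Hz")  # about 49.75 Hz
```

With these made-up numbers the frequency sags by 0.05 Hz every second, which is why something has to react within seconds rather than the minutes a generator needs.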
Any large consumers (like data center operators) likely work closely with their electricity suppliers to be good citizens, ramping their consumption up and down in a controlled manner so that the supply side (the power generators) has time to adjust as demand changes.
Note that changes in power draw as machines handle different load also change the consumption of the cooling systems and other supporting infrastructure, making the total consumption profile substantially different when coming from a cold start.
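As a rough illustration of what "ramping in a controlled manner" could look like, here is a minimal sketch of a ramp-limited load schedule. The function name, the per-interval step limit and the megawatt figures are all hypothetical; real agreements between a data center and its supplier will have their own terms.

```python
def ramp_schedule(start_mw, target_mw, max_step_mw):
    """Return load setpoints from start to target, changing by at most
    max_step_mw per interval. Hypothetical helper for illustration only."""
    if max_step_mw <= 0:
        raise ValueError("max_step_mw must be positive")
    setpoints = [start_mw]
    current = start_mw
    while current != target_mw:
        delta = target_mw - current
        if abs(delta) <= max_step_mw:
            # Close enough to finish in one step: land exactly on the target.
            current = target_mw
        else:
            # Otherwise move by the maximum allowed step, up or down.
            current += max_step_mw if delta > 0 else -max_step_mw
        setpoints.append(current)
    return setpoints


# Recovering from a 30 MW dip, limited to 8 MW per interval.
print(ramp_schedule(10, 40, 8))  # [10, 18, 26, 34, 40]
```

The same function handles ramping down, so a planned drop in load can be staged the same way as a recovery.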
You're talking about the grid, the OP was talking about datacenter infrastructure -- which one is the weak link?
If a datacenter can't go from idle (but powered on) servers to fully utilized servers without taking down the power grid, then it seems that they'd have software controls in place to prevent this, since failure modes other than a global Facebook outage could cause the same behavior.
Unfortunately the article doesn’t provide enough explicit detail to be 100% sure one way or the other, but my read is that it’s probably the grid.
> Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.
“Electrical systems” is vague and could refer to internal systems, external systems, or both.
That said, if the DC is capable of running under sustained peak load (which we have to assume it is, since that’s its normal state when FB is operational), the grid, as the external factor, seems to me the more likely candidate.
As for controls preventing this kind of failure mode, grid operators do have one: load shedding. They’ll cut your supply until capacity is made available.