Comment by Johnny555
5 years ago
I don't see how suddenly running more traffic is going to trip datacenter breakers. I could see how flipping on power to an entire datacenter's worth of servers could cause a spike in electrical demand that the power infrastructure can't handle, but if suddenly running CPUs at 100% trips breakers, then that power infrastructure seems undersized. This isn't a case where servers were powered off; they were idle because they had no traffic.
Do large providers like Facebook really provision less power than their servers would require at 100% utilization? It seems like they could just use fewer servers, with power sized for 100% load, if their power system is going to constrain utilization anyway.
All of the components in the supply chain will be rated for greater than max load, however power generation at grid scale is a delicate balancing act.
I’m not an electrical engineer, so the details here may be fuzzy, however in broad strokes:
Grid operators constantly monitor power consumption across the grid. If more power is being drawn than generated, line frequency drops across the whole grid. This leads to brownouts and can cause widespread damage to grid equipment and end-user devices.
The main way to manage this is to bring more capacity online to bring the grid frequency back up. This is slow, since spinning up even “fast” generators like natural gas can take on the order of several minutes.
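The frequency mechanics described above can be sketched with a few lines of back-of-the-envelope arithmetic. This is an illustrative toy, not real grid data: every constant here (grid size, inertia, size of the demand step) is an assumption, and the simplified swing equation ignores governor response entirely.

```python
# Illustrative sketch (assumed constants, not real grid data): how fast line
# frequency falls when demand suddenly exceeds generation, per the simplified
# swing equation  df/dt = f0 * delta_P / (2 * H * S).
f0 = 50.0        # nominal grid frequency, Hz (assumed 50 Hz grid)
H = 4.0          # aggregate inertia constant, seconds (assumed)
S = 10_000e6     # total online generating capacity, W (assumed 10 GW region)
delta_P = 500e6  # sudden unmet demand, W (assumed 500 MW step)

# Rate of change of frequency while generation hasn't yet responded.
rocof = f0 * delta_P / (2 * H * S)
print(f"RoCoF: {rocof:.3f} Hz/s")

# Under-frequency load shedding on a 50 Hz grid often starts around 49.0 Hz,
# so with no response at all the grid has only seconds of headroom.
seconds_to_shed = (f0 - 49.0) / rocof
print(f"Time to reach 49.0 Hz with no response: {seconds_to_shed:.1f} s")
```

With these (made-up) numbers the operator has only a few seconds of inertial headroom, which is why reserves that respond in minutes aren't enough on their own and batteries that respond in milliseconds are so valuable.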
Notably, this kind of scenario is the whole reason the Tesla battery in South Australia exists. It can respond to spikes in demand (and consume surplus supply!) much faster than generator capacity can respond.
The other option is load shedding, where you just disconnect parts of your grid to reduce demand.
Any large consumers (like data center operators) likely work closely with their electricity suppliers to be good citizens and ramp up and down their consumption in a controlled manner to give the supply side (the power generators) time to adjust their supply as the demand changes.
Note that changes in power draw as machines handle different load will also change the consumption of the cooling systems etc., making the total consumption profile substantially different when coming from a cold start.
You're talking about the grid, the OP was talking about datacenter infrastructure -- which one is the weak link?
If a datacenter can't go from idle (but powered-on) servers to fully utilized servers without taking down the power grid, then it seems they'd have software controls in place to prevent this, since failure modes other than a global Facebook outage could cause the same behavior.
Unfortunately the article doesn’t provide enough explicit detail to be 100% sure one way or the other, however my read is that it’s probably the grid.
> Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.
“Electrical systems” is vague and could refer to either internal systems, external systems or both.
That said, if the DC is capable of running under sustained load at peak (which we have to assume it is, since that’s its normal state when FB is operational) it seems to me like the externality of the grid is the more likely candidate.
In terms of software controls preventing this kind of failure mode, they do have one: load shedding. The utility will cut your supply until capacity is made available.
The key word is "suddenly".
In the electricity grid, demand and generation must always be precisely matched (otherwise, things burn up). This is done by generators automatically ramping up or down whenever the load changes. But most generators cannot change their output instantly; depending on the type of generator, it can take several minutes or even hours to respond to a large change in the demand.
Now consider that, on modern servers, most of the power consumption comes from the CPU, and there's a significant difference in the amount of power consumed between 100% CPU and idle. Imagine for instance 1000 servers (a single rack can hold 40 servers or more), each consuming 2 kW of power at full load, and suppose they need only half that at idle (it's probably even less than half). Suddenly switching from idle to full load would mean 1 MW of extra power has to be generated. While the generators are catching up to that, the voltage drops, which means the current increases to compensate (unlike incandescent lamps, switching power supplies try to maintain the same output no matter the input voltage), and breakers (which usually are configured to trip on excess current) can trip (without breakers, the wiring would overheat and could start a fire).
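The arithmetic above can be written out, taking the comment's voltage-sag framing at face value. All numbers are the comment's own assumptions or made up for illustration (the 480 V feed, the depth of the sag, the breaker headroom), and the single-phase unity-power-factor arithmetic is deliberately simplified.

```python
# Sketch of the arithmetic above, using the comment's assumed numbers.
servers = 1000
p_full, p_idle = 2000.0, 1000.0   # W per server at full load / idle (assumed)

step_w = servers * (p_full - p_idle)
print(f"Sudden extra demand: {step_w/1e6:.1f} MW")   # 1.0 MW

# Switching power supplies draw roughly constant power, so if the supply
# voltage sags while generation catches up, the current they draw rises.
# (Simplified single-phase arithmetic at unity power factor; feed voltage
# and sag depth are assumptions.)
v_nominal, v_sagged = 480.0, 430.0
i_nominal = servers * p_full / v_nominal
i_sagged = servers * p_full / v_sagged
print(f"Current at nominal voltage: {i_nominal:.0f} A")
print(f"Current during the sag:     {i_sagged:.0f} A "
      f"(+{100 * (i_sagged / i_nominal - 1):.0f}%)")

# A breaker sized with, say, 10% headroom over nominal full-load current
# (assumed) would trip once the sag pushes current past its rating.
breaker_rating = 1.10 * i_nominal
print(f"Breaker trips during sag: {i_sagged > breaker_rating}")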
If the load changes slowly, on the other hand, there's enough time for the governor on the generators to adjust their power source (opening valves to admit more water or steam or fuel), and overcome the inertia of their large spinning mass, before the voltage drops too much.
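The step-versus-ramp argument can be made concrete with a toy simulation. Everything here is an assumption chosen so the effect is visible (a small 100 MW island grid, a fixed generation ramp rate, the simplified swing equation with no frequency restoration), so treat it as a sketch of the mechanism, not a grid model.

```python
# Toy simulation (illustrative constants): the same 1 MW load increase applied
# as an instant step vs. a 60-second ramp, against generation that can only
# ramp at a fixed rate. Frequency follows the simplified swing equation
#   df/dt = -f0 * (load - gen) / (2 * H * S).
f0 = 50.0           # nominal frequency, Hz
H, S = 4.0, 100e6   # inertia constant (s) and online capacity (W); assumed
                    # small 100 MW island grid so a 1 MW step is visible
gen_ramp = 50e3     # generation ramp limit, W per second (assumed)
step = 1e6          # the 1 MW of new load from the example above
dt = 0.1            # simulation time step, s

def min_frequency(ramp_seconds):
    """Lowest frequency reached when the load rises over `ramp_seconds`."""
    f, gen, t, f_min = f0, 0.0, 0.0, f0
    while t < 600.0:
        # Load appears instantly (ramp_seconds == 0) or linearly over the ramp.
        load = step * min(1.0, t / ramp_seconds) if ramp_seconds > 0 else step
        gen = min(gen + gen_ramp * dt, load)   # governor chases the load
        f += -f0 * (load - gen) / (2 * H * S) * dt
        f_min = min(f_min, f)
        t += dt
    return f_min

print(f"Instant step: dips to {min_frequency(0):.2f} Hz")
print(f"60 s ramp:    dips to {min_frequency(60):.2f} Hz")
```

With these numbers the instant step outruns the governor for about 20 seconds and the frequency dips well below nominal, while the slow ramp stays within the governor's ramp rate and the frequency barely moves, which is the point of ramping consumption up in a controlled manner.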
>generated; while the generators are catching up to that, the voltage drops, which means the current increases to compensate...
Close. You won't see an increase in load on a synchronous machine operating at constant throttle manifest as a voltage sag; you'll see it manifest as a decrease in frequency (the generators literally slow down, like a guy on a bike going uphill). Voltage sags are more related to transmission line phenomena.
I get that lots of servers can add up to lots of power, but what is a "lot"? Is 1MW really enough demand to destabilize a regional power grid?
No. All balancing authorities are required to keep a certain amount of "spinning reserve" available for fast adjustments like this. But if I do it, and the next guy does it, and a transmission line is down, and... etc.
A lot of horror stories start that way.
If it's all at once at the end of one leg and unplanned? Yes.
The question is somewhat similar to a thought experiment. If a ship is docked and loading cargo, is it a good idea to use all the cranes to suddenly fill up one outer side of the ship?
I don't know the answer. But it's not too uncommon, in general, to provision for reasonable use cases plus a margin, rather than provision for worst case scenario.