Comment by kube-system

5 years ago

Likely tripping breakers or overload protection on UPSes?

Often PDUs used in a rack can be configured to start servers up in a staggered pattern to avoid a surge in demand for these reasons.

I'd imagine there's more complications when you're doing an entire DC vs just a single rack, though.

I don't see how suddenly running more traffic would trip datacenter breakers. I can see how powering on an entire datacenter's worth of servers could cause a spike in electrical demand that the power infrastructure can't handle, but if suddenly running CPUs at 100% trips breakers, then that power infrastructure seems undersized. This isn't a case where servers were powered off; they were idle because they had no traffic.

Do large providers like Facebook really provision less power than their servers would require at 100% utilization? It seems like they could just use fewer servers, with power sized for 100% utilization, if their power system is going to constrain utilization anyway.

  • All of the components in the supply chain will be rated for greater than max load, however power generation at grid scale is a delicate balancing act.

    I’m not an electrical engineer, so the details here may be fuzzy, however in broad strokes:

    Grid operators constantly monitor power consumption across the grid. If more power is being drawn than generated, line frequency drops across the whole grid. This leads to brownouts and can cause widespread damage to grid equipment and end-user devices.

    The main way to manage this is to bring more generation capacity online, which raises the grid frequency back up. This is slow: spinning up even “fast” generators like natural gas turbines can take on the order of several minutes.

    Notably, this kind of scenario is the whole reason the Tesla battery in South Australia exists. It can respond to spikes in demand (and consume surplus supply!) much faster than generator capacity can respond.

    The other option is load shedding, where you just disconnect parts of your grid to reduce demand.

    Any large consumers (like data center operators) likely work closely with their electricity suppliers to be good citizens and ramp up and down their consumption in a controlled manner to give the supply side (the power generators) time to adjust their supply as the demand changes.

    Note that changes in power draw as machines handle different load will also change the consumption of the cooling systems and so on, making the total consumption profile from a cold start substantially different.

    • You're talking about the grid, the OP was talking about datacenter infrastructure -- which one is the weak link?

      If a datacenter can't go from idle (but powered on) servers to fully utilized servers without taking down the power grid, then it seems that they'd have software controls in place to prevent this, since there are other failure modes that could cause this behavior other than a global Facebook outage.

  • The key word is "suddenly".

    In the electricity grid, demand and generation must always be precisely matched (otherwise, things burn up). This is done by generators automatically ramping up or down whenever the load changes. But most generators cannot change their output instantly; depending on the type of generator, it can take several minutes or even hours to respond to a large change in the demand.

    Now consider that, on modern servers, most of the power consumption comes from the CPU, and there's a significant difference in the amount of power consumed between 100% CPU and idle. Imagine, for instance, 1000 servers (a single rack can hold 40 servers or more), each consuming 2 kW at full load, and suppose they need only half that at idle (it's probably even less than half). Suddenly switching from idle to full load means 1 MW of extra power has to be generated. While the generators are catching up, the voltage drops, which means the current increases to compensate (unlike incandescent lamps, switching power supplies try to maintain the same output no matter the input voltage), and breakers (which usually trip on excess current) can trip. Without breakers, the wiring would overheat and could start a fire.

    If the load changes slowly, on the other hand, there's enough time for the governor on the generators to adjust their power source (opening valves to admit more water or steam or fuel), and overcome the inertia of their large spinning mass, before the voltage drops too much.

    • >generated; while the generators are catching up to that, the voltage drops, which means the current increases to compensate...

      Close, but not quite: an increase in load on a synchronous machine operating at constant throttle won't manifest as a voltage sag; it manifests as a decrease in frequency (the generators literally slow down, like a cyclist going uphill). Voltage sags are more of a transmission-line phenomenon.

  • I don't know the answer. But it's not too uncommon, in general, to provision for reasonable use cases plus a margin, rather than provision for worst case scenario.
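The back-of-the-envelope numbers from the thread (1000 servers stepping from half-power idle to full load, and a power supply drawing more current when input voltage sags) can be sketched as follows. The specific voltage figures are illustrative assumptions, not measurements:

```python
# Sketch of the load step described above.
# Assumed numbers from the comment: 1000 servers, 2 kW each at full
# load, roughly half that at idle.
servers = 1000
full_load_w = 2000.0
idle_fraction = 0.5

idle_total_w = servers * full_load_w * idle_fraction   # 1.0 MW
full_total_w = servers * full_load_w                   # 2.0 MW
step_w = full_total_w - idle_total_w                   # 1.0 MW step, all at once

# A switching power supply holds its output roughly constant, so if the
# input voltage sags (for whatever reason), input current rises.
# The 10% sag here is a hypothetical figure for illustration.
nominal_v = 230.0
sagged_v = 207.0
current_nominal = full_total_w / nominal_v
current_sagged = full_total_w / sagged_v  # ~11% more current, closer to a breaker trip

print(f"load step: {step_w / 1e6:.1f} MW")
print(f"current at nominal: {current_nominal:.0f} A, during sag: {current_sagged:.0f} A")
```

The point is just that the step is large relative to the steady-state draw, and that constant-power loads respond to a sag by pulling more current, which is exactly the quantity breakers trip on.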

Disk arrays have been staggering drive spin-up for a long time for this reason. Sinking inrush current into hundreds of little starting motors simultaneously is a bad idea.
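The staggering idea mentioned for PDUs and disk arrays amounts to starting devices in small groups with a pause in between. A minimal sketch, assuming a hypothetical `power_on()` method on each device object:

```python
import time

def staggered_start(devices, group_size=4, delay_s=0.5):
    """Power on devices in small groups, pausing between groups so
    inrush currents don't all stack at the same instant.

    `devices` is any sequence of objects with a power_on() method
    (hypothetical interface for illustration)."""
    for i in range(0, len(devices), group_size):
        for dev in devices[i:i + group_size]:
            dev.power_on()
        time.sleep(delay_s)  # let inrush settle before the next group
```

Real PDUs and RAID controllers implement the same idea in firmware (often as a configurable per-outlet or per-drive delay), but the control flow is essentially this loop.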