Comment by dmoy

5 years ago

Interesting bit on recovery w.r.t. the electrical grid

> flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems ...

I wish there were a bit more detail here. What's the worst case there? Brownouts, exploding transformers? Or something less catastrophic?

Brownouts are probably the most proximate concern - a sudden increase in demand will pull down the system frequency in the vicinity, and if there aren't generating units close enough, or with enough dispatchable capacity, there's a small chance a protective breaker would trip.

A person I know on the power grid side said that at one data center the load stepped down when FB went down and stepped back up when it came back, by about 20% of the load behind the distribution transformer. That quantity is about as much as an aluminum smelter switching on or off.

  • > That quantity is about as much as an aluminum smelter switching on or off.

    Interestingly, the mountains east of Portland OR, where all the aluminum smelters used to be, are now full of FAANG datacenters relying on the power infrastructure (and pricing) the aluminum industry used to use...

    https://www.oregonlive.com/silicon-forest/2015/10/small-town...

    And Washington state too:

    https://www.bizjournals.com/seattle/blog/techflash/2015/11/p...

    • That's pretty interesting; I'm sure those aluminium smelters would need to be careful about turning the power back on as well.

      Tangentially related, aluminium production in the Netherlands may shut down soon; because of a sudden spike in gas prices (due to mismanagement), electricity prices have also gone up, making aluminium production no longer cost-effective. We're talking €2,400 in electricity to produce a ton of aluminium worth €2,500.

      I wouldn't be surprised if the big datacenters here try to offload some of their workloads to datacenters elsewhere with lower energy costs. Mind you, I'm pretty sure these datacenters negotiate long-term deals on electricity prices.

  • But don't their datacenters all have backup generators? So worst case in a brownout, they fail over to generator power, then can start to flip back to utility power slowly.

    Or do they forgo backup generators and count on shifting traffic to a new datacenter if there's a regional power outage?

    • Edit to be less snarky:

      I assume they do have backup generators, though I don’t know.

      However, if the sudden increase put that much load on the grid, it could drop the frequency enough to black out the entire neighborhood. That would be bad even if FB were able to keep running through it.

    • For outages the generators are great, but I'm not sure how they assist with brownouts unless they can start instantly or are constantly running to provide a buffer.

      Short term they'd help, but an instantaneous or unexpected surge in traffic/CPU usage/users might spike too fast for the generators to start and kick in properly. Also, it's probably better to bring infra back online in waves to limit spikes than to have those big generators start and stop over and over.


If your system is pulling 500 watts at 120V, that's around 4A of line current. If the line sags to 100V, the output will happily still pull its regulated voltage, but now the line components are seeing ~20% more current, at 5A. For brownout, you need to overrate your components, and/or shut everything off if the line voltage goes too low.
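
A minimal sketch of that constant-power behaviour, using the example figures above and assuming an ideal switching supply:

    # Constant-power load during a brownout: a switching supply keeps delivering
    # the same output power, so as the line voltage sags, the line current rises
    # (I = P / V). Numbers are the illustrative ones from the comment above.

    load_watts = 500.0       # power the supply keeps delivering
    nominal_volts = 120.0    # normal line voltage
    brownout_volts = 100.0   # sagged line voltage

    i_nominal = load_watts / nominal_volts    # ~4.17 A
    i_brownout = load_watts / brownout_volts  # 5.00 A

    print(f"nominal current:  {i_nominal:.2f} A")
    print(f"brownout current: {i_brownout:.2f} A")
    print(f"increase: {100 * (i_brownout / i_nominal - 1):.0f}%")  # ~20%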

I used to do electrical compliance testing in a previous life, and brown-out testing was one of our safety tests. You would drape a piece of cheese cloth over the power supply and slowly ramp the line voltage down. At the time, the power supplies didn't have good line-side voltage monitoring. There was almost always smoke, and sometimes cheese cloth fires. Since this was safety testing, pass/fail was mostly based on whether the cheese cloth caught fire, not whether the power supply was damaged.

  • "output will happily still pull its regulated voltage" you mean power, right?

    • All standard computer components require a regulated voltage; they then consume power as a consequence of their operation. The steady voltage is required because the transistors in ICs will break down if voltages go too high, or stop operating if they go too low. Forcing something like an IC to always use the same amount of power, even when idle, isn't really possible, because nobody would build it that way.

I’m very close with someone who works at a FB data center and was discussing this exact issue.

I can only speak to one problem I know of (and am rather sure I can share): a spike might trip a bunch of breakers at the data center.

BUT, unlike me at home, FB's policy is to never flip a circuit back on until you're positive of the root cause of said trip.

By itself that could compound issues and delay ramp-up time, as they'd work to be sure no electrical components actually shorted/blew/etc. A potentially time-sucking task given these buildings could be measured in whole units of football fields.

Likely tripping breakers or overload protection on UPSes?

Often PDUs used in a rack can be configured to start servers up in a staggered pattern to avoid a surge in demand for these reasons.

I'd imagine there's more complications when you're doing an entire DC vs just a single rack, though.
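
A rough sketch of the staggered-start idea (not any particular PDU's firmware; the outlet count and delay are made-up numbers):

    # Hypothetical staggered power-on schedule for a rack PDU: each outlet is
    # switched on a few seconds after the previous one, so inrush and boot-time
    # load don't all hit the feed at the same instant.

    OUTLETS = 24             # outlets on the PDU (made-up)
    STAGGER_SECONDS = 2.0    # delay between consecutive outlets (made-up)

    def power_on_schedule(outlets, stagger_s):
        """Return (outlet, delay_in_seconds) pairs for a staggered start."""
        return [(n, n * stagger_s) for n in range(outlets)]

    for outlet, delay in power_on_schedule(OUTLETS, STAGGER_SECONDS):
        print(f"outlet {outlet:2d}: switch on at t+{delay:.0f}s")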

  • I don't see how suddenly running more traffic is going to trip datacenter breakers -- I could see how flipping on power to an entire datacenter's worth of servers could cause a spike in electrical demand that the power infrastructure can't handle, but if suddenly running CPUs at 100% trips breakers, then it seems like that power infrastructure is undersized? This isn't a case where servers were powered off; they were idle because they had no traffic.

    Do large providers like Facebook really provision less power than their servers would require at 100% utilization? Seems like they could just use fewer servers, with power sized for 100%, if their power system is going to constrain utilization anyway?

    • All of the components in the supply chain will be rated for greater than max load, however power generation at grid scale is a delicate balancing act.

      I’m not an electrical engineer, so the details here may be fuzzy, however in broad strokes:

      Grid operators constantly monitor power consumption across the grid. If more power is being drawn than generated, line frequency drops across the whole grid. This leads to brownouts and can cause widespread damage to grid equipment and end-user devices.

      The main way to manage this is to bring more capacity online to bring the grid frequency back up. This is slow, since spinning up even “fast” generators like natural gas can take on the order of several minutes.

      Notably, this kind of scenario is the whole reason the Tesla battery in South Australia exists. It can respond to spikes in demand (and consume surplus supply!) much faster than generator capacity can respond.

      The other option is load shedding, where you just disconnect parts of your grid to reduce demand.

      Any large consumers (like data center operators) likely work closely with their electricity suppliers to be good citizens and ramp up and down their consumption in a controlled manner to give the supply side (the power generators) time to adjust their supply as the demand changes.

      Note that changes to power draw as machines handle different load will also result in changes to consumption in the cooling systems etc., making the total consumption profile substantially different when coming from a cold start.

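      A toy illustration of why the speed of the change matters: the aggregate inertia, system size and load step below are made-up numbers, and governor response and damping are ignored.

          # Toy swing-equation estimate of how fast frequency falls when load
          # suddenly exceeds generation, before governors respond. All numbers
          # are illustrative, not real grid figures.

          f0 = 60.0              # nominal frequency, Hz
          system_mva = 5000.0    # aggregate rated capacity of the area, MVA (made-up)
          inertia_h = 4.0        # aggregate inertia constant, seconds (made-up)
          load_step_mw = 50.0    # sudden extra load, MW (made-up)

          # Initial rate of change of frequency: f0 * dP / (2 * H * S)
          rocof_hz_per_s = f0 * load_step_mw / (2 * inertia_h * system_mva)
          print(f"initial decline: {rocof_hz_per_s:.3f} Hz/s")

          # Time for a 0.5 Hz sag if nothing ramps up in response:
          print(f"~0.5 Hz sag after roughly {0.5 / rocof_hz_per_s:.0f} s")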

    • The key word is "suddenly".

      In the electricity grid, demand and generation must always be precisely matched (otherwise, things burn up). This is done by generators automatically ramping up or down whenever the load changes. But most generators cannot change their output instantly; depending on the type of generator, it can take several minutes or even hours to respond to a large change in the demand.

      Now consider that, on modern servers, most of the power consumption is from the CPU, and there's a significant difference in the amount of power consumed between 100% CPU and idle. Imagine, for instance, 1000 servers (a single rack can hold 40 servers or more), each consuming 2kW of power at full load, and suppose they need only half that at idle (it's probably even less than half). Suddenly switching from idle to full load would mean 1MW of extra power has to be generated. While the generators are catching up to that, the voltage drops, which means the current increases to compensate (unlike incandescent lamps, switching power supplies try to maintain the same output no matter the input voltage), and breakers (which usually trip on excess current) can trip. Without breakers, the wiring would overheat and start a fire.

      If the load changes slowly, on the other hand, there's enough time for the governor on the generators to adjust their power source (opening valves to admit more water or steam or fuel), and overcome the inertia of their large spinning mass, before the voltage drops too much.

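      The back-of-the-envelope arithmetic from that example, spelled out (server count and wattages are the illustrative figures above, not real fleet numbers):

          # Worked version of the idle-to-full-load step described above.
          servers = 1000
          full_load_w = 2000.0   # per-server draw at 100% CPU (illustrative)
          idle_w = 1000.0        # per-server draw at idle (illustrative; likely lower)

          step_mw = servers * (full_load_w - idle_w) / 1e6
          print(f"sudden extra demand: {step_mw:.1f} MW")  # 1.0 MW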

    • I don't know the answer. But it's not too uncommon, in general, to provision for reasonable use cases plus a margin, rather than for the worst-case scenario.

  • Disk arrays have been staggering drive startup for a long time for this reason. Sinking current into hundreds of little starting motors simultaneously is a bad idea.

Think Thundering Herd problem on the scale you already know from context. Partial service is a kind of backpressure.
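
A common software-side mitigation for a thundering herd of reconnecting clients is exponential backoff with jitter; a minimal sketch follows (the retry parameters are arbitrary, not anything FB is known to use):

    import random

    # Exponential backoff with full jitter: each client waits a random amount
    # of time within an exponentially growing window before retrying, so
    # reconnecting clients don't all hit the service at the same instant.

    BASE_S = 1.0    # first retry window, seconds (arbitrary)
    CAP_S = 300.0   # maximum window, seconds (arbitrary)

    def retry_delay(attempt):
        """Delay before retry number `attempt` (0-based)."""
        window = min(CAP_S, BASE_S * (2 ** attempt))
        return random.uniform(0.0, window)

    for attempt in range(6):
        print(f"attempt {attempt}: wait {retry_delay(attempt):.1f}s")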

One case is automated protection systems in the grid detecting a sudden jump in current and assuming an insulation failure along the path - basically, not enough current to trip the short-circuit breakers, but enough to raise an alarm.

  • This isn't really a thing. Transmission and distribution protection doesn't operate on any kind of di/dt basis, other than those defined by overcurrent, in which case the line trips. A sudden increase in load will just manifest in ACE (area control error) as a load imbalance and be dealt with by increasing generation from the spinning reserve that the balancing authority is required to have on hand.
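
    For reference, a minimal sketch of the textbook ACE calculation, following the usual convention ACE = (NIactual - NIscheduled) - 10B(Factual - Fscheduled); all of the numbers below are made up:

        # Toy area-control-error (ACE) calculation: a sudden load increase inside
        # the balancing area shows up as under-frequency plus unscheduled imports,
        # giving a negative ACE that is covered from spinning reserve.

        bias_mw_per_0_1hz = -50.0         # frequency bias B, MW per 0.1 Hz (made-up)
        scheduled_interchange_mw = 200.0  # scheduled net interchange (made-up)
        actual_interchange_mw = 180.0     # actual net interchange after the step (made-up)
        scheduled_freq_hz = 60.0
        actual_freq_hz = 59.98

        ace_mw = (actual_interchange_mw - scheduled_interchange_mw) \
            - 10 * bias_mw_per_0_1hz * (actual_freq_hz - scheduled_freq_hz)
        print(f"ACE = {ace_mw:.1f} MW")  # negative => area needs to raise generation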