Comment by fabian2k
16 hours ago
I'd expect someone like AWS to just throttle machines before overloading their cooling. They probably can do that, whereas e.g. a data center that just rents out space can't really throttle its customers nicely.
Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.
But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is that they reduced the amount of heat being produced.
> But they did load-shed
Right, exactly. I highly doubt the facility went into any kind of actual uncontrolled thermal rise. I'm sure it's common for them to force spot prices up (probably way up) to compensate for reduced capacity due to events, and I'm sure they even sometimes fake "no capacity" for similar reasons. No capacity means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".
This is news because they had to take such drastic action: they powered off some non-preemptible customer loads. That makes me wonder whether this chain of events occurred here:
spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
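To make that ladder concrete, here is a rough sketch of how an operator might map "how much heat do I need to shed" onto those steps. Everything in it is an assumption for illustration: the `Stage` names, the `stage_for` function, and especially the thresholds are made up, and nothing here reflects how AWS actually decides.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Escalation steps from the chain above, ordered least to most drastic."""
    NORMAL = 0
    RAISE_SPOT_PRICE = 1
    STOP_NEW_LAUNCHES = 2
    STOP_PREEMPTIBLE = 3
    STOP_ON_DEMAND = 4


# Hypothetical thresholds: fraction of current heat load that must be shed.
# Ordered from most to least drastic so the first match wins.
LADDER = [
    (0.30, Stage.STOP_ON_DEMAND),
    (0.15, Stage.STOP_PREEMPTIBLE),
    (0.05, Stage.STOP_NEW_LAUNCHES),
    (0.00, Stage.RAISE_SPOT_PRICE),
]


def stage_for(shed_fraction: float) -> Stage:
    """Return the most drastic stage whose threshold is exceeded."""
    if shed_fraction <= 0:
        return Stage.NORMAL
    for threshold, stage in LADDER:
        if shed_fraction > threshold:
            return stage
    return Stage.NORMAL


if __name__ == "__main__":
    for needed in (0.0, 0.02, 0.10, 0.20, 0.50):
        print(f"shed {needed:.0%} -> {stage_for(needed).name}")
```

The point of ordering it this way is that each step only affects customers who were already promised less: price-sensitive spot users first, then anyone asking for new capacity, then preemptible instances, and normal instances only as a last resort.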
It's harder and harder to throttle machines now that hardware segmentation capabilities effectively pass hardware components through "intact".
A decade ago it was trivial to just tell the hypervisor to reduce the CPU fraction of all VMs by half and leave half unallocated. Now, it's much more complicated and definitely would be user-visible.
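For a sense of what the "trivial" version looks like, here is a minimal sketch assuming a Linux host where each VM runs as a process in its own cgroup v2 group, so its CPU share can be capped through the group's `cpu.max` file (format: `<quota_us> <period_us>`). The helper name `capped_cpu_max` is made up, and this is not a claim about how any cloud provider's hypervisor works.

```python
def capped_cpu_max(vcpus: int, fraction: float = 0.5, period_us: int = 100_000) -> str:
    """Build a cgroup v2 cpu.max line limiting a VM to `fraction` of its vCPUs.

    With the default 100 ms period, a 4-vCPU VM capped at half gets
    200 ms of CPU time per period, i.e. the equivalent of 2 full cores.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    quota_us = int(vcpus * period_us * fraction)
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    # The string an operator would write into the VM's cpu.max file.
    print(capped_cpu_max(4))
```

That only works when the host scheduler sits between the guest and the cores. Once cores and devices are handed to the guest more or less directly, there is no equivalent knob that the guest wouldn't notice.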