Comment by bob1029

12 hours ago

They've always been terrible at VM ops. I never get weird quota limits and errors anywhere else. It's almost as if Amazon wants me to be a customer and Microsoft does not.

Amazon isn't much better there. Wait until you hit an EC2 quota limit and can't get anyone to look at it quickly (even under paid enterprise support), or they simply say no.

Also had a few instance types that wouldn't spin up in some regions/AZs recently. I assume this is a capacity issue.
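
Those two failure modes look different at the API level, and it's worth telling them apart: a capacity shortfall is per-AZ and often retryable elsewhere, while a quota error is a hard wall until support acts. A minimal sketch of handling that with boto3; the AMI, subnet IDs, and region here are purely hypothetical:

```python
# Sketch: retry EC2 launches across AZs when one runs out of capacity.
# Assumes boto3 credentials are configured; all IDs below are made up.
import boto3
import botocore.exceptions

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical subnets, one per AZ, in fallback order.
SUBNETS_BY_AZ = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]

def launch_with_fallback(instance_type="m5.2xlarge"):
    for subnet_id in SUBNETS_BY_AZ:
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",  # hypothetical AMI
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
                SubnetId=subnet_id,
            )
            return resp["Instances"][0]["InstanceId"]
        except botocore.exceptions.ClientError as err:
            code = err.response["Error"]["Code"]
            if code == "InsufficientInstanceCapacity":
                # This AZ has no hardware for this type right now;
                # try the next AZ rather than failing outright.
                continue
            # Quota errors (e.g. VcpuLimitExceeded) won't be fixed by
            # retrying anywhere; only a limit-increase request will.
            raise
    raise RuntimeError(f"no capacity for {instance_type} in any configured AZ")
```

The same fallback pattern extends to trying a sibling instance type when a whole family is scarce in a region, which matches the "pick a different node type" experience below.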

  • The cloud isn’t some infinite thing.

    There’s a bunch of hardware, and they can’t run more servers than they have hardware. I don’t see a way around that.

    • I was surprised to hit one of these limits once, but it wasn't as if they were 100% out of servers; I just had to pick a different node type. I don't think they would ever post their numbers, but some of the more exotic types definitely have smaller pools.

    • Really prefer Hetzner in this sense, because they actually talk about limits. I recently got myself a Hetzner account (after shilling it for so long and hearing so much positivity, I felt it was time to finally try it myself).

      I wanted to try the cheapest option out of frugality, and that one was actually limited (kudos to them for saying up front that those servers have limits), so no worries, I went and picked the 5.99 euro option instead of the 3.99 euro one.

      They also have a limits page in the settings, IIRC, which transparently shows all the limits imposed on your account. My account is young, so I can't request limit increases yet, but after some time one definitely can.

      Essentially I love this approach, because the cloud is just someone else's hardware and there is no infinite supply. But I feel it can come pretty close with Hetzner (and I've heard great things about OVH, and had a good personal experience with a netcup VPS, though netcup's payments were a real PITA to set up).

Agreed... I've been waiting for months now to increase my quota for a specific Azure VM type by 20 cores. I get an email every two weeks saying my request is still backlogged because they don't have the physical hardware available. I haven't seen an issue like this with AWS before...

  • We've run into that issue as well and ended up having to move regions entirely, because nothing was changing in the current one. I believe it was westus1 at the time. It's a ton of fun to migrate everything over!

    That was years ago; wild to see they still have the same issues.

It's awful. Any other Azure service that depends on those core systems seems to inherit the same issues; I feel for the internal teams.

Ran into an issue upgrading an AKS cluster last week. It completely stalled and broke the entire cluster in a way that left our hands tied, since we can't see the control plane at all...

I submitted a severity A ticket, and 5 hours later I was told there was a known issue with the latest VM image that would break the control plane, leaving any cluster updated in that window to essentially kill itself and require manual intervention. Did they notify anyone? Nope. Did they stop anyone from killing their own clusters? Nope.

It seems like every time I'm forced to touch the Azure environment I'm basically playing Russian roulette hoping that something's not broken on the backend.

  • It's nice to buy responsibility when it's upheld; otherwise you're just trading your money for the inability to fix things.

How is Azure still having faults that affect multiple regions? Clearly their region definition is bollocks.

  • All 3 hyperscalers have vulnerabilities in their control planes: each is either a single point of failure (like AWS with us-east-1) or global, meaning a faulty release can take it down entirely. And AZ resilience should be read as: existing compute will continue to work as before, but allocation of new resources might fail in multi-AZ or multi-region ways.

    It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure. (See the sketch at the end of the thread for what that looks like in practice.)

    • > It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.

      That sounds oddly similar to owning hardware.

    • This outage talks about what appears to be a VM control plane failure (it mentions stop not working) across multiple regions.

      AWS has never had this type of outage in 20 years. Yet Azure constantly has them.

      This is a total failure of engineering and has nothing to do with capacity. Azure is a joke of a cloud.

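To make the static-allocation point above concrete, here is a minimal sketch, assuming boto3; the group name and sizing are purely hypothetical. Pinning min = desired = max with headroom baked in means staying up never depends on the control plane granting new capacity mid-incident:

```python
# Sketch: pin an Auto Scaling group to a fixed size with slack built in,
# so availability never depends on a scale-out call succeeding.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

BASELINE = 10        # instances needed for normal peak load (made up)
SLACK_FACTOR = 1.5   # static headroom instead of relying on auto scaling
STATIC_SIZE = int(BASELINE * SLACK_FACTOR)

# With min == desired == max, the group never requests or releases
# compute in response to load; a control plane outage can block the
# replacement of failed nodes, but not day-to-day operation.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-static-service",  # hypothetical ASG name
    MinSize=STATIC_SIZE,
    MaxSize=STATIC_SIZE,
    DesiredCapacity=STATIC_SIZE,
)
```

The trade-off is paying for that idle slack around the clock, which, as noted above, starts to look a lot like owning hardware.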