
Comment by cmiles8

12 hours ago

AWS's us-east-1 continues to be the Achilles' heel of the Internet.

And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of incidents where us-east-1 problems had broader impact, which makes things far less redundant and resilient than AWS implies.

The idea that AWS's services are fully regionalized or isolated has always been a myth.

All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.

And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
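
To make that last point concrete, here's a rough client-side sketch (the cached credentials are placeholders and the fallback logic is my own illustration, not an AWS-documented pattern): hold on to a working session, and only fall back to it when STS can't mint fresh credentials.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical credentials cached from before the outage (placeholders).
CACHED = {
    "aws_access_key_id": "ASIA...",
    "aws_secret_access_key": "...",
    "aws_session_token": "...",
}

def get_session():
    """Prefer fresh STS credentials; fall back to the cached session
    if the (us-east-1-backed) token-granting path is down."""
    try:
        creds = boto3.client("sts").get_session_token()["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    except (ClientError, EndpointConnectionError):
        # Existing tokens often keep validating in other regions even
        # when new ones can't be granted.
        return boto3.Session(**CACHED)

# Regional data-plane calls can keep working on the old session.
s3 = get_session().client("s3", region_name="us-west-2")
```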

  • Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.

    But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you by things like Terraform) suddenly lose their luster.

  • Services outside of us-east-1 don't call us-east-1 for the IAM data plane though, right?

    • They're talking about the backbone and what goes on behind the scenes. There have been problems with services in other regions when us-east-1 has issues.

      Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.

  • Isn't this kind of circular dependency what led to extended downtime a while back?

    • It reminds me of Facebook. Staff were locked out of the office by the very outage they were supposed to fix.

    • It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.


    • When you have a circular dependency, one strategy is to make it circular but interruptible for 18 or so hours. Call it an oh-shit bar (there's a sketch of the idea at the end of this thread).

      I'm glad I never had to get that deep into the failure chain.

  • > And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

    When you dogfood your own Rube Goldberg machine.

    • We should let the IAM service team know of this glaring gap the HN thread figured out /s

      I'm 99% ;) certain the dependencies of foundational services are a well-discussed topic
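
Re the "circular but interruptible" idea upthread, here's a minimal sketch of what that could look like, assuming a plain in-process cache (the class, the fetch hook, and the exact grace window are all illustrative; the 18 hours comes from the comment above):

```python
import time

GRACE_SECONDS = 18 * 3600  # the "oh-shit bar": coast on stale data this long

class InterruptibleDependency:
    """Soften an A -> B -> A loop: keep calling the upstream normally,
    but serve the last good result for up to GRACE_SECONDS if it fails."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # call into the circular upstream dependency
        self._cached = None
        self._fetched_at = 0.0

    def get(self):
        try:
            self._cached = self._fetch()     # normal path: refresh the cache
            self._fetched_at = time.time()
        except Exception:
            stale_for = time.time() - self._fetched_at
            if self._cached is None or stale_for > GRACE_SECONDS:
                raise  # bar exhausted: now it's a real outage
        return self._cached
```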

People say this, but this was just a single AZ. In the last 3 years of running my startup mostly out of us-east-1, we've only had one regional outage, and even that was partial, with most instances unaffected.

And honestly, everybody else's stuff is in us-east-1, so at least your failures are correlated with your customers' failures lol.

  • > And honestly, everybody else's stuff is in us-east-1

    Yeah, but why put your eggs in that basket? I moved all our services from east to west (Oregon) a decade ago and haven't looked back.

    • Not OP, but I do single-region us-east-1 for a few reasons:

      1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.

      2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect into us-east-1 and then take a latency hit and pay cross-region data-transfer costs on all traffic just to hop over to another region. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.

      3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.

      4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.

    • 90% of customers are located in us-east-1. Latency to us-east-1 is more important than being up when everyone else is down.

  • None of my stuff is in us-east-1. I chose that specifically 15 years ago. Been a great decision.

Too many people are using it.

In fantasy magic dream land, loads are distributed evenly across different cloud providers.

A single point of failure doesn't exist.

It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.

Healthcare in the US is affordable.

All types of magical stuff exist here.

But no. It's another day. AWS us-east-1 can take down most of the internet.

  • Core AWS services use it too. Even if you are hosted in another region, you can still be affected by a us-east-1 outage.

  • > It worked out with my first girlfriend. The twins are fluent in English and Korean.

    You were dating twins as a form of redundancy?!

I've always been impressed by Amazon's ability to present the shittiest experience possible and imply the blame lies with things like isolation that they don't really provide.

Anecdotally (well, more "second-hand, I heard that..."), it sounds like there were some knock-on effects on us-east-2 as a result of people migrating over from us-east-1. So yeah... kinda hilarious how the multi-region/AZ thing is so plainly a façade, yet we all seem to collectively believe in it as an article of faith in the Cloud Religion... or whatever...

  • It's not magic: given the size of us-east-1, there is no spare capacity elsewhere to absorb all of its workloads.

    • One of the SRE tricks is to reserve your capacity, so when the cloud runs out of capacity you're still covered. It's expensive, but you don't want to get stuck without a server when on-demand capacity dries up.
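
For EC2 specifically, that trick maps onto On-Demand Capacity Reservations. A rough boto3 sketch (the instance type, AZ, and count here are made-up examples):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Hold capacity ahead of time so instances can still launch when the
# region's on-demand pools are exhausted during a failover stampede.
resp = ec2.create_capacity_reservation(
    InstanceType="m5.xlarge",       # hypothetical
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-2a",  # hypothetical
    InstanceCount=10,               # hypothetical
    EndDateType="unlimited",        # hold until explicitly cancelled
)
print(resp["CapacityReservation"]["CapacityReservationId"])
```

Reserved capacity bills at on-demand rates whether or not you're using it, which is the expensive part.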

Is it really failing more, or do we just not hear about failures happening elsewhere?

The last Azure outage I heard of wasn't even on the HN front page.

  • It really is failing more, and it’s well known amongst industry experts. It’s the oldest, largest, and most utilized region of AWS.

    I've heard people say that the underlying physical infrastructure is older, but I think that's a bit of speculation, although reasonable. The current outage is attributed to a "thermal event", which does indeed suggest an underlying physical-hardware cause.

    It's also the most complex region for AWS themselves, as it's the "control plane" for many of their global services.