
Comment by cmiles8

12 hours ago

AWS's us-east-1 continues to be the Achilles' heel of the Internet.

And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of incidents where us-east-1 problems had broader impact, which makes things far less redundant and resilient than AWS implies.

The idea that AWS's services are fully regionalized or isolated has always been a myth.

All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.

And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
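
To make that last point concrete, here's a rough client-side sketch (the cached credentials are placeholders and the fallback logic is my own illustration, not an AWS-documented pattern): hold on to a working session, and only fall back to it when STS can't mint fresh credentials.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical credentials cached from before the outage (placeholders).
CACHED = {
    "aws_access_key_id": "ASIA...",
    "aws_secret_access_key": "...",
    "aws_session_token": "...",
}

def get_session():
    """Prefer fresh STS credentials; fall back to the cached session
    if the (us-east-1-backed) token-granting path is down."""
    try:
        creds = boto3.client("sts").get_session_token()["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
    except (ClientError, EndpointConnectionError):
        # Existing tokens often keep validating in other regions even
        # when new ones can't be granted.
        return boto3.Session(**CACHED)

# Regional data-plane calls can keep working on the old session.
s3 = get_session().client("s3", region_name="us-west-2")
```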

  • Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.

    But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you by things like Terraform) suddenly lose their luster.

  • Services outside of us-east-1 don't call us-east-1 for the IAM data plane though, right?

    • They're talking about the backbone and what goes on behind the scenes. There have been problems with services in other regions when us-east-1 has issues.

      Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.

  • Isn't this kind of circular dependency what led to extended downtime a while back?

    • It reminds me of Facebook. Staff were locked out of the office by the very outage they were supposed to fix.

    • It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.


    • When you have a circular dependency, one strategy is to make it circular but interruptible for 18 or so hours. Call it an oh-shit bar (there's a sketch of the idea at the end of this thread).

      I'm glad I never had to get that deep into the failure chain.

  • > And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

    When you dogfood your own Rube Goldberg machine.

    • We should let the IAM service team know of this glaring gap the HN thread figured out /s

      I'm 99% ;) certain the dependencies of foundational services are a well-discussed topic
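
Re the "circular but interruptible" idea upthread, here's a minimal sketch of what that could look like, assuming a plain in-process cache (the class, the fetch hook, and the exact grace window are all illustrative; the 18 hours comes from the comment above):

```python
import time

GRACE_SECONDS = 18 * 3600  # the "oh-shit bar": coast on stale data this long

class InterruptibleDependency:
    """Soften an A -> B -> A loop: keep calling the upstream normally,
    but serve the last good result for up to GRACE_SECONDS if it fails."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # call into the circular upstream dependency
        self._cached = None
        self._fetched_at = 0.0

    def get(self):
        try:
            self._cached = self._fetch()     # normal path: refresh the cache
            self._fetched_at = time.time()
        except Exception:
            stale_for = time.time() - self._fetched_at
            if self._cached is None or stale_for > GRACE_SECONDS:
                raise  # bar exhausted: now it's a real outage
        return self._cached
```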

People say this, but this was just a single AZ. In the last 3 years of running my startup mostly out of us-east-1, we've only had one regional outage, and even that was partial, with most instances unaffected.

And honestly, everybody else's stuff is in us-east-1, so at least your failures are correlated with your customers' failures lol.

  • > And honestly, everybody else's stuff is in us-east-1

    Yeah, but why put your eggs in that basket? I moved all our services from east to west (Oregon) a decade ago and haven't looked back.

    • Not OP, but I do single-region us-east-1 for a few reasons:

      1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.

      2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect into us-east-1 and then take a latency hit and pay cross-region data-transfer costs on all traffic just to hop over to another region. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.

      3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.

      4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.

    • 90% of customers are located in us-east-1. Latency to us-east-1 is more important than being up when everyone else is down.

  • None of my stuff is in us-east-1. I chose that specifically 15 years ago. Been a great decision.

Too many people are using it.

In fantasy magic dream land, loads are distributed evenly across different cloud providers.

A single point of failure doesn't exist.

It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.

Healthcare in the US is affordable.

All types of magical stuff exist here.

But no. It's another day. AWS us-east-1 can take down most of the internet.

  • Core AWS services use it too. Even if you are hosted in another region, you can still be affected by a us-east-1 outage.

  • > It worked out with my first girlfriend. The twins are fluent in English and Korean.

    You were dating twins as a form of redundancy?!

I've always been impressed by Amazon's ability to present the shittiest experience possible and imply the blame lies with things like isolation that they don't really provide.

Anecdotally (well, more "second-hand, I heard that..."), it sounds like there were some knock-on effects on us-east-2 as a result of people migrating over from us-east-1. So yeah... kinda hilarious how the multi-region/AZ thing is so plainly a façade, yet we all seem to collectively believe in it as an article of faith in the Cloud Religion... or whatever...

  • It's not magic: given the size of us-east-1, there is no spare capacity elsewhere to absorb all of its workloads.

    • One of the SRE tricks is to reserve your capacity, so when the cloud runs out of capacity you're still covered. It's expensive, but you don't want to get stuck without a server when on-demand capacity dries up.
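
For EC2 specifically, that trick maps onto On-Demand Capacity Reservations. A rough boto3 sketch (the instance type, AZ, and count here are made-up examples):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Hold capacity ahead of time so instances can still launch when the
# region's on-demand pools are exhausted during a failover stampede.
resp = ec2.create_capacity_reservation(
    InstanceType="m5.xlarge",       # hypothetical
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="us-east-2a",  # hypothetical
    InstanceCount=10,               # hypothetical
    EndDateType="unlimited",        # hold until explicitly cancelled
)
print(resp["CapacityReservation"]["CapacityReservationId"])
```

Reserved capacity bills at on-demand rates whether or not you're using it, which is the expensive part.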

Is it really failing more, or do we just not hear about failures happening elsewhere?

The last Azure outage I heard of wasn't even on the HN front page.

  • It really is failing more, and it’s well known amongst industry experts. It’s the oldest, largest, and most utilized region of AWS.

    I've heard people say that the underlying physical infrastructure is older, but I think that's a bit of speculation, although reasonable. The current outage is attributed to a "thermal event", which does indeed suggest an underlying physical-hardware cause.

    It's also the most complex region for AWS themselves, as it's the "control plane" for many of their global services.