Comment by dlenski
8 hours ago
The idea that AWS's services are fully regionalized or isolated has always been a myth.
All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.
And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
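For anyone who wants to check their own stack for this kind of loop, here is a minimal sketch of cycle detection over a service dependency map. The graph below is a made-up illustration based only on what's said above (IAM relies on DynamoDB, which relies on IAM); the `s3` entry and the `find_cycle` helper are hypothetical, not anything AWS publishes.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {name: WHITE for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for name in list(deps):
        if color[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical dependency map, for illustration only.
services = {
    "iam": ["dynamodb"],
    "dynamodb": ["iam"],
    "s3": ["iam"],
}
cycle = find_cycle(services)
print(cycle)
```

A depth-first search with three colors is enough here: hitting a node that is still on the current path means the dependency chain loops back on itself.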
During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.
But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you by things like Terraform) suddenly lose their luster.
Services outside of us-east-1 don't call us-east-1 for the IAM data plane though, right?
They’re talking about the backbone and what goes on behind the scenes. There have been issues with services in other regions when us-east-1 has issues.
Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.
Isn't this kind of circular dependency what led to extended downtime a while back?
It reminds me of Facebook. Staff were locked out of the office by the very outage they were supposed to fix.
It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.
Yes, I concur.
Sometimes the circular dependencies get almost cartoonishly silly.
Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs in the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live in an S3 bucket."
I made that example up, but only barely.
When you have a circular dependency, one strategy is to keep it circular but make it interruptible for 18 or so hours. Call it an oh-shit bar.
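A rough sketch of what "interruptible" could look like in code, assuming the dependency hands back something cacheable (a token, a config blob). `GraceCache`, `ttl_s`, and `grace_s` are names invented for this example; the 18-hour figure is the one from the comment above, and nothing here is how any AWS service is actually built.

```python
import time
from typing import Any, Callable, Optional


class GraceCache:
    """Cache a dependency's answer and keep serving it through an outage.

    Within ttl_s the cached value is used without calling the dependency.
    After ttl_s a refresh is attempted; if that fails, the stale value is
    still served until ttl_s + grace_s has passed, then the error surfaces.
    """

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_s: float,
        grace_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fetch = fetch
        self.ttl_s = ttl_s
        self.grace_s = grace_s
        self.clock = clock
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self.clock()
        age = None if self._fetched_at is None else now - self._fetched_at
        if age is not None and age < self.ttl_s:
            return self._value  # still fresh, no call needed
        try:
            self._value = self.fetch()
            self._fetched_at = now
            return self._value
        except Exception:
            # Dependency is down: ride on the stale value inside the grace window.
            if age is not None and age < self.ttl_s + self.grace_s:
                return self._value
            raise


# Hypothetical usage: refresh every minute, survive an 18-hour outage.
# cache = GraceCache(fetch=get_token_from_dependency, ttl_s=60, grace_s=18 * 3600)
```

The trade-off is the usual one: the longer the grace window, the longer a revoked or changed value can keep being served.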
I'm glad I never had to get that deep into the failure chain.
> And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
When you dogfood your own Rube Goldberg machine.
We should let the IAM service team know of this glaring gap the HN thread figured out /s
I'm 99% ;) certain that dependencies of foundational services are a well-discussed topic.