Comment by dlenski
10 hours ago
The idea that AWS's services are fully regionalized or isolated has always been a myth.
All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.
And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.
But then you want to use the same stack across providers, and all the proprietary technologies (even those hidden from you by tools like Terraform) suddenly lose their luster.
> outside of China
[Nitpick] There are a few more AWS partitions like GovCloud:
https://jasonbutz.info/2023/07/aws-partitions/
IAM isn’t even really the most painful dependency. Route53 is. Its control plane only runs out of us-east-1.
Better make sure the only DNS operations you run during an outage are data plane queries and health check failovers.
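To make the distinction concrete, here's a toy sketch of an outage-mode guard. The operation names mirror real Route 53 API actions, but the guard itself is a hypothetical helper for a runbook, not part of any AWS SDK:

```python
# Illustrative sketch: classify Route 53 operations by plane so a runbook
# can refuse control-plane calls during a us-east-1 incident.

# Control-plane actions are served only out of us-east-1.
CONTROL_PLANE_OPS = {
    "ChangeResourceRecordSets",
    "CreateHostedZone",
    "DeleteHostedZone",
}

# Data-plane behavior (DNS resolution, health-check evaluation, failover)
# is globally distributed and keeps working when the control plane is down.
DATA_PLANE_OPS = {
    "DnsQuery",            # ordinary resolution at the edge
    "HealthCheckFailover", # automatic failover driven by health checks
}

def safe_during_outage(op: str) -> bool:
    """Return True only for operations that avoid the us-east-1 control plane."""
    return op in DATA_PLANE_OPS

print(safe_during_outage("DnsQuery"))                  # True
print(safe_during_outage("ChangeResourceRecordSets"))  # False
```

The point being: automatic health-check failover is a data-plane mechanism, so design your DNS failover to trigger on its own rather than requiring someone to push a record change through the control plane mid-outage.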
Services outside of us-east-1 don’t call us-east-1 for the IAM data plane though, right?
They’re talking about the backbone and what goes on behind the scenes. There have been issues with services in other regions when us-east-1 has issues.
Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.
Isn't this kind of circular dependency what led to extended downtime a while back?
It reminds me of Facebook: staff were locked out of the office during the very outage they were supposed to fix.
It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.
Yes, I concur.
Sometimes the circular dependencies get almost cartoonishly silly.
Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs into the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live on an S3 bucket."
I made that example up, but only barely.
A circular dependency and a single point of failure are not the same thing. If I have a single point of failure and it goes down, I fix it and things work again. If I have a circular dependency, there is no obvious way to fix anything that's broken any longer.
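The distinction shows up concretely when you try to compute a recovery order: a single point of failure still has a valid bootstrap sequence, while a dependency graph with a cycle has none. A toy sketch (the service names are hypothetical):

```python
# Toy sketch: a recovery plan is a topological sort of the dependency
# graph. A SPOF still sorts fine; a cycle means no valid bootstrap order.
from graphlib import TopologicalSorter, CycleError

# SPOF case: everything depends on "auth", but the graph is acyclic,
# so there's an obvious order: bring up auth first, then the rest.
spof = {"app": {"auth"}, "db": {"auth"}, "auth": set()}
print(list(TopologicalSorter(spof).static_order()))  # 'auth' comes first

# Circular case: auth needs the db, and the db needs auth to authorize
# its own requests. No valid order exists.
circular = {"auth": {"db"}, "db": {"auth"}}
try:
    list(TopologicalSorter(circular).static_order())
except CycleError:
    print("no valid recovery order")  # the cycle must be broken by hand
```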
When you have a circular dependency, one strategy is to keep it circular but interruptible for 18 or so hours. Call it an "oh shit" bar.
I'm glad I never had to get that deep into the failure chain.
> And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
When you dogfood your own Rube Goldberg machine.
We should let the IAM service team know of this glaring gap the HN thread figured out /s
I’m 99% ;) certain the dependencies of foundational services are a well-discussed topic internally.
> The idea that AWS's services are fully regionalized or isolated has always been a myth.
This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]
The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.
Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).
> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.
If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.
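Concretely, the regional STS endpoints follow a predictable naming scheme, and SDKs can be pinned to them so token grants never leave the workload's region. The helper below is an illustrative sketch; with boto3 the equivalent is setting `region_name`/`endpoint_url`, or the `AWS_STS_REGIONAL_ENDPOINTS=regional` configuration:

```python
# Illustrative: regional STS endpoints have the form
# sts.<region>.amazonaws.com, while the legacy global endpoint
# sts.amazonaws.com is served from us-east-1. A statically stable
# workload pins token requests to its own region's endpoint.

def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for a given AWS region."""
    return f"https://sts.{region}.amazonaws.com"

# With boto3 this corresponds to something like:
#   boto3.client("sts", region_name="us-west-2",
#                endpoint_url=regional_sts_endpoint("us-west-2"))
# or exporting AWS_STS_REGIONAL_ENDPOINTS=regional.

print(regional_sts_endpoint("us-west-2"))  # https://sts.us-west-2.amazonaws.com
```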
[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."
[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."
[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."