Comment by biimugan
4 days ago
> You don’t have to randomize the first part of your object keys to ensure they get spread around and avoid hotspots.
From my understanding, I don't think this is completely accurate. But, to be fair, AWS doesn't really document this very well.
From my (informal) conversations with AWS engineers a few months ago, it works approximately like this (modulo some details I'm sure the engineers didn't really want to share):
S3 requests scale based on something called a 'partition'. Partitions form automatically based on the smallest common prefixes among objects in your bucket and on how many requests the objects under each prefix receive. The bucket starts out with a single partition.
So as an example, if you have a bucket with the objects "2025-08-20/foo.txt" and "2025-08-19/foo.txt", the smallest common prefix is "2" (or maybe it considers the root as the generator partition, I don't actually know). (As a reminder, a / in an object key has no special significance in S3 -- it's just another character; there are no "sub-directories".) So the bucket starts with a single partition based on that prefix.
Now if the object "2025-08-20/foo.txt" suddenly receives a ton of requests, what you'll see is S3 throttling those requests for approximately 30-60 minutes. That's the amount of time it takes for a new partition to form. In this case, the smallest common prefix for "2025-08-20/foo.txt" is "2025-08-2", so a second partition forms for that prefix. (Again, the details here may not be fully accurate, but this is the example conveyed to me.) Once the partition forms, you're good to go.
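To make the prefix behavior concrete, here's a rough plain-Python illustration (not anything S3 exposes, and the example keys are made up) of why date-keyed objects all pile onto one long shared prefix while a short random lead-in spreads them out:

```python
# Plain-Python illustration of the prefix idea above -- not S3's actual
# partitioning algorithm, just a way to see how date-based keys cluster.
import os

date_keys = ["2025-08-19/foo.txt", "2025-08-20/foo.txt", "2025-08-20/bar.txt"]
hashed_keys = ["a3f1/2025-08-20/foo.txt", "9c0e/2025-08-19/foo.txt"]

# os.path.commonprefix compares character by character, which mirrors how S3
# treats keys: '/' is just another character, not a directory separator.
print(os.path.commonprefix(date_keys))    # "2025-08-" -> all traffic shares one prefix
print(os.path.commonprefix(hashed_keys))  # ""         -> traffic spreads across prefixes
```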
But the key issue with the above situation is that you have to wait out that warm-up time. So if you have some workload generating or reading a ton of small objects, that workload may get throttled for a non-trivial amount of time until partitions can form. If the workload is sensitive to multi-minute latency, that's basically an outage condition.
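In the meantime, about the only thing the client can do during that window is back off and retry. A minimal sketch, assuming boto3 and leaning on its built-in adaptive retry mode (bucket and key names here are hypothetical):

```python
import boto3
from botocore.config import Config

# "adaptive" retry mode backs off client-side when S3 starts returning
# 503 Slow Down, which is what throttling during partition warm-up looks like.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

resp = s3.get_object(Bucket="my-bucket", Key="2025-08-20/foo.txt")
data = resp["Body"].read()
```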
One way around this is to submit an AWS support ticket and have them pre-generate partitions for you before your workload actually goes live. Or you could simulate load to generate the partitions ahead of time. But obviously, neither of these is ideal. Really, you shouldn't store billions of tiny objects and expect unlimited scalability with no latency. For example, you could put some kind of caching layer in front of S3.
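As a minimal sketch of that caching-layer idea, assuming boto3 and a simple in-process read-through cache (bucket and key names are hypothetical; a real deployment would more likely sit behind CloudFront or a shared cache):

```python
import functools
import boto3

s3 = boto3.client("s3")

@functools.lru_cache(maxsize=4096)
def get_object_cached(bucket: str, key: str) -> bytes:
    # Read-through cache: repeated reads of a hot key are served from local
    # memory instead of hammering the same S3 partition.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

payload = get_object_cached("my-bucket", "2025-08-20/foo.txt")
```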
Yep, this is still a thing. In the past year I’ve been throttled due to hot partitions. They’ve improved the partitioning so you hit it less, but if you scale too fast you will get limited.
Hit it when building an Iceberg lakehouse using pre-existing data. Using object prefixes fixed the issue.
This is my understanding too, and this is particularly problematic for workloads that are read/write heavy on very recent data. When partitioning by a date or by an auto-incrementing id, you still run into the same issue.
Ex: your prefix is /id=12345. S3, under the hood, generates partitions named `/id=` and `/id=1`. When your id rolls over to `/id=20000`, all read/write activity on `/id=2xxxx` falls back to the original partition, so on rollover you end up with read contention.
For any high-throughput workloads with unevenly distributed reads, you are best off using some element of randomness, or some evenly distributed partition key, at the root of your path.
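A common way to do that is to prepend a short, deterministic hash of the logical key, so activity fans out across prefixes while the stored key can still be recomputed on read. A minimal sketch, assuming boto3 (bucket and key names are hypothetical):

```python
import hashlib
import boto3

s3 = boto3.client("s3")

def spread_key(logical_key: str, prefix_len: int = 4) -> str:
    # Prepend a few hex characters of a hash so keys fan out across prefixes.
    # Deterministic: the stored key can always be recomputed from the logical key.
    digest = hashlib.md5(logical_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{logical_key}"

logical = "id=20000/event.json"
s3.put_object(Bucket="my-bucket", Key=spread_key(logical), Body=b"{}")
# stored under "<4 hex chars>/id=20000/event.json" instead of "id=20000/..."
```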