Comment by cdbattags
3 years ago
Anyone else think maybe AWS (S3) has made this optimization already? Or would it just be a whole team of kernel engineers optimizing it there?
The overhead on CPU cycles this would save cloud storage systems... Can someone help me quantify the potential savings?
Edit:
They specifically don't list their storage medium in their marketing:
Object stores typically do optimizations where they store small files using a different strategy than large ones, maybe directly in the metadata database.
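To make the small-versus-large split concrete, here is a minimal in-memory sketch. The 4 KiB threshold, the `TinyObjectStore` class, and the dict standing in for a metadata database are all hypothetical, picked only for illustration; nothing here reflects how S3 or any real object store is built.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

INLINE_THRESHOLD = 4096  # hypothetical cutoff in bytes, for illustration only


@dataclass
class Metadata:
    size: int
    inline_data: Optional[bytes] = None  # payload kept in the metadata row
    blob_id: Optional[int] = None        # pointer into the separate blob store


@dataclass
class TinyObjectStore:
    metadata: Dict[str, Metadata] = field(default_factory=dict)
    blobs: Dict[int, bytes] = field(default_factory=dict)
    next_blob_id: int = 0

    def put(self, key: str, data: bytes) -> None:
        # Drop any blob left behind by a previous large value for this key.
        old = self.metadata.get(key)
        if old is not None and old.blob_id is not None:
            del self.blobs[old.blob_id]

        if len(data) <= INLINE_THRESHOLD:
            # Small object: one metadata lookup returns the payload too.
            self.metadata[key] = Metadata(size=len(data), inline_data=data)
        else:
            # Large object: metadata only records where the bytes live.
            blob_id = self.next_blob_id
            self.next_blob_id += 1
            self.blobs[blob_id] = data
            self.metadata[key] = Metadata(size=len(data), blob_id=blob_id)

    def get(self, key: str) -> bytes:
        meta = self.metadata[key]
        if meta.inline_data is not None:
            return meta.inline_data
        return self.blobs[meta.blob_id]


store = TinyObjectStore()
store.put("thumb.png", b"x" * 10)       # stored inline
store.put("video.mp4", b"y" * 10_000)   # stored in the blob store
```

The point of the split is that a small read never touches the blob store at all, which is roughly the kind of saving the linked post is after for local filesystems.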
So that means yes but also they've gone past that optimization?
S3 has always been designed and optimised for large files.
In order to maintain high availability they deliberately trade away latency.
So this blog post only really applies to local filesystems, not object stores like S3.
Facebook published a paper on Haystack, their storage system for pictures, which IIRC is something like slab allocation.
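For anyone who hasn't read it, the general idea of packing many small blobs into one big file with an in-memory index can be sketched like this. This is a loose illustration only: the `Volume` class and its methods are made up here, and an `io.BytesIO` buffer stands in for the large on-disk file. It is not Haystack's actual on-disk format.

```python
import io
from typing import Dict, Tuple


class Volume:
    """One large append-only file holding many small blobs (hypothetical)."""

    def __init__(self) -> None:
        self._file = io.BytesIO()  # stand-in for a big preallocated file
        self._index: Dict[str, Tuple[int, int]] = {}  # key -> (offset, length)

    def put(self, key: str, data: bytes) -> None:
        # Append at the end and remember where the blob landed.
        offset = self._file.seek(0, io.SEEK_END)
        self._file.write(data)
        self._index[key] = (offset, len(data))

    def get(self, key: str) -> bytes:
        # One index lookup plus one seek and read, with no per-blob
        # filesystem metadata (inode, directory entry) involved.
        offset, length = self._index[key]
        self._file.seek(offset)
        return self._file.read(length)


vol = Volume()
vol.put("photo-1", b"first")
vol.put("photo-2", b"second!")
```

Overwriting a key here just appends a new copy and repoints the index, so the old bytes become garbage until something compacts the file.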
S3 is similar in the sense that its usage is completely different from a file system's (no hierarchy, direct access, no need for efficient listing, etc.), so I'm pretty sure they use something similar.
Even if it is bypassing the file system, S3 is itself essentially a file system. It has all the usual features of paths, permissions, and so on. I assume it can't completely escape the same issues.
S3 is a key-value store where object keys might contain slashes, but the implied directories don’t really exist. This is a problem for Spark and Hadoop jobs that expect to rename a large temp dir to signal that a stage’s output has been committed, because HDFS can do that atomically but S3 requires renaming objects one by one. IAM security policies also apply to keys or prefixes (renaming an object might change someone’s access level) and changes are cached for tens of minutes.
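A toy version of the rename problem, with a plain dict standing in for the bucket. The `rename_prefix` helper and the key names are hypothetical, and a real client would be issuing a copy and a delete request per object rather than mutating a dict:

```python
from typing import Dict, List


def rename_prefix(store: Dict[str, bytes], old: str, new: str) -> List[str]:
    """Move every key under `old` to `new`, one object at a time."""
    moved = []
    # Snapshot the matching keys first so the dict can be mutated in the loop.
    for key in sorted(k for k in store if k.startswith(old)):
        new_key = new + key[len(old):]
        store[new_key] = store[key]  # copy the object to its new key
        del store[key]               # then delete the original
        moved.append(new_key)
        # A reader listing the bucket at this point sees a half-renamed
        # "directory": some parts under `new`, the rest still under `old`.
    return moved


bucket = {
    "tmp/job1/part-0": b"a",
    "tmp/job1/part-1": b"b",
    "out/other": b"c",
}
moved = rename_prefix(bucket, "tmp/job1/", "out/job1/")
```

The work is proportional to the number of objects under the prefix, and there is no single step at which the whole output becomes visible, which is exactly what a commit-by-rename protocol needs.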
S3 didn’t use to be strongly consistent, though surprisingly they delivered strong consistency in December 2020 (https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3...), which I hope they’re proud of.
Some people have been crazy enough to store tables of padded data in the keys of a lot of zero-length objects (which they do charge for) and use ListObjects for paginated prefix queries. It doesn’t much matter whether keys have slashes or commas or what.
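A local simulation of that trick, for the curious. The key layout (`table,user_id,timestamp`), the padding widths, and both helper functions are made up for this example, and a sorted Python list stands in for the bucket. Against real S3 the paging would go through ListObjectsV2 with its `Prefix`, `MaxKeys` and `StartAfter` parameters instead of `list_page`.

```python
import bisect
from typing import List, Optional, Tuple


def make_key(table: str, user_id: int, ts: int) -> str:
    # Zero-pad numbers so lexicographic key order matches numeric order.
    return f"{table},{user_id:010d},{ts:012d}"


def list_page(sorted_keys: List[str], prefix: str, max_keys: int,
              start_after: Optional[str] = None
              ) -> Tuple[List[str], Optional[str]]:
    """Return one page of keys starting with `prefix`, plus a marker to
    pass as `start_after` for the next page (None when exhausted)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    if start_after is not None:
        lo = max(lo, bisect.bisect_right(sorted_keys, start_after))
    end = lo
    while (end < len(sorted_keys) and end - lo < max_keys
           and sorted_keys[end].startswith(prefix)):
        end += 1
    page = sorted_keys[lo:end]
    has_more = end < len(sorted_keys) and sorted_keys[end].startswith(prefix)
    return page, (page[-1] if has_more and page else None)


# Every "row" is a zero-length object; all the data lives in the key.
keys = sorted(make_key("events", u, t) for u in (7, 42) for t in (1, 2, 3))

prefix = "events,0000000042,"
page1, marker = list_page(keys, prefix, max_keys=2)
page2, marker2 = list_page(keys, prefix, max_keys=2, start_after=marker)
```

The padding is what makes this work: without it, user 42 would sort before user 7 and range-style prefix queries would return rows in the wrong order.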
But that would be 1 layer up in a network of that size, no?
Edit:
Let's call it "Ring -1"