Comment by donavanm
4 days ago
Since ~2014 the constraint on all HDD-based storage has been IOPS/throughput/queue time. Shortly after that we started seeing "minimum" device sizes so large that it was hard to productively use their total capacity. Glacier-type retrieval is also nice in that you have much more room for "best effort" scheduling and queuing compared to "real time" requests like S3:PutObject.
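Very roughly, the slack looks like this. Toy sketch only: the 12-hour window, the device names, and the batch-by-device policy are assumptions for illustration, not how any particular service works.

```python
# Toy sketch of the scheduling slack a Glacier-style retrieval allows: each
# job carries a deadline hours out, so the scheduler can batch work per
# device and only wake a drive when something is close to its deadline or
# the batch is big enough. Window length and policy are assumptions.
import heapq
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(order=True)
class RetrievalJob:
    deadline: float
    object_id: str = field(compare=False)
    device: str = field(compare=False)


class BestEffortScheduler:
    def __init__(self) -> None:
        self._jobs: list[RetrievalJob] = []  # min-heap ordered by deadline

    def submit(self, object_id: str, device: str, window_s: float = 12 * 3600) -> None:
        heapq.heappush(self._jobs, RetrievalJob(time.time() + window_s, object_id, device))

    def deadline_pressure(self, horizon_s: float = 3600) -> bool:
        # True once the earliest deadline is within the horizon, i.e. we
        # can no longer afford to keep waiting for a bigger batch.
        return bool(self._jobs) and self._jobs[0].deadline <= time.time() + horizon_s

    def drain_batch(self) -> dict[str, list[str]]:
        # Group every queued job by device so one pass over a disk (or one
        # tape mount) satisfies many requests with few seeks.
        batch: dict[str, list[str]] = defaultdict(list)
        while self._jobs:
            job = heapq.heappop(self._jobs)
            batch[job.device].append(job.object_id)
        return dict(batch)
```

A synchronous PutObject has none of that freedom: it has to be acknowledged now, on whatever device is in the write path.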
Last I was aware, flash/NVMe storage didn't have quite the same problem, due to orders-of-magnitude better access times and parallelism. But you can combine the two into a kind of distributed reimplementation of access tiering (behind a single consistent API or block interface).
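Something like this read path, as a minimal sketch. The tier classes and the promote-on-read policy are made up for illustration, not any specific product's design.

```python
# Minimal sketch of tiering behind one consistent API: a hot (flash) tier
# in front of a cold (HDD) tier. The in-memory "tiers" and the policy here
# are illustrative assumptions only.
from typing import Optional


class Tier:
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


class TieredStore:
    """Single get/put interface over a hot (flash) and cold (HDD) tier."""

    def __init__(self) -> None:
        self.hot = Tier()   # stands in for flash/NVMe
        self.cold = Tier()  # stands in for big, cheap HDDs

    def put(self, key: str, value: bytes) -> None:
        # New writes land on the cold tier; the hot tier acts as a read cache.
        self.cold.put(key, value)

    def get(self, key: str) -> Optional[bytes]:
        value = self.hot.get(key)
        if value is None:
            value = self.cold.get(key)
            if value is not None:
                # Promote on read so repeat access is served from flash.
                self.hot.put(key, value)
        return value
```

The caller only ever sees get/put; which medium actually served the request is the store's problem.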
There's a really old trick with HDDs where you buy a big disk and then allocate less than half of it. The outer tracks hold more sectors, so the first part of the disk gives more throughput and needs fewer seeks for the same amount of data, and never having to seek across the back half of the disk reduces the worst-case seek time. All of these increase IOPS.
But then what do you do with the other half of the disk? If you access it while the machine is serving its normal workload, you lose most of those benefits.
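Back-of-envelope on the seek side, with a crude linear seek model and made-up timings (a real drive's seek profile isn't linear, so treat this only as a sketch of the shape of the win):

```python
# Rough sketch of why short stroking (using only the first part of the
# drive) improves seek behaviour. The numbers and the linear timing model
# are assumptions for illustration, not measurements of a real drive.

FULL_STROKE_MS = 16.0     # assumed worst-case seek across the whole disk
TRACK_TO_TRACK_MS = 0.5   # assumed minimum seek


def avg_seek_ms(used_fraction: float) -> float:
    # With uniformly random requests over the used region, the average
    # seek distance is roughly 1/3 of that region's width.
    span_ms = (FULL_STROKE_MS - TRACK_TO_TRACK_MS) * used_fraction
    return TRACK_TO_TRACK_MS + span_ms / 3.0


def worst_seek_ms(used_fraction: float) -> float:
    return TRACK_TO_TRACK_MS + (FULL_STROKE_MS - TRACK_TO_TRACK_MS) * used_fraction


for frac in (1.0, 0.5, 0.25):
    print(f"using first {frac:.0%} of the disk: "
          f"avg seek ~{avg_seek_ms(frac):.1f} ms, "
          f"worst case ~{worst_seek_ms(frac):.1f} ms")
```

The moment the heads have to visit the cold half as well, you're back to paying the full stroke.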
For deep storage you have two problems: time to access the files, and resources to locate the files. In a distributed file store there's the potential for chatty access or large memory footprints for directory structures. You might need an elaborate system to locate file 54325 if you're doing some consistent-hashing scheme, but the customer has no clue what 54325 is. They want the birthday party video. So they still need a directory structure even if the placement scheme itself can avoid one.
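Roughly the two lookups involved, as a sketch. The node names, ring parameters, and the name-to-ID mapping are illustrative assumptions; only the consistent-hashing idea itself is the point.

```python
# Sketch of the two lookups: a consistent-hash ring tells the system which
# node holds object 54325 without any central directory, but the customer
# still needs a directory mapping the name they know to that opaque ID.
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node gets several virtual points on the ring for balance.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, object_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash.
        idx = bisect.bisect(self._keys, _hash(object_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["storage-node-a", "storage-node-b", "storage-node-c"])

# The placement scheme can locate the opaque ID on its own...
print(ring.node_for("54325"))

# ...but the customer-facing directory is still needed to get that ID.
directory = {"videos/birthday party.mp4": "54325"}
print(ring.node_for(directory["videos/birthday party.mp4"]))
```

The ring spares you a placement database, but the human-readable namespace, and whatever memory or round trips it costs, doesn't go away.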