Comment by atombender
8 hours ago
I really wish there were a seamless system for this. Once you try to do this kind of thing, you run into all sorts of rabbit holes and cans of worms.
For example, coalescing blobs into "superblobs" to avoid a proliferation of small objects means you invent a whole system for tracking "subfiles" within a bigger file.
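A minimal sketch of what that "subfile" tracking ends up looking like, with illustrative names only (no particular system's API): small blobs are concatenated into one object, and an index of (offset, length) entries lets reads become ranged GETs into that object.

```python
# Sketch: packing small blobs into a single "superblob" plus an
# offset/length index. Names and structures are illustrative.
from dataclasses import dataclass

@dataclass
class SubfileEntry:
    blob_id: str
    offset: int
    length: int

def pack_superblob(blobs: dict[str, bytes]) -> tuple[bytes, list[SubfileEntry]]:
    """Concatenate small blobs into one payload; return it with its index."""
    index, parts, offset = [], [], 0
    for blob_id, data in blobs.items():
        index.append(SubfileEntry(blob_id, offset, len(data)))
        parts.append(data)
        offset += len(data)
    return b"".join(parts), index

def read_subfile(superblob: bytes, index: list[SubfileEntry], blob_id: str) -> bytes:
    """Against a real bucket this slice would be a ranged GET
    (Range: bytes=offset..offset+length-1) on the superblob object."""
    entry = next(e for e in index if e.blob_id == blob_id)
    return superblob[entry.offset : entry.offset + entry.length]
```

The index itself is exactly the metadata you now have to store and keep consistent somewhere, which is where the rabbit hole starts.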
And you'll need a compacting job to ensure old, deleted data is expunged, which may be more important than you think if the data has to be erased for privacy or legal reasons.
Object storage has no in-place mutation, so this compaction has to be transactionally safe and must be careful not to leave behind cruft on failure, and so on.
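One common write ordering that makes such compaction crash-safe is: upload the compacted data under a new key, atomically flip a manifest pointer (in the database) to the new key, and only then delete the old object. A crash between steps leaves either the old state intact or an orphaned new object that a janitor job can garbage-collect. A toy sketch, with in-memory dicts standing in for the bucket and the DB row:

```python
# Sketch of a crash-safe compaction ordering for immutable object storage.
# `store` stands in for S3/GCS; `manifest` stands in for a DB row that is
# the source of truth for which object key is current.
store: dict[str, bytes] = {}
manifest: dict[str, str] = {}  # logical name -> current object key

def compact(name: str, live_data: bytes, new_key: str) -> None:
    old_key = manifest[name]
    store[new_key] = live_data   # step 1: upload compacted object under a new key
    manifest[name] = new_key     # step 2: atomic pointer flip (a DB transaction)
    store.pop(old_key, None)     # step 3: delete the superseded object

def find_orphans() -> set[str]:
    """Objects not referenced by the manifest; safe for a janitor to delete
    (e.g. after a crash between steps 1 and 2)."""
    return set(store) - set(manifest.values())
```

If step 3 fails, the old object is itself an orphan and gets cleaned up the same way, so no cruft survives indefinitely.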
Furthermore, storing blobs in object storage without keeping a local inventory of them is, in my experience, a disaster. For example, if your database has tenants or some other structural grouping, even something simple like finding out how much blob storage a specific tenant uses is a very time-consuming operation on S3/GCS/etc., because you have to list the whole bucket by prefix and sum the object sizes yourself. So for every blob you store, keep a database table recording what it is, so that the only object operations you ever do are reads and writes, never metadata operations.
Sure, there are things like inventory reports on GCS that can help, but I would still say you need to track this stuff transactionally. The database must be the source of truth, and the object storage must never be used as a database.
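The inventory table can be as simple as this sketch (schema and names are illustrative, using SQLite as a stand-in for whatever database you already have). The point is that per-tenant accounting becomes a SQL query instead of a full bucket listing:

```python
# Sketch: a transactional blob inventory so "how much storage does
# tenant X use?" is a SQL aggregate, not a bucket scan.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE blobs (
        tenant_id  TEXT NOT NULL,
        object_key TEXT NOT NULL PRIMARY KEY,
        size_bytes INTEGER NOT NULL
    )
""")

def record_blob(tenant_id: str, object_key: str, size_bytes: int) -> None:
    # In a real system this insert commits in the same transaction that
    # records the upload, keeping the database the source of truth.
    with db:
        db.execute("INSERT INTO blobs VALUES (?, ?, ?)",
                   (tenant_id, object_key, size_bytes))

def tenant_usage(tenant_id: str) -> int:
    row = db.execute(
        "SELECT COALESCE(SUM(size_bytes), 0) FROM blobs WHERE tenant_id = ?",
        (tenant_id,)).fetchone()
    return row[0]
```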
And so on.
This need to store many small objects in object storage is coming up more and more for me, as is the desire to mutate them in place, or at least append to them. For example, imagine building a database that stores a replicated copy of itself in the cloud. There is no way to do this in S3-like object storage without representing the data as a series of immutable "snapshots" and "deltas". Appending this way is fast, but you eventually run into the need to compact, and you absolutely have to batch up the uploads to avoid writing too many small objects.
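The snapshot-plus-deltas pattern with batched uploads can be sketched like this (an in-memory dict stands in for the bucket; everything here is illustrative): writes buffer locally, each flush uploads one immutable delta object covering many updates, and reads replay the snapshot plus deltas in order. Compaction would fold the replayed state back into a single new snapshot object.

```python
# Sketch of a snapshot-plus-deltas log over immutable object storage,
# with write batching so each uploaded object covers many small updates.
import json

bucket: dict[str, bytes] = {}  # stand-in for S3/GCS

class DeltaLog:
    def __init__(self, batch_size: int = 100):
        self.batch_size = batch_size
        self.buffer: dict[str, str] = {}
        self.seq = 0

    def put(self, key: str, value: str) -> None:
        self.buffer[key] = value
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Upload one immutable delta object covering the whole batch."""
        if self.buffer:
            bucket[f"delta/{self.seq:08d}"] = json.dumps(self.buffer).encode()
            self.seq += 1
            self.buffer = {}

    def materialize(self) -> dict[str, str]:
        """Replay snapshot + deltas in order; compaction would write this
        result out as a new snapshot and delete the old delta objects."""
        state = json.loads(bucket.get("snapshot", b"{}"))
        for key in sorted(k for k in bucket if k.startswith("delta/")):
            state.update(json.loads(bucket[key]))
        return state
```

Note that `materialize` is exactly why compaction becomes unavoidable: read cost grows with the number of deltas until someone folds them back down.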
So lately I've pondered using something else for this type of work: a key/value database such as FoundationDB or TiKV, or even something like Ceph. I wonder if anyone else has tried that?
Well, I think this is what our company, Archil, is working on. We basically built an SSD clustering layer that proxies, caches, and assembles requests into object storage so that you can run a POSIX file system directly on top.
There are also some really great projects in this space, like SlateDB, which could be closer to what you're looking for (a roughly RocksDB-like API that runs on S3).
Your product looks very interesting, I will take a look!
Well, we've made small objects work well on Tigris (https://www.tigrisdata.com/), and we have several use cases of folks using it as a KV store. Funny that you mention FoundationDB: we use it for our metadata storage.
I've heard good things about Tigris. If that means I can store billions of objects without being bankrupted by request costs, and it has fast read access (GCS is quite poor here), then that helps a lot! I'm looking right now for a system that lets me store lots of very small blobs, around 4KB each.