Distributed ID Formats Are Architectural Commitments, Not Just Data Types

4 days ago (piljoong.dev)

I wrote an article comparing different GUID implementations, along with a spreadsheet of side-by-side comparisons.

https://medium.com/@orefalo_66733/globally-unique-identifier...

Looking at your implementation, I like the clean split between shard, tenant, and sequence.

However, this results in a 160‑bit format, which does not fit natively in most databases, since their built-in UUID type is only 128 bits. I also find 60 bits of randomness to be low (ULID, by comparison, has 80).

Last point: using a GUID is not only about sharding. It also protects against predictability, which, beyond the GUID's structure, requires using an approved cryptographically secure random generator.
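
For concreteness, here is a minimal sketch of what that could look like (Go; the field widths and byte layout are my own assumptions, not the article's actual format), with the unpredictable tail drawn from crypto/rand rather than a seeded PRNG:

    package main

    import (
        "crypto/rand" // CSPRNG; math/rand here would make IDs guessable
        "encoding/binary"
        "fmt"
        "time"
    )

    // newID sketches a 160-bit (20-byte) ID with a shard/tenant/sequence split.
    // Assumed layout: 48-bit ms timestamp | 16-bit shard | 24-bit tenant |
    // 16-bit sequence | 56-bit random tail (byte-aligned for simplicity).
    func newID(shard uint16, tenant uint32, seq uint16) ([20]byte, error) {
        var id [20]byte

        // 48-bit millisecond timestamp plus 16-bit shard in the first 8 bytes.
        ts := uint64(time.Now().UnixMilli()) & (1<<48 - 1)
        binary.BigEndian.PutUint64(id[0:8], ts<<16|uint64(shard))

        // 24-bit tenant and 16-bit sequence in the next 5 bytes.
        id[8] = byte(tenant >> 16)
        id[9] = byte(tenant >> 8)
        id[10] = byte(tenant)
        binary.BigEndian.PutUint16(id[11:13], seq)

        // The unpredictable part must come from a cryptographically secure
        // source; this is the "approved" generator point above.
        if _, err := rand.Read(id[13:]); err != nil {
            return id, err
        }
        return id, nil
    }

    func main() {
        id, err := newID(7, 42, 1)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%x\n", id)
    }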

> The old auto-increment IDs were totally fine—until suddenly they weren’t, because multiple shards couldn’t share the same global counter anymore.

> Their workaround was simple and surprisingly effective: they offset new IDs by a huge constant—roughly a billion. Old IDs stayed below the threshold, new IDs lived above it, and nothing collided. It worked surprisingly well, but it also taught me something.

So what was the fix? The new numbers are bigger? I need a little more detail.
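
If I'm reading the quote right, the trick amounts to something like this (a guess at the mechanics; the constants and the per-shard banding are mine, not from the article):

    package main

    import "fmt"

    // A guess at the quoted offset trick: old single-DB IDs stay below a
    // threshold, and each shard hands out IDs in its own band above it, so new
    // IDs never collide with the legacy range (or, with per-shard bands, with
    // each other).
    const (
        legacyCeiling = int64(1_000_000_000) // "roughly a billion": old IDs sit below this
        shardStride   = int64(1_000_000_000) // width of each shard's band (assumption)
    )

    // globalID maps a shard-local auto-increment value into the global range.
    func globalID(shard, localSeq int64) int64 {
        return legacyCeiling + shard*shardStride + localSeq
    }

    func main() {
        fmt.Println(globalID(0, 123)) // 1000000123
        fmt.Println(globalID(2, 123)) // 3000000123
    }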

> If your system is running on a single database with moderate traffic, auto-increment is still probably the best answer. Don’t overthink it.

If autoincrement is the simplest way to do things, but breaks if you evolve the system in any conceivable way, maybe autoincrement isn't the simplest way to do things.

Isn't that the point of the article?

The checksum idea is interesting, but why make it a tack-on at the end? Spending 20 of the random bits on a mandatory checksum seems like a real trade-off.
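
For reference, spending those 20 bits might look roughly like this (the checksum function and bit placement are my choices, not the article's):

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    // addChecksum illustrates the trade-off: the final 20 bits of a 160-bit ID
    // carry a checksum of the other 140 bits instead of randomness. CRC-32
    // truncated to 20 bits is an arbitrary choice here.
    func addChecksum(id [20]byte) [20]byte {
        sum := checksum(id)
        id[17] = (id[17] & 0xF0) | byte(sum>>16) // low 4 bits of byte 17
        id[18] = byte(sum >> 8)
        id[19] = byte(sum)
        return id
    }

    // verify recomputes the checksum; a mistyped or corrupted ID can be
    // rejected before it ever reaches the database.
    func verify(id [20]byte) bool {
        stored := uint32(id[17]&0x0F)<<16 | uint32(id[18])<<8 | uint32(id[19])
        return checksum(id) == stored
    }

    // checksum covers everything except the 20 bits it will occupy.
    func checksum(id [20]byte) uint32 {
        covered := make([]byte, 18)
        copy(covered, id[:18])
        covered[17] &= 0xF0 // mask out the checksum's own bits
        return crc32.ChecksumIEEE(covered) & 0xFFFFF
    }

    func main() {
        var id [20]byte // dummy ID; in practice this comes from the generator
        id = addChecksum(id)
        fmt.Println("checksum ok:", verify(id))
    }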

An epoch shift on a 48-bit timestamp that already has >12,000 years of range, just to get another 50 years, is an amusing choice.
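
Quick back-of-the-envelope on that range (the article's tick size isn't quoted here, so a few plausible values are shown; whichever it is, 50 extra years is noise):

    package main

    import "fmt"

    // At 1 ms ticks, 2^48 covers roughly 8,900 years; the >12,000-year figure
    // suggests a slightly coarser tick, which only pushes the range further out.
    func main() {
        const ticks = 1 << 48
        const secondsPerYear = 365.25 * 24 * 3600
        for _, t := range []struct {
            name    string
            seconds float64
        }{
            {"1 ms", 1e-3},
            {"10 ms", 1e-2},
            {"1 s", 1},
        } {
            fmt.Printf("%-6s ticks -> ~%.0f years of range\n", t.name, ticks*t.seconds/secondsPerYear)
        }
    }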

> ID formats aren’t just formats. They’re commitments.

Reading direct LLM output is highly cringeworthy.