Comment by thecloud
1 day ago
Thanks for the insight! Mind expanding on what alternatives are being used in high reliability systems instead of UUIDv4?
In high-reliability systems a criterion for identifier design is easy detection of defective identifiers. This includes buggy systems and adversarial manipulation.
The problem with UUIDs that rely on entropy sources is that it is computationally expensive to detect whether the statistical distribution of identifiers is diverging from what you would expect from a random oracle. I've written systems that can detect entropy-source anomalies, but you'll want to turn that detection off in production.
It is pretty cheap to sanity check most non-probabilistic identifier schemes. UUIDs that use broken hash algorithms (e.g. UUIDv3/5) or leak state (e.g. UUIDv7) are exposed to adversarial exploitation.
The identifier scheme is dependent on the use case. Does the uniqueness constraint apply to the instance of the object or the contents of the object? Is the generation of identifiers federated across untrusted nodes? How large is the potential universe of identifiers?
The basic scheme I've seen is a 128-bit structured value that has no probabilistic component. These identifiers can be encrypted with AES-128 when exported to the public, guaranteeing uniqueness while leaking no internal state. The benefit of this scheme is that it is usually drop-in compatible with standard UUID even though it is technically not a UUID and the internal structure can carry useful metadata about the identifier if you can decrypt it.
Federated generation across untrusted nodes requires a more complex scheme, particularly if the universe of identifiers is extremely large. These intrinsically have a collision risk regardless of how the identifiers are generated.
All of the standardized UUID versions really weren't designed with the requirements of scalable high-reliability systems in mind. They were optimized for convenience and expedience, which is a perfectly reasonable objective. Most people don't need an identifier system engineered for extreme reliability, even though there is relatively little cost to having one.
How does a high-reliability system have a broken /dev/random? You're better off fixing it rather than trying to fix every downstream component that uses it. You can put your AES-128 counter there if you can count reliably.
> leak state (e.g. UUIDv7)
But according to PostgreSQL, UUIDv7 provides better performance in the database, so is this essentially a trade off between security and speed?
Yes, because UUIDv7 gives up some random bits in order to include the timestamp, which is done in a way that makes UUIDv7s quick to sort by timestamp.
4 replies →
The latest UUID (7?) uses half random gen, half timestamp. This not only makes it sortable by creation time, but would also make a collision like this impossible.
It's still possible in most implementations of UUIDv7.
UUIDv7 assigns the first 48 bits to the timestamp in milliseconds. You can generate a lot of UUIDs in a millisecond, though!
Then you have another 12 bits that you can use as you wish: "rand_a". The spec suggests a few ways to use these bits, including 12 bits of random data, a sub-millisecond timestamp fraction, or a monotonic counter, but each has its downsides:
- Purely random data means you can still run into collisions, and anything within the same millisecond is unordered
- Sub-millisecond timestamps can still collide; nothing stops you from generating two UUIDs with the same 62 bits of rand_b data in the same sub-millisecond timestamp
- Monotonic counters can overflow before the next tick, then what? Roll over? Once you roll over it's no longer monotonic, and you can generate the same random data within the same monotonic cycle. Also, the counter is only monotonic on the system generating the UUID. In a distributed system where each node keeps its own monotonic cycle, you'll generate UUIDs with the same timestamp + monotonic counter, and again you're relying on not generating the same random data
You can steal some of the 62 bits in rand_b as well: use rand_a for sub-millisecond accuracy, then use a few bits of rand_b for a monotonic counter. There's still a chance of collision here, but it's exceedingly low, at the expense of less truly random data at the end.
If you want truly collision-free IDs, you'd also need to assign a couple of bits to identify the subsystem generating the UUID so that the monotonic counter is unique to that subsystem. You lose the cross-subsystem ordering of the monotonic counter that way, but I guess you could argue that in nearly 100% of cases the accuracy of sub-millisecond ordering in a distributed system is a lie anyway.
I think by the time you're building a system that needs to generate (and persist!) billions of identifiers per millisecond, you're solidly past the point where all your design decisions need to be vetted for whether they make sense on your extremely exotic setup.
1 reply →
We have a dedicated snowflake-ID generator service that returns IDs in batches. It's also distributed: each service adds its own instance number to the ID. When the counter overflows, it just blocks until the next ms. For our traffic it's never a bottleneck.
1 reply →
Considering the context I think it's worth pointing out that it's technically not impossible - it's just even less likely.
Everything in crypto is always a probability - never a certainty
True, but it makes the specific collision the post observed completely impossible.
7 replies →
The spec doesn't require the use of actually random numbers though.
UUIDv7 is arguably better, because it is entropy plus time.
It is what I usually use for its sorting, but some people don't want to leak time info.
Entropy is not a requirement in the UUID spec.
Sequences, generally.
That depends on your definition of high availability. If high availability includes distributed writers, (global) sequences are not the best solution, because generating unique sequence values requires synchronisation between all writers. In those cases you might need to explicitly partition the ID space so that individual writers are guaranteed not to get in each other's hair.
That is merely a sequence generation strategy.