← Back to context

Comment by benmmurphy

2 days ago

Generating 16 random bytes is simpler than generating a random UUID

It basically makes no odds, unless you consider applying a constant AND and constant OR operator complicated - as UUID v4 is just 122 random bits and 6 bits fixed.

UUID v7 is a 48 bit timestamp, 74 random bits and 6 bits fixed. Sure, this is a little more complicated, but it's often worth it for many applications because it can be sorted, so keys will be approximately monotonically increasing.

  • I think UUIDv7 could make sense but I suspect the recommendations in the spec predate UUIDv7. Also, if you want sorted schemes then there are slightly more efficient schemes than UUIDv7. With UUID you are always sacrificing some bits to distinguish between the UUID types which I guess does not really matter in practice but it seems unnecessary.

    • Yeah, in practice the 6 bits you lose aren't important. Redoing his calculations with 122 bits and a quadrillion generated IDs (that's a million billion), the probability of a collision is 9.4 x 10^-8 (or one in 10 million) using UUID v4.

      In my opinion, UUID v7 is useful because you per millisecond, you still have 74 bits split between user defined (up to 12) or randomness (minimum 62). If you choose the minimum 64 bits randomness, you can read the numbers straight from the article - 1 million UUIDs per millisecond with less than one in a million chance of collision, but you still have 10 bits to add additional data, such as which machine generated it.

      If you stick with just time and have the full 74 bits of randomness, you can generate a trillion (10^12) UUIDs per millisecond with less than one in 40 billion chance of collision (2.6 x 10^-11) using UUID v7.

      I think the fact the formula is (k^2/2N) actually shows that having a time component makes better use of the bits than a purely randomised space. In this example, we have a lower chance of collision with a trillion (10^12) UUIDs generated per millisecond than a quadrillion (10^15) UUIDs across all time.

    • BTW I realised I didn't address why those bits are necessary - actually, while it might seem you are increasing randomness with more bits and so reducing the risk of collisions, that's not necessarily true.

      The old schemes generated numbers that weren't uniformly distributed across the 128-bit space as they were intentionally biased in certain ways, such as time [0] and MAC addresses [1]. This means that most of the IDs generated in previous schemes would have many bits in common, and so the UUIDs that had been generated were not uniformly distributed across that 128-bit space [2] and so if you just used the whole 128-bits for random data, but didn't use those extra bits to avoid conflicts with the previous schemes, then random IDs that happened to be valid in the previous schemes would be more likely to collide.

      Of course, this only matters if the properties of globally unique matter to you. For a closed system with a guaranteed scope, sure who cares? But given that the extra randomness doesn't add any useful value beyond a certain threshold, you might as well use a UUID because you don't know what that identifier might end up being used for in the future, plus you can use off-the-shelf systems to generate them.

      [0] Ironically, future proofed time fields with many bits are more likely to be non-linearly distributed - e.g. the original version 0 UUID supported timestamps from 1582AD to 5236AD but was only used from 1987 for around a decade.

      [1] With certain manufacturers of network cards massively more popular than others, their MAC address prefixes showed up significantly more frequently, and there were privacy concerns were you could correlate between UUIDs generated on a single machine, and sometimes infer machines that might be on the same network because they had similar MAC addresses and so the cards were probably all from the same manufacturing batch.

      [2] Which is fine within the scope of UUIDs as they are still very likely to be globally unique, so it doesn't really matter if bits are wasted in this scheme

And there’s a good reason for that, because UUIDs have additional properties. I don’t know if versioning, partial ordering, or stable references are useful for traces or not, but with UUIDs those could’ve been a possibility.