Comment by dspillett

2 years ago

> Use regular bigserial (64bit) PKs for internal table relations and UUIDs (128bit) for application-level identifiers and natural keys.

If you are worried about size ballooning, rather than the randomness, make sure you actually need more than 32-bit standard integers (well, 31-bit as they tend to be signed and we usually start from 0 or 1 not -2,147,483,648).

Avoiding roll-over issues is sometimes a valid concern sometimes with 32-bit INTs, but I've seen people specify 64-bit surrogate keys when their data is not likely to grow to where 31 would be an issue in the next few hundred years… There is an argument that 64-bit values are more efficient in modern CPUs, but I've not seen any good tests that show a measurable difference in the context of DB keys where there are other overheads to consider due to the doubling in size.

2 comments

dspillett

paperplatter 2 years ago

It's not just about the data size. There are reasons your sequence can increment without a row actually being inserted.

dspillett 2 years ago

Yep.
# Transactions that are explicitly rolled back, which have inserted to a table with an IDENTITY (or equivalent) column, almost always result in such a gap. Though if this is happening enough that it pushes the count up by an order of magnitude over time, you probably have a design problem…
# To reduce locking issues on the counter many DBs hold a cache of values (SQL Server allocates 1,000 at a time) in the persisted count so after a shutdown, especially an unclean shutdown, or a fail-over if using one of the clustering/mirroring/options options for HA, it is possible that most of these will be skipped. If this is happening enough to be a long-term problem than you have significant infrastructure issues. Such a cache could be per-thread/-process (IIRC it isn't in SQL Server) which is one of the reasons that IDENTITY column values can seem to be out-of-order in high concurrency environments (though you should never care about the actual order of values in such columns so that is a non-issue).
# In some DR processes where old DBs are restored, the values may be artificially updated after restore if used values might have been exposed to an external service (to avoid confusion caused is a number is used twice from the point of view of that external service). This is an extra argument against exposing internal ID values to external systems and in favour of having an additional UUID based identifier for such purposes.
# Also, though this is the most obvious one and hardly needs stating, deleted rows won't “return” their now unused number to be reused. You should scale the type based on rows expected to be inserted, not rows expected to be kept long-term.
If you are expecting hundreds of millions of rows in the lifetime of the table then a type larger than 32-bit is worth considering despite the extra storage needed in data and index pages. I'd not criticise considering it for tens of millions if you want to be more paranoid. Otherwise: 32-bit is very unlikely to not be fine, and 64-bit overkill.