Comment by bastawhiz

1 day ago

1. The list of "scopes" is the object hierarchy that owns the resource. That lets you figure out which shard a resource should be in. You want all the resources for the same repository on the same shard; otherwise, if you simply hash the id, one shard going down takes down much of your service, since everything is spread more or less uniformly across shards.

2. The object identifier is at the end. That should be strictly increasing, so all the resources for the same scope are ordered in the DB. This is one of the benefits of UUIDv7.

3. The first element is almost certainly a version. If you do a migration like this, you don't want to rule out doing it again. If you're packing bits, it's nearly impossible to know what's in the data without an identifier, so without the version you might not be able to tell whether an id is new or old. (A rough sketch of this layout and scope-based shard selection follows this list.)
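
To make points 1-3 concrete, here's a minimal sketch of what decoding and scope-based shard selection could look like, assuming the layout really is [version, scopes..., object id]. The base64 decode is the part already discussed in the thread; the `:` separator, field names, and `parse_id`/`shard_for` helpers are my own illustrative guesses, not GitHub's actual format or code.

```python
import base64
import hashlib

def parse_id(opaque_id: str) -> dict:
    """Decode a hypothetical [version, *scopes, object_id] id.

    The base64url decoding is the only part discussed in the thread; the
    ':'-separated layout and the field names are assumptions for illustration.
    """
    padded = opaque_id + "=" * (-len(opaque_id) % 4)  # restore stripped padding
    version, *scopes, object_id = base64.urlsafe_b64decode(padded).decode().split(":")
    return {"version": version, "scopes": scopes, "object_id": object_id}

def shard_for(scopes: list[str], shard_count: int) -> int:
    # Hash the owning scope (e.g. the repository), not the object id, so every
    # resource belonging to one repository lands on the same shard.
    digest = hashlib.sha256(":".join(scopes).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```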

Another commenter mentioned that you should encrypt this data. Hard pass! Decrypting each id is decidedly slower than b64 decode. Moreover, if you're picking apart IDs, you're relying on an interface that was never made for you. There's nothing sensitive in there: you're just setting yourself up for a possible (probable?) world of pain in the future. GitHub doesn't have to stop you from shooting your foot off.

Moreover, encrypting the contents of the IDs makes them sort randomly. This is to be avoided: it means similar/related objects are not stored near each other, and you can't do simple range scans over your data.
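
To illustrate with the same hypothetical layout as the sketch above: when ids sort by scope, everything for one repository sits in one contiguous slice of the key space, which is exactly what a range scan needs. The keys below are made up.

```python
import bisect

# Hypothetical scope-prefixed ids, sorted the way a DB index would store them.
keys = sorted([
    "1:Repository100:Issue0001",
    "1:Repository100:Issue0002",
    "1:Repository200:Issue0001",
])

def scan_prefix(prefix: str) -> list[str]:
    # All keys sharing the prefix form one contiguous slice of the index.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return keys[lo:hi]

scan_prefix("1:Repository100:")   # -> both Repository100 issues, nothing else
# Encrypt the ids and they sort by ciphertext instead, so this locality is gone.
```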

You could decrypt the ids on the way in and store both the unencrypted and encrypted versions in the DB, but why? That's a lot of complexity, effort, and resources to stop randos on the Internet from relying on an internal, non-sensitive data format.

As for the old IDs that are still appearing, they are almost certainly:

1. Sharded by their own id (i.e., users are sharded by user id, not repo id), so you don't need additional information. Use something like rendezvous hashing to choose the shard (a sketch follows this list).

2. Sharded before the new id format was developed, and it's just not worth the trouble to change them now.
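
Rendezvous (highest-random-weight) hashing is simple enough to sketch; the shard names and key format here are made up for illustration.

```python
import hashlib

def rendezvous_shard(key: str, shards: list[str]) -> str:
    # Score every (shard, key) pair and pick the winner. When a shard is added
    # or removed, only the keys whose best score was on that shard move.
    def score(shard: str) -> int:
        h = hashlib.sha256(f"{shard}/{key}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(shards, key=score)

rendezvous_shard("User:12345", ["db-01", "db-02", "db-03"])  # stable, deterministic pick
```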

AES is faster than base64 on modern CPUs, especially for small messages.

  • AES would mean the encrypted parts of the id are ~28+ bytes. That's a long minimum identifier length.

    What you're suggesting is perhaps true in the sense that the throughput is higher, but AES decryption carries a fairly high fixed overhead. If you're in a language like Ruby (as GitHub is) or Python/Node, you're probably calling out to OpenSSL.

    I did try to do my due diligence and find data to support or refute your claim, but I wasn't able to find anything that addresses it directly. That said, I'm not able to find any sources that support the idea that AES decryption is faster than base64 decoding in any context (for small plaintext values or in general). With SIMD, base64 often decodes at around 0.2 CPU cycles per byte, while AES only manages 2.5-10.7 cycles per byte. The numbers for AES do get better as the plaintext size grows, though (a rough way to measure this is sketched below).

    Do you happen to have data to support your claim?
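
For what it's worth, here's roughly how I'd measure it for id-sized inputs in Python, using the third-party `cryptography` package (OpenSSL bindings). The payload is made up, and the absolute numbers will depend entirely on the CPU and the bindings; this is only a sketch of the measurement, not a claim about the result.

```python
import base64
import os
import timeit

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

payload = b"1:Repository100:Issue0001"          # an id-sized message
encoded = base64.urlsafe_b64encode(payload)

key, nonce = os.urandom(16), os.urandom(16)
ciphertext = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor().update(payload)

def b64_decode():
    base64.urlsafe_b64decode(encoded)

def aes_decrypt():
    # Rebuilding the cipher per call models decrypting each incoming id from
    # scratch, which is where the fixed setup overhead shows up.
    Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor().update(ciphertext)

n = 200_000
print("base64 decode: ", timeit.timeit(b64_decode, number=n))
print("AES-CTR decrypt:", timeit.timeit(aes_decrypt, number=n))
```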