Comment by adrian_b

4 days ago

If you really want to encrypt and decrypt 32-bit numbers without having any nonces available, the fastest way on non-microcontroller CPUs remains using the AES instructions.

You can exploit the fact that the core of AES consists of 32-bit invertible mixing functions. In order to extend AES to 128-bit, a byte permutation is used, which mixes the bytes of the 32-bit words.

The AES instructions are such, that you can cancel the byte permutation. In this case, you can use the AES instructions to encrypt separately four 32-bit words, instead of one 128-bit block.

Similarly by canceling the standard byte permutation and replacing it with separate permutations on the 2 halves, you can make the AES instructions independently encrypt two 64-bit words.

These AES modifications remain faster than any software cipher.

How to cancel the internal permutation and replace it with external shuffle instructions was already described in the Intel white paper published in 2010, at the launch of Westmere, the first CPU with AES instructions.

5 comments

adrian_b

201984 4 days ago

Are you certain using AES is still faster? Let's say for a 32-bit block size and 64-bit key.

From https://en.wikipedia.org/wiki/Speck_(cipher), that Speck combination would use 22 rounds, and using the instruction timings for Zen 5 from https://instlatx64.github.io/InstLatx64/AuthenticAMD/Authent..., it looks like each round would take at most 3 cycles. (Dependency chain for each round is 3 instructions long, ror+add+xor). 22*3 = ~66 cycles.

Using AES with a pshufb to take out the ShiftRows step would be 2 cycles for the pshufb and 4 cycles for each aesenc, and at 10 rounds, you have ~60 cycles.

It's quite close, and to say which one wins, we'd need to actually benchmark it. One is not clearly much faster than the other.

adrian_b 3 days ago
Standard AES-128 has a throughput of around 16 bytes per 8 clock cycles or even less in recent CPUs, because they can do 2 or 4 AES instructions per clock cycle (in the modes of operation that are not limited by latency).
AES-128 can be easily modified to independently encrypting four 32-bit words per execution, instead of one 128-bit block, by cancelling the byte permutation that extends the AES mixing function from 32-bit to 128-bit. this would increase the throughput at least twice, depending on whether PSHUFB is done concurrently or not.
You have given the latencies of the instructions, not their throughput. When you use AES in such a way that you are limited by latency, that is normally wrong. The cryptographic libraries have multi-buffer functions, which compute e.g. 8 AES values, so that they are not limited by latencies.
Regarding the parent article, if you want an unpredictable identifier for a record, you should not do this by encrypting some value with the intent of decrypting it in the future. Instead of this, you should use as identifier an unpredictable random number. Such identifiers can be generated with AES in batches, at maximum throughput, and stored until they are needed for assignment to a record.
If you need in your record some information like time of creation or a monotonically increasing number, which you consider private, such information should be put in distinct fields, that you do not give externally, instead of attempting to encrypt them in a record identifier, which would need to be decrypted to access such information.
- 201984 3 days ago
  
  >You have given the latencies of the instructions, not their throughput. When you use AES in such a way that you are limited by latency, that is normally wrong.
  I did that because TFA is talking about encrypting 32 bit IDs, which is 1/4th of an AES block. There aren't multiple blocks to do at once in this scenario, and throughput numbers do not apply because each instruction depends on the result of the one before.
  You mention doing multiple IDs at once, but the overhead of pulling multiple IDs into a single batch from something akin to URLs in web requests is likely gonna be worse than any gains.
  >Instead of this, you should use as identifier an unpredictable random number. Such identifiers can be generated with AES in batches, at maximum throughput, and stored until they are needed for assignment to a record.
  Now you lose the ability to sort the records in a database, and I fail to see what AES gives you here over any other random number generator.
botusaurus 4 days ago
maybe the reason they are so close is that the AES microcode is inplementing exactly those operations
- 201984 4 days ago
  
  There's nothing similar about AES and Speck, and the "microcode" for AES isn't like what you're thinking of. If you want to learn more about it, you can look up the specifications for AES and Intel's AES instruction set.