
Comment by dragontamer

5 days ago

> No. A simple counter example: a single ADD will be faster than a lookup table on nearly anything.

Note that a round of AES is now one aesenc instruction on modern systems.

You might be surprised how much faster computation is than memory lookups. Modern AMD Zen5 cores have 8 instruction pipelines but only 3 load/store pipelines.

You have more AVX512 throughput on modern Zen5 cores (4x Vector pipelines) than L1 throughput.

I'd go as far as to say that table lookups are the worst they've ever been relative to compute speed. The reason modern encryption/hashing got so fast is that XChaCha and SHA3 are built from plain register operations (add/xor/rotate for XChaCha, xor/and/rotate for SHA3) rather than being lookup-based (S-box based, like AES or DES).
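To make the add/xor/rotate point concrete, here is a minimal sketch of the ChaCha quarter round as specified in RFC 8439, section 2.1. The function names are my own, and this is only the inner mixing step, not a full cipher. Note there is not a single memory lookup in it.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits. Requires 0 < n < 32 so that neither
   shift is by the full word width. Compilers typically emit one rotate. */
static inline uint32_t rotl32(uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* ChaCha quarter round (RFC 8439, section 2.1): only add, xor and rotate.
   All state stays in registers; there are no data-dependent loads. */
static void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                 uint32_t *c, uint32_t *d) {
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

That is 12 simple ALU operations per quarter round, and four independent quarter rounds can run side by side in vector lanes.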

Tables are still appropriate for some operations, but really prefer calculations if at all possible. Doubly so if you are entering GPU code, where you get another order of magnitude more compute without much improvement in memory bandwidth.

Oh, if you need the best of both worlds, consider pshufb (a 16-entry byte table indexed by 4 bits), or if you have access to AVX512 you could use vpermi2b as an effective 7-bit (128-entry) lookup table.

These aren't quite full memory lookup tables, but they give you lookup-like behavior using only the vector units (128-bit or 512-bit registers).
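A rough sketch of the pshufb trick, assuming an x86 build with SSSE3 enabled (e.g. `-mssse3`). The helper names `lookup16` and `hex16` are made up for illustration, and the scalar branch is only there so the code still builds elsewhere. It does 16 lookups into a 16-entry table with one `_mm_shuffle_epi8`:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8, the pshufb intrinsic */
#endif

/* Look up 16 indices in a 16-entry byte table at once.
   Hypothetical helper; indices are expected to be in 0..15. */
static void lookup16(const uint8_t table[16], const uint8_t idx[16],
                     uint8_t out[16]) {
#if defined(__SSSE3__)
    __m128i t = _mm_loadu_si128((const __m128i *)table);
    __m128i i = _mm_loadu_si128((const __m128i *)idx);
    /* The table lives in a register; each index byte selects one entry. */
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(t, i));
#else
    /* Scalar fallback mirroring pshufb: high bit set yields 0,
       otherwise the low 4 bits pick the table entry. */
    for (int k = 0; k < 16; k++)
        out[k] = (idx[k] & 0x80) ? 0 : table[idx[k] & 0x0F];
#endif
}

/* Example use: turn 8 bytes into 16 lowercase hex digits. */
static void hex16(const uint8_t in[8], char out[17]) {
    static const uint8_t digits[16] = {'0','1','2','3','4','5','6','7',
                                       '8','9','a','b','c','d','e','f'};
    uint8_t nibbles[16], chars[16];
    assert(in != NULL && out != NULL);
    for (int k = 0; k < 8; k++) {
        nibbles[2 * k]     = in[k] >> 4;    /* high nibble first */
        nibbles[2 * k + 1] = in[k] & 0x0F;  /* then low nibble */
    }
    lookup16(digits, nibbles, chars);
    memcpy(out, chars, 16);
    out[16] = '\0';
}
```

In real code you'd keep the table in a register across a whole loop and split the nibbles with vector shifts and masks too, so the only memory traffic is the input and output streams.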