Comment by bluGill
5 days ago
> Lookup tables are always faster than calculation - is that true
No. A simple counter example: a single ADD will be faster than a lookup table on nearly anything.
However I doubt that is what is meant. For complex calculations there are a lot of it depends and tradeoffs. A lookup table will often force you to think about trade offs because the table takes up a lot more memory and so you need to decide what values are important. A lookup table is also prone to bugs - back in the 1990s someone noticed that the Intel Pentium processor didn't give the right results for division - turns out they didn't enter a few values into the table correctly - if you write a table you could have the same bug.
Calculating sin() to as many decimal places as your highest precision floating point register allows will be slow, but that is likely what the sin built into your standard library does since you might be building a bridge that the person who wrote that sin function crosses latter. If you only need sin rounded to the nearest whole number a lookup table is probably faster. If you need sin to as precise as the computer can calculate that is a lot of RAM (x86 uses 80 bits internally for floating point numbers)
> No. A simple counter example: a single ADD will be faster than a lookup table on nearly anything.
Note that a round of AES is now one aesenc instruction on modern systems.
You might be surprised how much better code is than memory lookups. Modern AMD Zen5 cores have 8 instruction pipelines but only 3 load/store pipelines.
You have more AVX512 throughput on modern Zen5 cores (4x Vector pipelines) than L1 throughput.
I'd go as far out to say that table lookups are the worst they've ever been in terms of compute speed. The reason modern encryption/hashing got so fast is that XChaCha and SHA3 are add/for/rotate based rather than lookup-based (sbox based like AES or DES).
Tables are still appropriate for some operations, but really prefer calculations if at all possible. Doubly so if you are entering GPU code where you get another magnitude more compute without much memory bandwidth improvements.
Oh, if you need the best of both worlds, consider pshufb (4-bit lookup table), or if you have access to AVX512 you could use vpermi2b as an effective 7-bit lookup table.
It's not quite a full memory lookup table but these instructions get a lookup-like behavior but using the vector units (128-bit or 512-bit registers).