Comment by dragontamer

3 years ago

A modern GPU with 5120 shaders at 2 GHz can perform 2^32 ops in just 0.4 milliseconds (AMD's RX 6900 XT. Yes, sub-millisecond times).

The truth of today's computers is that the 32-bit space is a sub-second problem, possibly faster than human reaction speed.

Test the whole 32-bit space. It's cheap today.

-------

Extrapolating a bit: the 2^48 space is 27 seconds at one op per input. Maybe 100 ops are needed for your test code, so the 48-bit space (aka: a 3x 16-bit input space) can be exhausted and tested in roughly 50 minutes.

And a 64b sweep is ~21 days. As a HW guy who tests fp units... just saying.

The real issue is that the verification code is much, much, slower.

  • > And a 64b sweep is ~21 days

    Per GPU. A consumer GPU at that (A100 is faster).

    4x GPUs (the AMD 6900 XT) per node and maybe 10x nodes in a rack (4U per node, 42U cabinet) can do 64-bit sweep in half a day.

    > The real issue is that the verification code is much, much, slower.

    Isn't verification of 64b multipliers actually a 128-bit problem? (x * y == z: two 64-bit inputs and a 128-bit product to check.)

    Also, I think they use BDDs. Multiplication is exponential for BDDs, but the exponent is still small enough that 128-bit BDDs fit on today's computers. So it's not a brute-force thing like discussed in this topic, but actually a very elegant data structure (though with tons of complications, as is the nature of NP-complete and EXPSPACE problems).

    • It’s a lot easier to do formal verification… which is a fancy BDD, in many ways.

Even modern CPUs can explore the 32-bit space extremely quickly. A 128-core CPU running at 4 GHz can do 2^32 ops in about 1/128 of a second. Pretty amazing. The improvement in the nine years since I wrote the article is fun to see.

  • Ah, but you've missed two key advantages of CPUs:

    1. Superscalar (multiple instructions in one clock tick)

    2. SIMD (not as good as GPU, but still substantial)

    Ryzen 9 7950X (Zen 4 cores, 16 of them) has AVX-512. It's only 256 bits wide, but there are 4 pipelines (!!). (Actually 8 pipelines, but only 4 of them are SIMD. It's complicated; let's just assume 4.)

    So each pipeline can handle 8x 32-bit integers per clock, and up to 4 pipelines can be active at once. I do believe Zen 4 allows all 4 pipelines to perform the classic MAD / multiply-and-accumulate instruction (very important for padding those matrix-multiplication TFLOPS numbers), though how much you want to count this is... rather difficult, because the 4 pipelines don't all have the same capabilities.

    -----------

    I can't say I've tested this out myself. But the Ryzen 9 7950X has a theoretical throughput of 16 cores x 8 ops per pipeline x 4 pipelines == 512x 32-bit ops per clock tick, and is therefore 4x faster than your "128-core" estimate.

    Running this code in practice is difficult... but similar to GPU programming. Intel's ispc is a good tool for generating optimized AVX-512 in the "GPU programming style". I know that OpenMP has a SIMD statement, and the autovectorizers in GCC/Clang/LLVM are getting better.

    A lot of options, all difficult but "valid"... at least valid for a simple, brute-force, embarrassingly parallel problem like the one discussed here. Things get rather complicated rather quickly in this space once we talk about practical applications...

    ------------

    Note: achieving this in practice requires some very careful coding and magic. You'll need to fit inside the micro-op cache, keep the pipelines fully fed, etc. etc.