Comment by unnouinceput

4 years ago

Any performance benchmarks between a simple C implementation of matrix multiplication like explained in the article and K one?

Story time: 2 years ago, right before 1st lockdown, I landed a client. A data scientist who had already implemented an algorithm which dealt with matrices, implemented by a previous programmer he hired. Said implementation was in Python, no more than 10 line altogether, which performed well when matrix size where small, like 10x10. But the problem was, his real work need it to have matrices of size 10^6 x 10^6. Not only the Python implementation had to be ran on a beast of server with 4TB of memory, it also took 3 days to finish. And while the algorithm was small in size in Python, its explaining paper was 4 pages in total, which took me 1 week to understand. And then an entire month to implement in C. But in the end, when all was said and done, the run time when used with real data was only 20 minutes and consumed only 8 GB of memory, though it did required at least 16 virtual processors.

Hence my question, in the end performance is what it matters, not number of lines.

K timing, squaring a 100x100 matrix in ngn/k:

   a:100 0N#?10000    / 100x100 double matrix
   \t:1000 a(+/*)\:a  / Total of 1k runs in ms
  1514

Following C program outputs 272 when compiled and run with -O3 -march=native.

  #include <stdio.h>
  #include <time.h>
  size_t monoclock(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return 1000000000*ts.tv_sec + ts.tv_nsec;
  }
  
  int main() {
    const size_t n = 100;
    double a[n][n], r[n][n];
    size_t t = monoclock();
    for (size_t iter=0; iter<1000; iter++) {
      for (size_t i=0; i<n; i++) {
        for (size_t j=0; j<n; j++) {
          double sum = 0;
          for (size_t k=0; k<n; k++) sum += a[i][k]*a[k][j];
          r[i][j] = sum;
        }
      }
    }
    printf("%lld\n", (monoclock() - t)/1000000);
  }

So K is slower, by a factor of 5 rather than the 200 you saw with Python. K appears to scale to larger matrices better than C: 8664 in K vs 2588 in C for 200x200, but I can't be bothered to sort out heap allocation for the C code to do larger sizes. I would certainly not say that your story supports the idea that performance is what matters. In the month you took to implement the C version, the Python solution could have run ten times over, and there are many computations that only ever need to be performed once. Besides, I bet you were glad to have working Python code to use as a reference!

  • I prefer "middle of the road" languages that are high-level AND readable AND have decent performance optimisation for bulk operations. Python with C libraries suffices for a lot of people, Julia similarly is getting popular.

    Even the older Mathematica language blows both K and C out of the water for readability and performance:

        m = Table[RandomReal[], 100, 100];
        t = RepeatedTiming[MatrixPower[m, 2]];
        First[t]*1000*1000
    
        18.0245 
    

    You would have to know literally zero about Mathematica's Wolfram Language to be able to read that clearly! From a standing start you could understand what another person has created. For K, you'd have to have memorized the K-specific syntax. For C, if you hadn't seen the standard patterns for matrix multiplication you'd have to read the code carefully. A lot of it is just noise, like the verbose for loop syntax.

    Oh, and Mathematica's matrix power function is:

    - Parallel! If I use a 10K x 10K matrix as an input, it uses about 75% CPU on my 8-core laptop. It can complete a single multiplication in 5.3 seconds. For laughs, try that with either K or C and see what you get...

    - Extends to negative or fraction powers.

    - Has optimisations for applying the matrix power directly to a vector.

    - Is extensively documented, unlike the terse K snippet or the hand-rolled C code: https://reference.wolfram.com/language/ref/MatrixPower.html....

    Essentially what I'm trying to say is that terseness is an anti-pattern, and doesn't even begin to approach the utility of a well designed high-level language intended for teams of collaborating humans.

  • When you compile the C example with the compiler option "-Wall", you'll get a warning that the variable 'r' is set but not used, so the compiler would be free to simply skip the for loops. In fact, if you compile with clang instead of gcc, the compiler will do just that and you'll get almost zero computation time.

    It would be better to do something with the computed result so the compiler does not remove the computation, e.g. print a randomly selected value.

    I also benchmarked the fixed C code against Python and "Python" (which uses OpenBLAS under the hood in my case) was 10 times faster:

        import time
        import numpy as np
    
        a = np.random.rand(100, 100)
        t = time.perf_counter()
    
        for _ in range(1000): a @ a
    
        print(time.perf_counter() - t, "seconds")
    

    Implementation matters a lot for matrix multiplication.

    • Yes, definitely a quick and dirty benchmark (I did test after I posted to see if initializing a does anything; it didn't). Timings for J below, since I think it's the most focused on linear algebra. The remarkable thing about the K code from the article is that it's all made from totally general-purpose pieces that have nothing to do with matrix products, and K interpreters don't have any clue what a matrix product is. In J the matrix product is written +/ .* with the generalized dot product operator . (which does need the preceding space, oof) and handled by specialized code. Given that, I found this measurement a little disappointing: about as fast as my C code in the 100x100 case and slightly faster in the 200x200 case.

           a =: ?100 100$0
           <.1e6 * 1000 (6!:2) '+/ .*~ a'
        269
           a =: ?200 200$0
           <.1e6 * 1000 (6!:2) '+/ .*~ a'
        1796

  • Naive implementations of stock matrix math can't get anywhere close to numpy or julia, which both use BLAS and automatically parallelize across cores.

      % python matrix.py
      Timing 10 squares of a random 10000 x 10000 matrix
      97.3976636590669 seconds
      python matrix.py  364.41s user 8.10s system 379% cpu 1:38.25 total
    

    julia has more overhead, and the first multiply triggers code compilation so there's an additional warm-up square outside of the timing loop, but its "warm" performance is equivalent to numpy. Turning on extra optimizations (-O3) can even make it a couple seconds faster than numpy once warmed up.

      % julia matrix.jl
      Timing 10 squares of a random 10000 x 10000 matrix
       97.787679 seconds (31 allocations: 7.451 GiB, 0.33% gc time)
      julia matrix.jl  405.34s user 8.13s system 375% cpu 1:50.09 total
    

    If you're going to wait for that C implementation, or the other comment's K implementation, to finish that loop, you'll want a book.

  • Nice benchmarking, thank you for your effort.

    Also the scientist was leading a team, so my program would've been used by at least 20 people, 10 times per day. That was why Python one was a no go for them from beginning.