Comment by harshreality

4 years ago

A naive matrix-multiply implementation can't get anywhere close to numpy or Julia, both of which call into BLAS and automatically parallelize across cores.

  % python matrix.py
  Timing 10 squares of a random 10000 x 10000 matrix
  97.3976636590669 seconds
  python matrix.py  364.41s user 8.10s system 379% cpu 1:38.25 total
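
matrix.py itself isn't reproduced here. As a rough sketch of a script that would print output in that shape, assuming each "square" is one dense multiply of the matrix by itself via numpy's `@` operator (the name `time_squares` and its parameters are hypothetical, not taken from the original script):

```python
import time

import numpy as np


def time_squares(n, reps, seed=0):
    """Time `reps` squarings of a random n x n float64 matrix.

    Hypothetical helper: returns (elapsed_seconds, last_product).
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    start = time.perf_counter()
    for _ in range(reps):
        # `@` on 2-D float arrays is handed to whatever BLAS numpy was built against
        b = a @ a
    elapsed = time.perf_counter() - start
    return elapsed, b


# Small demo size so this finishes instantly; the run above used n=10000, reps=10.
n, reps = 200, 10
print(f"Timing {reps} squares of a random {n} x {n} matrix")
elapsed, _ = time_squares(n, reps)
print(f"{elapsed} seconds")
```

Calling `time_squares(10000, 10)` would match the workload timed above, though the numbers will depend on the machine and on which BLAS numpy is linked to.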

Julia has more startup overhead, and the first multiply triggers code compilation, so there's an additional warm-up square outside of the timing loop. Its "warm" performance is equivalent to numpy, though, and turning on extra optimizations (-O3) can even make it a couple of seconds faster than numpy once warmed up.

  % julia matrix.jl
  Timing 10 squares of a random 10000 x 10000 matrix
   97.787679 seconds (31 allocations: 7.451 GiB, 0.33% gc time)
  julia matrix.jl  405.34s user 8.13s system 375% cpu 1:50.09 total

If you're going to wait for that C implementation, or the other comment's K implementation, to finish that loop, you'll want a book.
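
For a sense of scale, here's a sketch of the kind of textbook triple loop a naive implementation boils down to, written in plain Python for illustration only (it is not the C or K code from the other comments, and `naive_matmul` and `multiply_flops` are hypothetical names):

```python
def naive_matmul(a, b):
    """Textbook triple loop over lists of lists: no blocking, SIMD, or threads."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                out[i][j] += aik * b[k][j]
    return out


def multiply_flops(n):
    """Approximate floating-point operation count for one dense n x n multiply."""
    return 2 * n**3
```

By that count one 10000 x 10000 square is about 2 x 10^12 floating-point operations, so the ten-square loop is about 2 x 10^13. A single-threaded loop with no cache blocking has to grind through all of that on one core.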