Comment by harshreality
4 years ago
Naive implementations of matrix multiplication can't get anywhere close to numpy or Julia, which both call into BLAS and automatically parallelize across cores.
% python matrix.py
Timing 10 squares of a random 10000 x 10000 matrix
97.3976636590669 seconds
python matrix.py 364.41s user 8.10s system 379% cpu 1:38.25 total
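For reference, a minimal sketch of what a script like matrix.py could look like. The original script isn't shown, so the function name time_squares, the seed, and the choice to square the same matrix on every iteration are all assumptions; only the printed header mirrors the output above.

```python
import time

import numpy as np


def time_squares(n=10_000, reps=10, seed=0):
    """Time `reps` squarings of a random n x n float64 matrix.

    Hypothetical reconstruction: the name, seed, and loop shape are guesses,
    not the script that produced the timings quoted in the comment.
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))  # uniform values in [0, 1), dtype float64

    print(f"Timing {reps} squares of a random {n} x {n} matrix")
    start = time.perf_counter()
    for _ in range(reps):
        # For 2-D float64 arrays, numpy hands this product to the BLAS
        # library it was built against, which is typically multithreaded.
        b = a @ a
    elapsed = time.perf_counter() - start
    print(f"{elapsed} seconds")
    return elapsed, b
```

Calling time_squares() with the defaults needs roughly 1.6 GB for the two matrices. Using the usual 2n^3 operation count, ten squares at n = 10000 are about 2e13 floating-point operations, so 97.4 seconds works out to roughly 205 GFLOP/s.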
Julia has more overhead, and the first multiply triggers code compilation, so there's an additional warm-up square outside of the timing loop. Its "warm" performance is equivalent to numpy. Turning on extra optimizations (-O3) can even make it a couple of seconds faster than numpy once warmed up.
% julia matrix.jl
Timing 10 squares of a random 10000 x 10000 matrix
97.787679 seconds (31 allocations: 7.451 GiB, 0.33% gc time)
julia matrix.jl 405.34s user 8.13s system 375% cpu 1:50.09 total
If you're going to wait for that C implementation, or the other comment's K implementation, to finish that loop, you'll want a book.