Comment by nayuki

12 hours ago

I skimmed the article and agree with it at a high level, though I haven't faced most of those problems personally.

I have several gripes with NumPy, or more broadly the notion of using Python to call a C/asm library that vectorizes math operations. A lot of people speak of NumPy as the solution to all your high-performance math needs in Python, which I think is disingenuous. The more custom logic you do, the less suitable the tools get. Pure Python numeric code is incredibly slow - roughly 1/30 the speed of Java - and as you find parts that can't be vectorized, you have to drop back down to pure Python.

I would like to give the simple example of the sieve of Eratosthenes:

  def sieve_primeness(limit):
      result = [False] + [True] * limit
      result[1] = False  # 1 is not prime
      for i in range(2, len(result)):
          if result[i]:
              for j in range(i * i, len(result), i):
                  result[j] = False
      return result

This code is scalar, and porting it to a language like C/C++/Rust/Java gives decent performance straight out of the box. In pure Python it runs at about 1/30 of that speed, which is not acceptable.

People pretend that you can hand-wave the performance problem away by using NumPy. Please tell me how to vectorize this Python code. Go on, I'm waiting.

You can't vectorize the `if result[i]` check, because it controls whether the inner for-loop executes at all, so it must run in pure Python. The inner loop itself can be vectorized by creating a huge temporary mask array and ANDing it with the result array, but that is extremely memory-inefficient compared to flipping bits of the result array in place, and probably messes up the cache too.
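To make that concrete, here's a sketch of that mask-based variant (`sieve_with_masks` is my own illustrative name, not anything standard); it produces correct results but allocates a full-length temporary array per prime found:

```python
import numpy as np

def sieve_with_masks(limit):
    # Index k is True iff k is prime; 0 and 1 are not prime.
    result = np.ones(limit + 1, dtype=bool)
    result[:2] = False
    for i in range(2, limit + 1):
        if result[i]:  # this check still runs in pure Python
            # Full-length temporary mask of the non-multiples of i:
            # one big allocation and one big AND per prime found.
            mask = np.arange(limit + 1) % i != 0
            mask[i] = True  # keep the prime i itself
            result &= mask
    return result
```

Each iteration touches the whole array instead of every i-th element, which is exactly the memory-traffic problem described above.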

Alternatively, you can run the code in PyPy, but that usually gives a speed-up of maybe 3×, which is nowhere near enough to recover the 30× speed penalty.

Another example is that NumPy is not going to help you implement your own bigint library, because that also requires a lot of custom logic that executes between loop iterations. Meanwhile, I've implemented bigint in pure Java with acceptable performance and without issues.

Again, my point is that anytime you venture into custom numerical/logical code that is not the simple `C = A + B`, you enter a world of pain when it comes to Python performance or vectorization.

> Please tell me how to vectorize this Python code. Go on, I'm waiting.

Here's a start:

    import numpy as np

    def sieve(limit):
        result = np.ones(limit + 2, dtype=bool)
        result[0] = False
        result[1] = False
        for i in range(2, limit + 1):
            if result[i]:
                yield i
                result[::i] = False

    for prime in sieve(100):
        print(prime)

As you said, you can't vectorize the outer loop because each iteration depends on the result of previous iterations, so I think you can't do too much better than that with this algorithm. (There are a few easy algorithm optimizations, but I think that's kind of orthogonal to your point, and anyway you can do them in any language.)
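For what it's worth, those easy optimizations (sieving only up to √limit, and starting each pass at i², since smaller multiples were already marked by a smaller factor) might look like this; `sieve_fast` is my own name for the sketch:

```python
import numpy as np

def sieve_fast(limit):
    result = np.ones(limit + 1, dtype=bool)
    result[:2] = False  # 0 and 1 are not prime
    # Any composite <= limit has a prime factor <= sqrt(limit), so the
    # loop can stop early, and i's first unmarked multiple is i * i.
    for i in range(2, int(limit ** 0.5) + 1):
        if result[i]:
            result[i * i::i] = False
    return np.nonzero(result)[0]
```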

I would expect there's still a significant performance gap between that and, say, a simple C implementation, but that shouldn't be surprising, and personally I haven't encountered these supposed droves of people claiming that NumPy fully solves the performance gap. From what I've seen there was a general consensus that vectorizing with NumPy can significantly tighten the gap but can rarely fully close it.

I'm not a NumPy expert, just an occasional user, but this works and eliminates the explicit inner loop (though not the outer one). It collects the list of prime numbers while removing multiples of each prime from the NumPy array of numbers:

  import numpy as np

  def sieve_primes(limit):
    nums = np.arange(limit + 1)
    nums[1] = 0
    primes = []
    for i in range(2, limit + 1):
      if nums[i] != 0:  # could just be `if nums[i]:` since 0 is falsy
        primes.append(i)
        nums = nums * (np.mod(nums, i) != 0)
    print(primes)

  sieve_primes(1000)

This takes advantage of the fact that True and False are treated as 1 and 0 in Python for the multiplication.

EDIT: The other solutions people have shown are closer to the sieve itself than my solution. I was having trouble with the slicing notation; now that I've seen how it should be done, I'd redo the above as:

  def sieve_primes(limit):
    nums = np.arange(limit + 1)
    nums[1] = 0
    for i in range(2, limit + 1):
      if nums[i] != 0:
        nums[i + i::i] = 0
    print(np.nonzero(nums))

However, the first version is faster by my testing so I'd end up just going back to it if I really cared about performance.

Have you heard of JIT libraries like numba (https://github.com/numba/numba)? It doesn't work for all Python code, but it can be helpful for the type of function you gave as an example. There's no need to rewrite anything; just add a decorator to the function. I don't really know how its performance compares to C, for example.

> Please tell me how to vectorize this Python code. Go on, I'm waiting.

You can't entirely get rid of the loops, but at least the inner one can be vectorized, I think. That would reduce the "pythonic runtime" from squared to linear.

Maybe something like this?

  result = np.full(limit + 1, True, dtype=bool)
  result[:2] = False
  indices = np.arange(len(result))
  for i in range(2, len(result)):
    if result[i]:
      multiples = indices[i:] * i
      result[multiples[multiples <= limit]] = False

This is just out of my head, so it may contain bugs. Note that NumPy raises an IndexError for out-of-bounds integer-array indices (slices are clipped, but index arrays are not), so the computed multiples have to be clamped before use.

Ironically enough, I wrote a Sieve in NumPy 12 years ago or so in grad school messing around with Project Euler. I haven't touched it since and have not regularly used either NumPy or even Python in quite some time, but I still have the solutions in a private Github repo. The code is:

  def primes(n):
      """Input: A natural number n.
         Output: An array of primes, p < n."""
      assert n >= 2
      # Odd-only sieve: sieve[i] represents the odd number 2*i + 1.
      sieve = np.ones(n // 2, dtype=bool)
      for i in range(3, int(n ** 0.5) + 1, 2):
          if sieve[i // 2]:
              sieve[i * i // 2::i] = False
      return np.r_[2, 2 * np.nonzero(sieve)[0][1::] + 1]

It still relies on looping and I have no idea if you can fully vectorize everything, but it does at least get quite a bit of speedup thanks to the ability to index one array using another, so I can avoid the inner loop you have to use in pure Python.

I have absolutely no clue what that final line is doing and I'm not sure I understood it 12 years ago, either.

  • sieve[i] is nonzero precisely when 2i+1 is an odd prime or i=0 (because it's initialized to all-1 and nothing ever zeros the first element).

    So np.nonzero(sieve) is all the indices for which i=0 or 2i+1 is an odd prime.

    Except that the behaviour of np.nonzero is a bit weird: it returns a tuple of arrays, one for each axis. (So, e.g., if you feed it a 2d array it gives you (a,b) such that the nonzero elements are at positions (a[0],b[0]), (a[1],b[1]), etc.) That's kinda reasonable behaviour, but it means that for the simple case of a 1-d array you get a 1-element tuple.

    So np.nonzero(sieve)[0] is actually all the indices i for which i=0 or 2i+1 is an odd prime, and now you know why I didn't write "all the indices i for which ..." above.

    The case i=0 would correspond to 1 being prime, which it isn't considered to be. So we throw that one out.

    So np.nonzero(sieve)[0][1::] is all the indices i for which 2i+1 is an odd prime.

    And so 2*np.nonzero(sieve)[0][1::]+1 is all the 2i+1 for which 2i+1 is an odd prime; that is, all the odd primes.

    I don't think I've encountered np.r_ before, but it's pretty obvious what it's doing: prepending 2 to that array, so now it's "2 followed by all the odd primes" or, more simply, all the primes.

    (Obviously, in the above phrases like "all the primes" mean "all the primes up to the limit defined by the size of the array".)
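For what it's worth, the tuple-and-`np.r_` behaviour is easy to check on a toy array (the array below is my own hand-built odd-only sieve for n = 10):

```python
import numpy as np

# Index i stands for the odd number 2*i + 1, so the entries cover
# 1, 3, 5, 7, 9; only 9 = 3*3 is composite.
sieve = np.array([True, True, True, True, False])

idx = np.nonzero(sieve)            # a 1-element tuple for a 1-d array
odd_primes = 2 * idx[0][1:] + 1    # drop i = 0 (the number 1): 3, 5, 7
primes = np.r_[2, odd_primes]      # prepend 2: all the primes below 10
```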