Comment by zahlman

12 hours ago

TFA keeps repeating "you can't use loops", but aren't they, like, merely less performant? I understand that there are going to be people out there doing complex algorithms (perhaps part of an ML system) where that performance is crucial and you might as well not be using NumPy in the first place if you skip any opportunities to do things in The Clever NumPy Way. But say I'm just, like, processing a video frame by frame, by using TCNW on each frame and iterating over the time dimension; surely that won't matter?

Also: TIL you can apparently use multi-dimensional NumPy arrays as NumPy array indexers, and they don't just collapse into 1-dimensional iterables. I expected `A[:,i,j,:]` not to work, or to be the same as if `j` were just `(0, 1)`. But instead, it apparently causes transposition with the previous dimension... ?
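To make that concrete, a minimal sketch with made-up shapes (not from TFA); the shape comments are just what NumPy's advanced-indexing rules give back:

    import numpy as np

    # Toy 4-D array and two 2-D integer index arrays.
    A = np.arange(3 * 4 * 5 * 6).reshape(3, 4, 5, 6)
    i = np.array([[0, 1], [2, 0]])   # shape (2, 2)
    j = np.array([[0, 1], [1, 3]])   # shape (2, 2)

    # Adjacent advanced indices: i and j broadcast together, and their
    # (2, 2) shape replaces the two indexed axes in place.
    print(A[:, i, j, :].shape)   # (3, 2, 2, 6)

    # Advanced indices separated by a slice: the broadcast dimensions move
    # to the front of the result, which reads like a transposition.
    print(A[i, :, j, :].shape)   # (2, 2, 4, 6)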

You can draw out a sort of performance hierarchy, from fastest to slowest:

    * Optimized GPU code
    * CPU vectorized code
    * Static CPU unvectorized code
    * Dynamic CPU code

where the last one refers to the fact that a language like Python, in order to add two numbers together in its native, pure-Python mode, does a lot of boxing, unboxing, resolving of class types and checking for overrides, etc.

Each of those is at least an order of magnitude slower than the next one up the hierarchy, and most of them appreciably more than one. For a back-of-the-envelope understanding, you're probably closer if you think of each step as more like 1.5 orders of magnitude.

Using NumPy incorrectly can accidentally take you from the top one, all the way to the bottom one, in one fell swoop. That can be a big deal, real quick. Or real slow, as the case may be.

In more complicated scenarios, it matters how much computation is going how far down that hierarchy. If by "processing a video frame by frame" you mean something like "I wrote a for loop on the frames but all the math is still in NumPy", you've taken "iterating on frames" from the top to the bottom, but who cares, Python can iterate on even a million things plenty quickly, especially with everything else that is going on. If, by contrast, you mean that at some point you're iterating over each pixel in pure Python, you just fell all the way down that hierarchy for each pixel and you're in bigger trouble.
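A minimal sketch of those two situations, with made-up shapes and a toy per-frame operation:

    import numpy as np

    video = np.random.rand(300, 720, 1280)   # (frames, height, width), made-up sizes

    # "Loop over frames, math in NumPy": one interpreter iteration per frame,
    # the per-pixel work stays vectorized. Usually fine.
    out = np.empty_like(video)
    for t in range(video.shape[0]):
        out[t] = np.clip(video[t] * 1.2 - 0.1, 0.0, 1.0)

    # "Loop over pixels in pure Python": every pixel pays the full interpreter
    # cost -- this is the fall all the way down the hierarchy.
    slow = np.empty_like(video)
    for t in range(video.shape[0]):
        for y in range(video.shape[1]):
            for x in range(video.shape[2]):
                slow[t, y, x] = min(max(video[t, y, x] * 1.2 - 0.1, 0.0), 1.0)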

In my opinionated opinion, the trouble isn't so much that it's possible to fall down that stack. That is arguably a feature, after all; surely we should have the capability of doing that sort of thing if we want. The problem is how easy it is to do without realizing it, just by using Python in what looks like perfectly sensible ways. If you aren't a systems engineer it can be hard to tell you've fallen, and even if you are, honestly, the docs don't make it particularly easy to figure out.

  • Plus it isn't a checkbox on a UI, where Electron being 1000 times slower (1 ms instead of 1 µs) would barely be noticeable.

    It could be a 12-hour run vs. a 12,000,000-hour run.

  • As a simple example: once upon a time we needed to generate a sort of heat map. Doing it in pure Python took a few seconds at the desired size (a few thousand cells, where each cell needs a small formula). Dropping to numpy brought that down to hundreds of milliseconds. Pushing it to pure C got us to tens of milliseconds.

    • Yeah, one of the other beauties of numpy is that you can pass data to/from native shared libraries compiled from C code with little overhead. This was more kludgy in Matlab last I checked.
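      For example, a minimal sketch using ctypes with numpy.ctypeslib; the library name libheatmap.so and its fill_heatmap function are hypothetical, just to show the shape of the interop:

          import ctypes
          import numpy as np
          from numpy.ctypeslib import ndpointer

          # Hypothetical shared library exporting:
          #   void fill_heatmap(double *out, int rows, int cols);
          lib = ctypes.CDLL("./libheatmap.so")
          lib.fill_heatmap.restype = None
          lib.fill_heatmap.argtypes = [
              ndpointer(ctypes.c_double, flags="C_CONTIGUOUS"),
              ctypes.c_int,
              ctypes.c_int,
          ]

          grid = np.empty((2000, 2000), dtype=np.float64)
          # The C function writes straight into the NumPy buffer, no copy.
          lib.fill_heatmap(grid, grid.shape[0], grid.shape[1])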

  • that's a great hierarchy!

    though what does "static cpu" vs "dynamic cpu" mean? it's one thing to be pointer chasing and missing the cache like OCaml can, it's another to be running a full interpreter loop to add two numbers like python does

"merely less performant" is severely underselling it. It could easily add a few zeros to your execution time.

(And that's before you even consider GPUs.)

It's a slippery slope. Sometimes a python loop outside some numpy logic is fine but it's insane how much perf you can leave on the table if you overdo it.

It's not just Python adding interpreter overhead; you also risk creating a lot of temporary arrays, i.e. costly mallocs and memcopies.
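A small sketch of what I mean; the out= form reuses preallocated buffers instead of allocating intermediates:

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)

    # Allocates full-size temporaries for a * 2, b * 3, and their sum:
    c = a * 2 + b * 3

    # Equivalent result with preallocated buffers and in-place ufunc calls:
    c2 = np.empty_like(a)
    tmp = np.empty_like(b)
    np.multiply(a, 2, out=c2)    # c2 = a * 2
    np.multiply(b, 3, out=tmp)   # tmp = b * 3
    np.add(c2, tmp, out=c2)      # c2 = a * 2 + b * 3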

Right, you can use loops. But then it goes much slower than a GPU permits.

  • But once you need to use the GPU, you need to go to another framework anyway (e.g. jax, tensorflow, arrayfire, numba...). AFAIK many of those can parallelise loops using their jit functionality (in fact, e.g. numba's jit for a long time could not deal with numpy broadcasting, so you had to write out your loops). So you're not really running into a problem?
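    A minimal sketch of that write-out-your-loops style with numba's jit (the function here is just illustrative):

        import numpy as np
        from numba import njit, prange

        @njit(parallel=True)
        def scaled_sum(x, y):
            # Explicit loop that numba compiles and parallelizes across i.
            out = np.empty_like(x)
            for i in prange(x.shape[0]):
                out[i] = 2.0 * x[i] + y[i]
            return out

        result = scaled_sum(np.random.rand(1_000_000), np.random.rand(1_000_000))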

  • My point is that plenty of people use NumPy for reasons that have nothing to do with a GPU.

    • The whole point of NumPy is to make things much, much faster than interpreted Python, whether you're GPU-accelerated or not.

      Even code you write now, you may need to GPU accelerate later, as your simulations grow.

      Falling back on loops is against the entire reason of using NumPy in the first place.

    • I mean yes. Also in your example, where you hardly spend any time running Python code, the performance difference likely wouldn't matter.