Comment by jerf

14 hours ago

You can draw out a sort of performance hierarchy, from fastest to slowest:

    * Optimized GPU code
    * CPU vectorized code
    * Static CPU unvectorized code
    * Dynamic CPU code

where the last one refers to the fact that a language like Python, in order to add two numbers together in its native, pure-Python mode, does a lot of boxing, unboxing, resolving of class types and checking for overrides, etc.
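To see that gap directly, here's a minimal sketch (exact numbers will vary by machine) of a million additions done both ways:

    import timeit
    import numpy as np

    n = 1_000_000
    a, b = list(range(n)), list(range(n))
    a_np, b_np = np.arange(n), np.arange(n)

    # Pure Python: every addition boxes/unboxes ints and dispatches __add__.
    t_py = timeit.timeit(lambda: [x + y for x, y in zip(a, b)], number=10)

    # NumPy: one dispatch, then a tight C loop over unboxed machine integers.
    t_np = timeit.timeit(lambda: a_np + b_np, number=10)

    print(f"pure Python: {t_py:.3f}s  NumPy: {t_np:.3f}s")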

Each of those is at least an order of magnitude slower than the next one up the hierarchy, and most of the gaps are appreciably more than one. As a back-of-the-envelope understanding, you're closer if you think of each step as more like 1.5 orders of magnitude; three such gaps from top to bottom compound to roughly 10^4.5, a factor of tens of thousands.

Using NumPy incorrectly can accidentally take you from the top one, all the way to the bottom one, in one fell swoop. That can be a big deal, real quick. Or real slow, as the case may be.

In more complicated scenarios, it matters how much computation is going how far down that hierarchy. If by "processing a video frame by frame" you mean something like "I wrote a for loop on the frames but all the math is still in NumPy", you've taken "iterating on frames" from the top to the bottom, but who cares; Python can iterate on even a million things plenty quickly, especially with everything else that is going on. If, by contrast, you mean that at some point you're iterating over each pixel in pure Python, you just fell all the way down that hierarchy for each pixel, and you're in bigger trouble.
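A sketch of the distinction, with a made-up array standing in for real video frames (the shapes and the brighten operation are just for illustration):

    import numpy as np

    frames = np.random.rand(100, 120, 160)  # hypothetical video: 100 small frames

    # Fine: Python only iterates over the 100 frames; NumPy does the
    # per-pixel math in vectorized C.
    def brighten_per_frame(frames):
        out = np.empty_like(frames)
        for i, frame in enumerate(frames):
            out[i] = np.clip(frame * 1.2, 0.0, 1.0)
        return out

    # Trouble: pure Python touches every pixel, ~2 million times here,
    # paying the full dynamic-dispatch cost each time.
    def brighten_per_pixel(frames):
        out = np.empty_like(frames)
        for i in range(frames.shape[0]):
            for y in range(frames.shape[1]):
                for x in range(frames.shape[2]):
                    out[i, y, x] = min(frames[i, y, x] * 1.2, 1.0)
        return out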

In my opinionated opinion, the trouble isn't so much that it's possible to fall down that stack. That is arguably a feature, after all; surely we should have the capability of doing that sort of thing if we want. The problem is how easy it is to do without realizing it, just by using Python in what looks like perfectly sensible ways. If you aren't a systems engineer it can be hard to tell you've fallen, and even if you are, honestly, the docs don't make it particularly easy to figure out.

Plus, this isn't a checkbox on a UI, where Electron being 1000 times slower (1 ms instead of 1 µs) wouldn't even be noticeable.

It could be a 12-hour run vs. a 12,000,000-hour run.

As a simple example: once upon a time we needed to generate a sort of heat map. Doing it in pure Python took a few seconds at the desired size (a few thousand cells, where each cell needs a small formula). Dropping to NumPy brought that down to hundreds of milliseconds. Pushing it to pure C got us to tens of milliseconds.
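Roughly the shape of it, as a sketch (the actual per-cell formula was different; a simple distance kernel stands in here):

    import numpy as np

    W, H = 80, 60  # a few thousand cells

    # Pure Python: the formula is re-interpreted for every single cell.
    def heatmap_python(w, h):
        return [[((x - w / 2) ** 2 + (y - h / 2) ** 2) ** 0.5
                 for x in range(w)] for y in range(h)]

    # NumPy: build the coordinate grids once, then run the formula
    # a single time over whole arrays in C.
    def heatmap_numpy(w, h):
        y, x = np.mgrid[0:h, 0:w]
        return np.hypot(x - w / 2, y - h / 2)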

  • Yeah, one of the other beauties of NumPy is that you can pass data to/from native shared libraries compiled from C code with little overhead. This was more kludgy in Matlab last I checked.
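For example, with ctypes (a minimal sketch; libscale.so and its scale() function are hypothetical stand-ins for your own compiled C code):

    import ctypes
    import numpy as np

    # Hypothetical C function, compiled into libscale.so:
    #   void scale(double *data, size_t n, double factor)
    lib = ctypes.CDLL("./libscale.so")
    lib.scale.argtypes = [
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),
        ctypes.c_size_t,
        ctypes.c_double,
    ]
    lib.scale.restype = None

    data = np.arange(10, dtype=np.float64)
    lib.scale(data, data.size, 2.0)  # C mutates the NumPy buffer in place, no copy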

that's a great hierarchy!

though what does "static cpu" vs "dynamic cpu" mean? it's one thing to be pointer chasing and missing the cache like OCaml can, it's another to be running a full interpreter loop to add two numbers like python does