Comment by bayindirh

8 days ago

...and we always circle back to "premature optimization is the root of all evil", since processors are a wee bit more intelligent with our instructions than we thought. :)

Not that intelligent. If you have two loads and one store per cycle, then that’s it. (Recall that we have SSDs with 14 GB/s sequential reads now, yet CPU clocks are below 6 GHz.) Most of the computational power of a high-performance CPU is in the vector part; still the CPU won’t try to exploit it if you don’t, and the compiler will try but outside of the simplest cases won’t succeed. (Most of the computational power of a high-performance computer is in the GPU, but I haven’t gotten that deep yet.)

I don’t mean to say that inefficient solutions are unacceptable; they have their place. I do mean to say that, for example, for software running on end-users’ computers (including phones), the programmer is hardly entitled to judge the efficiency on the scale of the entire machine—the entire machine does not belong to them.

> We should forget about small inefficiences, say 97% of the time; premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

D. E. Knuth (1974), “Structured Programming with go to Statements”, ACM Comput. Surv. 6(4).

  • You are right, but with a good optimizing compiler and out-of-order execution, your code will rarely execute the way you imagine, even though it accomplishes what you want.

    On the other hand, in high-performance computing the processor will try to act smart to keep everything saturated. As a result, you still need to look at the cache miss ratio, IPC, retirement ratio, etc. to see whether you are using the system at its peak. The CPU does its part to keep those numbers high, but that's not enough, of course. You have to do your own part and write good code.

    In cases where you share the machine (which can be a cluster node or a mobile phone), maximizing this performance is again beneficial, since it allows smoother operation for both your code and other users' code in general. Trying to saturate the system with your process is a completely different thing, but you don't have to do that to have nice, performant code.

    GPU computation is nice, and you can do big things fast, but it's not suitable for optimizing and offloading every kind of task. Even when a task is suitable for the GPU, the scale of the computation still matters, because a competent programmer can fit billions of CPU computations into the time it takes a GPU to even start running your kernel. The launch overhead is just too big.

    Knuth's full quote doesn't actually contradict me, because that's how I operate while writing code, whether it's designed for high performance or not.

Wait until you learn that the Tomasulo algorithm is not a magic wizard box for fast code - you can also write code that meshes well with it and get even more performance. Your CPU can run up to five or so instructions every cycle - are you feeding it instructions in suitably matched groups to maximize parallelism?

Premature optimization may be the root of all evil, but timely optimization can give you insane benchmark numbers, at least. Not enough people pay attention to the word "premature"; they think the quote is about all optimization. Sadly, I've never worked on anything where it would have been timely. I haven't even used SIMD - just watched other people's talks about it.

Some general principles are widely applicable: modern processors love to work on big linear SIMD-aligned arrays of data. Hence the structure-of-arrays pattern and so on. While a CPU is still the best tool you have for complex branching, you should avoid complex branching as much as possible. On Z80-era architectures it was the reverse - following a pointer or taking a branch was close to free relative to computation, which is why patterns like linked lists and virtual functions were such a natural fit back then.

  • I wrote a boundary-element-method-based material simulation code during my Ph.D., and had the time and opportunity to apply both good design practices and some informed coding: minimal, biased branching (so the code runs unimpeded most of the time), plus lock-free algorithms and atomics in parallel code where applicable. Moreover, I looked for hotspots with callgrind and checked for leaks with valgrind.

    As a result, I was handling two 3000x3000 dense, double-precision matrices plus some vectors in under ~250 MB of RAM, and getting in and out of the whole integration routine in under a minute. The whole evaluation took a bit more than a minute on last decade's hardware.
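    (For scale, a back-of-the-envelope check - my arithmetic here, not a measurement - shows the raw matrix storage alone accounts for most of that budget:)

    ```cpp
    #include <cassert>
    #include <cstddef>

    int main() {
        // One dense 3000x3000 double-precision matrix:
        std::size_t n = 3000;
        std::size_t bytes = n * n * sizeof(double);  // 9e6 elements * 8 B = 72 MB
        assert(bytes == 72000000u);
        // Two such matrices account for 144 MB of the ~250 MB budget;
        // the rest is vectors, workspace, and allocator overhead.
        assert(2 * bytes == 144000000u);
        return 0;
    }
    ```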

    The perf numbers came back great: ~5 IPC, ~20% cache miss ratio, and a completely, efficiently saturated FPU and memory bandwidth.

    The result? The PoC ran in 30 minutes; my code was done in under 2 minutes for most problems. The throughput was 1.7 million complete Gaussian integrations per second, per core. We use an adaptive approach which takes many partial integrations for one complete integration [0], so it amounted to a higher number of partial integrations (IIRC it was ~5 million/sec/core, but my memory is hazy).

    The code I wrote doesn't call for SIMD explicitly, but Eigen [1] most probably does in its matrix routines. Nevertheless, the core "secret sauce" was not the formulae but implementing them in a way that minds the processor's tastes. Adding "-march=native -mtune=native -O3 -g0" gave a 3x performance boost in some steps of the code.

    I'm currently, slowly reimplementing this in a tighter form, so let's see how it goes.

    So, also considering what I do for a job, I know how nuanced Knuth's quote is; but people are people, and I'm not one to toot his own horn 24/7/365.

    I just wanted to share this to confirm and support your claim, and to continue the conversation if you're interested in more.

    [0]: https://journals.tubitak.gov.tr/elektrik/vol29/iss2/45/

    [1]: https://eigen.tuxfamily.org/

    • Indeed, scheduling instructions into parallel-compatible aligned blocks is menial work that's usually best done by a machine; each CPU has different preferences, so it only works well if the machine knows which kind of CPU the code will actually run on.

      Eigen certainly uses a bunch of optimizations, including SIMD, but also things like FFTs and matrix decompositions.