← Back to context

Comment by adrian_b

16 hours ago

Obviously it is easier to write any program as a single sequential thread, because you do not need to think about the dependencies between program statements. When you append a statement, you assume that all previous statements have been already executed, so the new statement can access without worries any data it needs.

The problem is that the speed of a single thread is limited and there exists no chance to increase it by significant amounts.

As long as we will continue to use silicon, there will be negligible increases in clock frequency. Switching to other semiconductors might bring us a double clock frequency in 10 years from now, but there will never be again a decade like that from 1993 to 2003, when the clock frequencies have increased 50 times.

The slow yearly increase in instructions per clock cycle is obtained by making the hardware do more and more of the work that has not been done by the programmer or the compiler, i.e. by extracting from the sequential program the separate chains of dependent instructions that should have been written as distinct threads, in order to execute them concurrently.

This division of a single instruction sequence into separate threads is extremely inefficient when done at runtime by hardware. Because of this the CPU cores with very high IPC have lower and lower performance per area and per power with the increase of the IPC. Low performance per area and per power means low multithreaded performance.

So the CPU cores with very good single-threaded performance, like Intel Lion Cove or Arm Cortex-X925 have very poor multi-threaded performance and using many of them in a CPU would be futile, because in the same limited area one could put many more small CPU cores, achieving a much higher total performance.

This is why such big CPU cores that are good for single-threaded applications must be paired with smaller CPU cores, like Intel Skymont or Arm Cortex-X4, in order to obtain a good multi-threaded performance.

Writing the program as a single thread is easy and of course it should always be done so when the achieved performance is good enough on the current big superscalar CPU cores.

On the other hand, whenever the performance is not sufficient, there is no way to increase it a lot otherwise than by decomposing the work that must be done into multiple concurrent activities.

The easy case is that of iterations, which frequently provide large amounts of work that can be done concurrently. Moreover, with iterations there are many tools that can create concurrent threads automatically, like OpenMP or NVIDIA CUDA.

Where there are no iterations, one may need to do much more work to identify the dependencies between activities, in order to determine which may be executed concurrently, because they do not have functional dependencies between them.

However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.

Nevertheless even in such cases one can question whether the specification of the program is truly what the end user needs. For instance, when computing a hash over a file, the actual goal is normally not the computation of the hash, but to verify whether the file is the same as another (where the other file may be a past version of the same file, to detect modification, or an apparently distinct file coming from another source, when deduplication is desired). In such cases, it does not really matter which hash function is used, so it may be acceptable to replace the hash algorithm with another that allows concurrent computation, solving the performance problem.

Similar reformulations of the problem that must be solved may help in other cases where initially it appears that it is not possible to decompose the workload into concurrent tasks.

> However, when an entire program consists of a single chain of dependent instructions, which may happen e.g. when computing certain kinds of hash functions over a file, you are doomed. There is no way to increase the performance of that program.

Even in that case, you would probably benefit from having many cores because the user is probably running other things on the same machine, or the program is running on a runtime with eg garbage collector threads etc. I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!

  • > I’d venture it’s quite rare that the entire machine is waiting on a single sequential task!

    But that happens all the time in video game code.

    Video games may have many threads running, but there's usually a single-thread bottleneck. To the point that P-cores and massively huge Zen5 cores are so much better for video games.

    Javascript (ie: rendering webpages) is single-threaded bound, which is probably why the Phone makers have focused so much on making bigger cores as well. Yes, there's plenty of opportunities for parallelism in web browsers and webpages. But most of the work is in the main Javascript thread called at the root.