Comment by IshKebab

1 year ago

Having worked for a company that made one of those "hundreds of small CPUs on a single chip" products, I can tell you now that they're all going to fail because the programming model is too weird, and nobody will write software for them.

Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.

The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand, and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUs. It could be executed very slowly using just a few logic gates and enough RAM to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
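To make that concrete, here is a minimal sketch (all names hypothetical) of such an intermediate representation: a tiny program, a 1-bit full adder, lowered to a DAG of Boolean operations and evaluated node by node. A graph in this form is exactly the kind of thing that could be chopped into chunks for small CPUs or mapped onto LUTs.

```python
# A program as a DAG of Boolean ops: dag maps node name -> (op, operand
# names); operands are either other nodes or leaf inputs.
def eval_dag(dag, inputs):
    values = dict(inputs)
    def value(name):
        if name not in values:
            op, args = dag[name]
            a, b = (value(x) for x in args)
            values[name] = {"and": a & b, "or": a | b, "xor": a ^ b}[op]
        return values[name]
    return {name: value(name) for name in dag}

# A 1-bit full adder expressed as such a graph:
adder = {
    "s1":   ("xor", ["a", "b"]),
    "sum":  ("xor", ["s1", "cin"]),
    "c1":   ("and", ["a", "b"]),
    "c2":   ("and", ["s1", "cin"]),
    "cout": ("or",  ["c1", "c2"]),
}

result = eval_dag(adder, {"a": 1, "b": 1, "cin": 0})  # sum=0, cout=1
```

Every node depends only on its operands, so independent subgraphs (here "sum" and "cout" after "s1") could be evaluated by different processing elements.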

  • Reminds me of Mill Computing's stuff.

    https://millcomputing.com/

    • Mill Computing's proposed architecture is less a "generic sea" of small CPUs than a VLIW design with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution. VLIW CPUs are far from "tiny" in any general sense.

  • This sounds like graph reduction as done by https://haflang.github.io/ and that flavor of special purpose CPU.

    The downside of reducing a large graph is the need for high bandwidth low latency memory.

    The upside is that tiny CPUs attached directly to the memory could do reduction (execution).

  • Isn't that the Connection Machine architecture?

    • Most practical parallel computing hardware has used queues to handle the mismatch in compute speed between the various CPUs running different algorithms on parts of the data.

      Eliminating the CPU-bound compute and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.

      Imagine a sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors. The programming for this, even as virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.

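      A toy model of such a locally connected, clocked sea of LUTs (purely illustrative, not any real hardware): a 1-D ring of cells where every cell applies the same 3-input lookup table to itself and its two neighbors on each clock tick. With the LUT set to "left XOR right", this is the elementary cellular automaton rule 90.

```python
# Each cell is a 3-input LUT wired only to its immediate neighbors;
# all cells update in lockstep on a shared clock.
def tick(cells, lut):
    n = len(cells)
    return [lut[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

# LUT for "left XOR right" (elementary cellular automaton rule 90):
rule90 = {(l, c, r): l ^ r for l in (0, 1) for c in (0, 1) for r in (0, 1)}

state = [0, 0, 0, 0, 1, 0, 0, 0]
for _ in range(3):
    state = tick(state, rule90)
```

      Swapping in a different LUT, a different wiring radius, or a bigger grid explores exactly the kind of design space described above without changing the graph being run.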

    • The CM architecture and programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was however much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
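      Roughly what those primitives look like, sketched in NumPy terms (this only mirrors the programming model described above; the CM languages themselves were quite different):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# Blelloch-style exclusive scan (prefix sum):
scan = np.concatenate(([0], np.cumsum(x)[:-1]))

# Generalized gather: pull values through an index tensor.
idx = np.array([7, 0, 3, 3])
gathered = x[idx]

# Nearest-neighbor shift along an axis (shift by one, zero fill):
shifted = np.roll(x, 1)
shifted[0] = 0
```

      Each line is a whole-tensor operation: serialized with respect to the others, but internally data-parallel across every element.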

    • I would not complain about getting my own personal Connection Machine.

      So long as Tamiko Thiel does the design.

  • there’s a differentiable version of this that compiles to C or CUDA: difflogic
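    The core trick, roughly, is replacing hard Boolean gates with real-valued relaxations that agree with the gate on {0, 1} but have useful gradients in between (difflogic itself learns a soft mixture over all sixteen two-input gates; this sketch only shows the relaxation idea):

```python
# Probabilistic relaxations of Boolean gates: for inputs in [0, 1], each
# returns the probability that the gate outputs 1 given independent
# Bernoulli inputs; at the corners they equal the hard gate.
def soft_and(a, b):
    return a * b

def soft_or(a, b):
    return a + b - a * b

def soft_xor(a, b):
    return a + b - 2 * a * b
```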

Yep, transputers failed miserably. I wrote a ton of code for them. Everything had to be funneled through a serial bus, which defeated the purpose of the transputer.

  • Quite fascinating. Did you write about your experiences in that area? Would love to read it!

    • Not in those terms, but an autobiography is coming, and bits and pieces are being explained. I expect about 10 people to buy the book, as all of the socialists will want it for free. I am negotiating with a publisher as we speak on the terms.

Could you elaborate on this? How does many-small-CPUs make for a weirder programming model than a GPU?

I'm no expert, but I've done my fair share of parallel HPC stuff using MPI, and a little bit of CUDA. And to me the GPU programming model is far, far "weirder" and harder to code for than the many-CPUs model. (Granted, I'm assuming you're describing a different regime?)

  • In CUDA you don't really manage the individual compute units: you launch a kernel, and the driver takes care of distributing it to the compute cores and managing the data flows between them.

    When programming CPUs however you are controlling and managing the individual threads. Of course, there are libraries which can do that for you, but fundamentally it's a different model.
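    A toy contrast in plain Python (illustrative only, not CUDA syntax): with explicit threads, the programmer picks the partition and manages the workers; in a kernel-style model, you write only the per-element body and let a launcher own the distribution.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

# Explicit-thread style: the programmer chooses the split and the workers.
def worker(chunk):
    return [v * v for v in chunk]

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = pool.map(worker, [data[:4], data[4:]])
    explicit = [v for part in parts for v in part]

# Kernel style: one per-index body; the "launcher" owns the distribution
# (here a trivial loop; on a GPU, the driver's scheduler).
def kernel(i):
    return data[i] * data[i]

launched = [kernel(i) for i in range(len(data))]
```

    Both compute the same result; the difference is who decides how work maps onto hardware.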

    • The GPU equivalent of a single CPU "hardware thread" is called a "warp" or a "wavefront". GPUs can run many warps/wavefronts per compute unit by switching between warps to hide memory access latency. A CPU core can do this with two hardware threads using Hyper-Threading (2-way SMT), and some CPUs have 4-way SMT, but GPUs push that quite a bit further.

    • What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent to the older OpenMP.

      When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.

      Another case is when all the threads run the same program, but on different data. This is equivalent to a concurrent execution of a "for" loop, which is always possible when the iterations are independent.

      The execution of such a set of threads that execute the same program has been named "parallel DO instruction" by Melvin E. Conway in 1963, "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran language extension in 1997, "parallel for" in the OpenMP C/C++ language extension in 1998, and "kernel execution" in CUDA, which has also introduced the superfluous acronym SIMT to describe it.

      When a problem can be solved by a set of concurrent threads that run the same program, then it is much simpler to scale the parallelism to extremely high levels and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.

      There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA, but which creates a program for a CPU, not for a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.

      The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them", but nobody is able to tell exactly what is done by this extra hardware support and whether it really matters, because it is a part of the GPUs that has never been documented publicly by the GPU vendors.

      From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless of whether the target is a CPU or a GPU.

      Whenever you write a program equivalent to a "parallel for", which is the same as writing for CUDA, you do not manage individual threads, because what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, on a CPU as well as on a GPU. A desktop CPU like the Ryzen 9 9950X has the same product of threads by SIMD lanes as a big integrated GPU (obviously, discrete GPUs can be many times bigger).

While acknowledging that it's theoretically possible other approaches might succeed, it seems quite clear the author agrees with you.

I'm guessing you just don't have the computational power to compete with a real GPU. It would be relatively easy for a top-end graphics programmer to write the front-end graphics API for your chip. I'm guessing that if they did this you would just end up with a very poorly performing GPU.

My take from reading this is that it's more about programming abstractions than any particular hardware instantiation. The part of the Connection Machine that remains interesting is not building machines from CPUs with transistor counts in the hundreds running off a globally synchronous clock, but that there was a whole family of SIMD languages that let you do general-purpose programming in parallel. And those languages were still relevant when the architecture changed to a MIMD machine with a bunch of vector units behind each CPU.