Comment by mikewarot
1 year ago
The key transformation required to make any parallel architecture work is going to be taking a program that humans can understand and translating it into a directed acyclic graph of logical Boolean operations. This type of intermediate representation could then be broken up into little chunks for all those small CPUs. It could be executed very slowly using just a few logic gates and enough RAM to hold the state, or it could run at FPGA speeds or better on a generic sea of LUTs.
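To make the idea concrete, here's a minimal sketch (my own illustration, not any particular compiler's IR) of a program fragment lowered to a DAG of Boolean operations. A 1-bit full adder is expressed as a table of (op, input, input) nodes and evaluated in topological order - the kind of intermediate form that could be chunked across many tiny CPUs or mapped onto LUTs:

```python
# Hypothetical DAG-of-Boolean-ops representation; node ids are arbitrary.
OPS = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def eval_dag(nodes, inputs):
    """nodes: id -> ("INPUT", name) or (op, left_id, right_id), in topo order."""
    val = {}
    for nid, node in nodes.items():
        if node[0] == "INPUT":
            val[nid] = inputs[node[1]]
        else:
            op, l, r = node
            val[nid] = OPS[op](val[l], val[r])
    return val

# Full adder: sum = a ^ b ^ cin, cout = (a & b) | (cin & (a ^ b))
adder = {
    0: ("INPUT", "a"), 1: ("INPUT", "b"), 2: ("INPUT", "cin"),
    3: ("XOR", 0, 1),   # a ^ b
    4: ("XOR", 3, 2),   # sum bit
    5: ("AND", 0, 1),   # a & b
    6: ("AND", 3, 2),   # cin & (a ^ b)
    7: ("OR", 5, 6),    # carry out
}

v = eval_dag(adder, {"a": 1, "b": 1, "cin": 0})
print(v[4], v[7])  # sum=0, cout=1
```

Each node touches only its two predecessors, so any connected chunk of the graph can be handed to a separate small processor.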
Reminds me of Mill Computing's stuff.
https://millcomputing.com/
Mill Computing's proposed architecture isn't really a "generic sea" of small CPUs; it's more like VLIW with lots of custom "tricks" in the ISA and programming model to make it nearly as effective as the usual out-of-order execution. VLIW CPUs are far from 'tiny' in any general sense.
This sounds like graph reduction as done by https://haflang.github.io/ and that flavor of special-purpose CPU.
The downside of reducing a large graph is the need for high-bandwidth, low-latency memory.
The upside is that tiny CPUs attached directly to the memory could do reduction (execution).
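A toy illustration of what that reduction step looks like (my own sketch, not haflang's actual machine): a shared subgraph is reduced once, and its result overwrites the node in place. The "tiny CPU at the memory" would perform exactly this read-reduce-writeback cycle on its local chunk of the graph.

```python
# Toy graph reduction with sharing: nodes are ("lit", n) or ("add", l, r).
def reduce_node(graph, nid, counter):
    node = graph[nid]
    if node[0] == "lit":
        return node[1]
    op, l, r = node
    counter[0] += 1                      # count actual reductions performed
    result = reduce_node(graph, l, counter) + reduce_node(graph, r, counter)
    graph[nid] = ("lit", result)         # overwrite in place: sharing pays off
    return result

# (1 + 2) used twice via sharing: node 2 is referenced twice by node 3.
graph = {
    0: ("lit", 1), 1: ("lit", 2),
    2: ("add", 0, 1),     # shared subexpression
    3: ("add", 2, 2),     # (1+2) + (1+2)
}
n = [0]
print(reduce_node(graph, 3, n), "reductions:", n[0])  # 6 reductions: 2
```

Only two additions are performed for three `add` references, because the second visit to node 2 finds a literal already written back - that write-back traffic is exactly why the memory bandwidth concern above matters.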
Like interaction nets?
Isn't that the Connection Machine architecture?
Most practical parallel computing hardware has queues to handle the mismatch in compute speed when various CPUs run different algorithms on parts of the data.
Eliminating the CPU-bound compute and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Imagine a sea of LUTs (look-up tables) that are all clocked and only connected locally to their neighbors. The programming for this, even as a virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.
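A minimal simulation of that model (my own sketch): a ring of 1-bit cells, each driven by a 3-input LUT over (left, self, right), all updated on the same clock edge. The LUT contents here are elementary cellular-automaton rule 110, chosen just as an example program; any 8-entry truth table would do.

```python
# One clock tick of a 1-D "sea of LUTs": every cell reads only its
# immediate neighbors and itself, then all cells update simultaneously.
def step(cells, lut):
    n = len(cells)
    return [lut[(cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]]
            for i in range(n)]

rule110 = [(110 >> p) & 1 for p in range(8)]  # truth table indexed by (l,s,r)
cells = [0] * 15 + [1]                        # single live cell on a 16-cell ring
for _ in range(5):
    cells = step(cells, rule110)
print("".join(map(str, cells)))
```

Because every cell's next state depends only on local wires, the same graph can be retimed, folded onto fewer physical LUTs, or spread across more of them - the design-space exploration described above.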
> Most practical parallel computing hardware has queues to handle the mismatch in compute speed when various CPUs run different algorithms on parts of the data.
> Eliminating the CPU-bound compute and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Modern parallel scheduling systems still have "queues" to manage these concerns; they're just handled in software, with patterns like "work stealing" that describe what happens when unexpected mismatches in execution time must somehow be handled. Even your "sea of LUTs (look up tables), that are all clocked and only connected locally to their neighbors" has queues, only the queue is called a "pipeline" and a mismatch in execution speed leads to "pipeline bubbles" and "stalls". You can't really avoid these issues.
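For the software side of that point, here's a deterministic, sequential toy model of work stealing (an illustration of the pattern, not any real scheduler's API): each worker pops tasks from its own deque's tail, and an idle worker steals from the head of a busy worker's deque, absorbing the speed mismatch.

```python
from collections import deque

def run(workers):
    """Drain all deques; idle workers steal from the head of busy ones."""
    done = []
    while any(workers):
        for dq in workers:
            if dq:
                done.append(dq.pop())                # local LIFO pop (tail)
            else:
                victims = [d for d in workers if len(d) > 1]
                if victims:
                    dq.append(victims[0].popleft())  # steal from the head
    return done

workers = [deque(["a1", "a2", "a3", "a4"]), deque()]
print(run(workers))  # ['a4', 'a3', 'a1', 'a2'] -- 'a1' was stolen by worker 1
```

The steal is the software analogue of the pipeline bubble: it's the mechanism that fires precisely when execution speeds don't match.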
The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was, however, much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
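For readers unfamiliar with those primitives, here is a hedged sketch of two of them on plain Python lists: an exclusive (Blelloch-style) plus-scan and a nearest-neighbor shift. The scan is written sequentially, but each up-sweep/down-sweep phase is data-parallel on real hardware.

```python
def exclusive_scan(xs):
    """Blelloch exclusive prefix sum; length must be a power of two."""
    a = list(xs)
    n = len(a)
    d = 1
    while d < n:                         # up-sweep (parallel reduce)
        for i in range(d * 2 - 1, n, d * 2):
            a[i] += a[i - d]
        d *= 2
    a[n - 1] = 0                         # clear the root
    while d > 1:                         # down-sweep (distribute partial sums)
        d //= 2
        for i in range(d * 2 - 1, n, d * 2):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
    return a

def shift(xs, k, fill=0):
    """Shift a 1-D 'tensor' k places in the positive direction."""
    return [fill] * k + xs[:-k] if k else list(xs)

print(exclusive_scan([1, 2, 3, 4]))   # [0, 1, 3, 6]
print(shift([1, 2, 3, 4], 1))         # [0, 1, 2, 3]
```

Both phases of the scan only combine elements a fixed stride apart, which is why it maps so cleanly to lockstep SIMD hardware like the CM.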
I would not complain about getting my own personal Connection Machine.
So long as Tamiko Thiel does the design.
There's a differentiable version of this that compiles to C or CUDA: difflogic
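The idea behind differentiable logic, sketched in a few lines (this is the general technique, not difflogic's actual API): each Boolean gate has a smooth relaxation on [0, 1], and a trainable softmax over gate types lets gradient descent choose which gate a node should be.

```python
import math

# Smooth relaxations that agree with the Boolean gates at 0/1 inputs.
SOFT_GATES = {
    "AND":  lambda a, b: a * b,
    "OR":   lambda a, b: a + b - a * b,
    "XOR":  lambda a, b: a + b - 2 * a * b,
    "NAND": lambda a, b: 1 - a * b,
}

def soft_gate(a, b, logits):
    """Convex mixture of gate relaxations, weighted by softmax(logits)."""
    z = [math.exp(l) for l in logits.values()]
    s = sum(z)
    return sum((w / s) * SOFT_GATES[g](a, b) for w, g in zip(z, logits))

# At binary inputs the relaxations reproduce the Boolean truth tables:
assert SOFT_GATES["XOR"](1, 1) == 0 and SOFT_GATES["XOR"](0, 1) == 1
# Logits that strongly favor AND make the mixed gate behave (almost) like AND:
logits = {"AND": 10.0, "OR": -10.0, "XOR": -10.0, "NAND": -10.0}
print(round(soft_gate(1.0, 1.0, logits), 3))
```

After training, the softmax is hardened to the argmax gate, leaving an ordinary Boolean network that can be emitted as straight-line C or CUDA.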