Most practical parallel computing hardware has had queues to handle the mismatch in compute speed when various CPUs run different algorithms on parts of the data.
Eliminating the CPU-bound compute and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Imagine a sea of LUTs (look-up tables) that are all clocked and only connected locally to their neighbors. The programming for this, even as a virtual machine, allows for exploration of a virtually infinite design space of hardware with various tradeoffs for speed, size, cost, reliability, security, etc. The same graph could be refactored to run on anything in that design space.
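To make the picture concrete, here's a toy sketch of that "sea of LUTs" in plain Python. Everything here is my own invention for illustration: a 1-D ring of 1-bit cells where each cell is a 3-input LUT reading itself and its two neighbors, all updating simultaneously on each clock tick (the LUT contents happen to encode elementary cellular automaton rule 110). A real fabric would be 2-D (or 3-D) and reprogrammable, but the local-only wiring and global clock are the point.

```python
def tick(cells, lut):
    """One synchronous clock tick: every cell applies its LUT to
    (left neighbor, self, right neighbor) simultaneously."""
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[(i - 1) % n]   # ring topology: only local wires
        mid = cells[i]
        right = cells[(i + 1) % n]
        out.append(lut[(left << 2) | (mid << 1) | right])
    return out

# The 8-entry LUT, indexed by the (left, mid, right) bits.
# Here: bit i of the number 110, i.e. cellular automaton rule 110.
RULE_110 = [(110 >> i) & 1 for i in range(8)]

state = [0] * 16
state[8] = 1                 # a single live cell
for _ in range(4):           # four clock ticks
    state = tick(state, RULE_110)
```

The "design space" the comment mentions corresponds to choices this sketch fixes arbitrarily: grid dimensionality, LUT arity, wiring radius, and whether the LUT contents are static or loaded at configuration time.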
> Most practical parallel computing hardware has had queues to handle the mismatch in compute speed when various CPUs run different algorithms on parts of the data.
> Eliminating the CPU-bound compute and running everything truly in parallel eliminates the need for the queues and all the related hardware/software complexity.
Modern parallel scheduling systems still have "queues" to manage these concerns; they're just handled in software, with patterns like "work stealing" describing what happens when unexpected mismatches in execution time must be absorbed. Even your "sea of LUTs (look-up tables) that are all clocked and only connected locally to their neighbors" has queues, only the queue is called a "pipeline", and a mismatch in execution speed leads to "pipeline bubbles" and "stalls". You can't really avoid these issues.
The CM architecture or programming model wasn't really a DAG. It was more like tensors of arbitrary rank with power-of-two sizes. Tensor operations themselves were serialized, but each of them ran in parallel. It was, however, much nicer than coding vectors today - it included Blelloch scans, generalized scatter-gather, and systolic-esque nearest-neighbor operations (shift this tensor in the positive direction along this axis). I would love to see a language like this that runs on modern GPUs, but it's really not sufficiently general to get good performance there, I think.
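For anyone who hasn't seen these primitives: here's a rough sketch of two of them in plain Python. The function names and 1-D shapes are mine, not CM Lisp/*Lisp; on the actual hardware each inner loop body would run simultaneously across all elements rather than sequentially.

```python
def blelloch_exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Work-efficient exclusive prefix scan (Blelloch 1990 style).
    len(xs) must be a power of two."""
    a = list(xs)
    n = len(a)
    # Up-sweep (reduce) phase: build partial sums in a binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = op(a[i - d], a[i])
        d *= 2
    # Down-sweep phase: push prefixes back down the tree.
    a[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = a[i - d]
            a[i - d] = a[i]
            a[i] = op(t, a[i])
        d //= 2
    return a

def shift(xs, amount, fill=0):
    """Nearest-neighbor shift along one axis: positive amount moves
    data in the positive direction, filling vacated slots."""
    n = len(xs)
    if amount >= 0:
        return [fill] * amount + xs[: n - amount]
    return xs[-amount:] + [fill] * (-amount)

blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
# → [0, 3, 4, 11, 11, 15, 16, 22]
```

The scan takes O(log n) parallel steps on n processors, which is why it was a first-class primitive rather than something you built out of a sequential loop.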
I would not complain about getting my own personal Connection Machine.
So long as Tamiko Thiel does the design.