Comment by stcredzero

8 years ago

I've long felt that there's something less than half-baked about the multi-CPU architecture we're currently using. The hacky contortions HFT coders have come up with to avoid things like false sharing strike me as a big red flag.

https://mechanical-sympathy.blogspot.com/2011/07/false-shari...
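For readers who haven't hit the effect: here's a minimal sketch in Go (the types, padding size, and iteration count are my own illustration, assuming 64-byte cache lines). Two counters packed into one struct invalidate each other's cache line on every write from different cores; padding gives each counter its own line.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// shared packs both counters into (most likely) one 64-byte cache line,
// so two writers ping-pong ownership of that line between cores.
type shared struct {
	a uint64
	b uint64
}

// padded pushes b onto a separate cache line via 56 bytes of padding.
type padded struct {
	a uint64
	_ [56]byte
	b uint64
}

// bench increments two counters from two goroutines and times it.
func bench(fa, fb *uint64) time.Duration {
	const n = 1 << 22
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(fa, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddUint64(fb, 1)
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	var s shared
	var p padded
	fmt.Println("adjacent counters:", bench(&s.a, &s.b))
	fmt.Println("padded counters:  ", bench(&p.a, &p.b))
}
```

On most multi-core x86 machines the padded version runs noticeably faster, though the gap depends on the chip and scheduler.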

How about an architecture more like Erlang's, where you have independent processes, each with its own CPU core and its own memory, but with much faster communication supported at lower hardware levels? Why not a multi-processor architecture designed to directly support Hoare CSP-inspired languages?
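A toy sketch of the programming model I mean, using Go's channels as a software stand-in for hardware message queues between cores (the pipeline stages here are my own illustration, not any real architecture's API). Each "process" owns its state privately and communicates only by message passing:

```go
package main

import "fmt"

// squarer is one CSP-style process: it owns no shared state and talks
// to its neighbors only through channels.
func squarer(in <-chan int, out chan<- int) {
	for v := range in {
		out <- v * v
	}
	close(out)
}

// summer accumulates privately and reports a single result message.
func summer(in <-chan int, done chan<- int) {
	total := 0
	for v := range in {
		total += v
	}
	done <- total
}

func main() {
	nums, squares, done := make(chan int), make(chan int), make(chan int)
	go squarer(nums, squares)
	go summer(squares, done)
	for i := 1; i <= 4; i++ {
		nums <- i
	}
	close(nums)
	fmt.Println(<-done) // prints 30 (1+4+9+16)
}
```

The hypothetical hardware version would back each goroutine with a core-plus-local-memory tile and each channel with an on-chip link.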

Hypercube topology: http://web.eecs.umich.edu/~qstout/pap/IEEEM86.pdf

Something like this?

Parallelism is inherent in most problems but due to current programming models and architectures which have evolved from a sequential paradigm, the parallelism exploited is restricted. We believe that the most efficient parallel execution is achieved when applications are represented as graphs of operations and data, which can then be mapped for execution on a modular and scalable processing-in-memory architecture. In this paper, we present PHOENIX, a general-purpose architecture composed of many Processing Elements (PEs) with memory storage and efficient computational logic units interconnected with a mesh network-on-chip. A preliminary design of PHOENIX shows it is possible to include 10,000 PEs with a storage capacity of 0.6 GByte on a 1.5 cm² chip using 14 nm technology. PHOENIX may achieve 6 TFLOPS with a power consumption of up to 42 W, which results in a peak energy efficiency of at least 143 GFLOPS/W. A simple estimate shows that for a 4K FFT, PHOENIX achieves 117 GFLOPS/W, which is more than double what is achieved by state-of-the-art systems.

https://memsys.io/wp-content/uploads/2017/12/20171003-Memsys...
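As a quick sanity check on the abstract's headline numbers, using only the figures quoted above: 6 TFLOPS at a worst-case 42 W works out to roughly 143 GFLOPS/W, which matches the claimed peak efficiency.

```go
package main

import "fmt"

func main() {
	const peakFLOPS = 6e12 // 6 TFLOPS, as claimed in the abstract
	const watts = 42.0     // worst-case power draw
	// 6e12 / 42 / 1e9 ≈ 142.9, i.e. the "at least 143 GFLOPS/W" figure
	fmt.Printf("%.0f GFLOPS/W\n", peakFLOPS/watts/1e9) // prints 143 GFLOPS/W
}
```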

Something like this:

1) https://www.sciencedirect.com/science/article/pii/S014193311...

(PDF: https://science.raphael.poss.name/pub/poss.13.micpro.pdf )

"The Apple-CORE project has co-designed a general machine model and concurrency control interface with dedicated hardware support for concurrency management across multiple cores. Its SVP interface combines dataflow synchronisation with imperative programming, towards the efficient use of parallelism in general-purpose workloads. Its implementation in hardware provides logic able to coordinate single-issue, in-order multi-threaded RISC cores into computation clusters on chip, called Microgrids. In contrast with the traditional “accelerator” approach, Microgrids are components in distributed systems on chip that consider both clusters of small cores and optional, larger sequential cores as system services shared between applications."

2) https://ieeexplore.ieee.org/document/7300441/ (PDF: https://science.raphael.poss.name/pub/poss.15.tpds.pdf )

"This article advocates the use of new architectural features commonly found in many-cores to replace the machine model underlying Unix-like operating systems. "

That's like the Cell processor. For every year that you program for Cell you need at least two years of therapy.

  • If you're going to change the substrate or paradigm, then you need to do a dynamite job of supporting your users. Sony did not do that.

No networking can touch silicon-level interconnect between cores or within cores on a single chip, at least for latency. Erlang's model of computation doesn't have much to say about physical implementation, and multi-socket/distributed systems are not performant enough for latency-critical user applications. For servers and high-performance computing, sure, I guess in theory we could use tons of simple single-core chips, but fabrication costs and energy efficiency would be significantly worse.

  • No networking can touch silicon-level interconnect between cores or within cores on a single chip

    So how about silicon-level interconnect that looks like networking? As it is now, it seems almost designed to elicit badly non-optimal code.

    multi-socket/distributed systems are not performant for latency-critical user applications...fabrication costs and energy efficiency would be significantly worsened.

    I think there would be tremendous benefits if we started designing multi-socket/distributed systems that could perform in those situations. For one thing, Intel has currently painted itself into a corner with regard to large-wafer yields, and AMD is kicking their butts by combining smaller dies.

    https://www.youtube.com/watch?v=ucMQermB9wQ