Comment by zackmorris

4 days ago

It only took a quarter century, but I'm glad that somebody is finally adding a little multicore competition since Moore's law began failing in the mid-2000s.

I looked around a bit, and the going rate appears to be about $10,000 per 64 cores, or around $150 per core. Here is an Intel Xeon Platinum 8592+ 64 Core Processor with 61 billion transistors:

https://www.itcreations.com/product/144410

So that's about 6 million transistors per dollar, or roughly $160 per billion transistors.

It looks like Arm's 136 core Neoverse V3 has between 150 and 200 billion transistors, so at the same rate it should cost around $25,000-33,000. Each blade has 2 of those chips, so should be around $50,000-65,000 for compute. It doesn't say how much memory the blades come with, but that's a secondary concern.
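For anyone who wants to check the arithmetic, here is a quick Python sketch. The inputs are only the figures quoted above (roughly $10,000, 64 cores, 61 billion transistors, and my 150-200 billion guess for the Arm part). The function names are made up, and scaling price linearly with transistor count is a big simplifying assumption, not how chips are actually priced.

```python
# Back-of-the-envelope check using only the figures quoted in this comment.
# Assumption: price scales linearly with transistor count (a crude model).

XEON_PRICE_USD = 10_000      # approximate going rate quoted above
XEON_CORES = 64
XEON_TRANSISTORS = 61e9      # figure quoted above for the Xeon Platinum 8592+


def usd_per_core(price, cores):
    """Dollars per core."""
    return price / cores


def transistors_per_usd(transistors, price):
    """How many transistors one dollar buys."""
    return transistors / price


def usd_per_billion_transistors(transistors, price):
    """Dollars per billion transistors."""
    return price / (transistors / 1e9)


def extrapolated_price(transistors, ref_transistors, ref_price):
    """Scale a reference price linearly by transistor count."""
    return ref_price * transistors / ref_transistors


if __name__ == "__main__":
    print(f"${usd_per_core(XEON_PRICE_USD, XEON_CORES):,.2f} per core")
    print(f"{transistors_per_usd(XEON_TRANSISTORS, XEON_PRICE_USD):,.0f} transistors per dollar")
    print(f"${usd_per_billion_transistors(XEON_TRANSISTORS, XEON_PRICE_USD):,.2f} per billion transistors")
    for guess in (150e9, 200e9):  # my guessed range for the Arm chip
        price = extrapolated_price(guess, XEON_TRANSISTORS, XEON_PRICE_USD)
        print(f"{guess / 1e9:.0f}B transistors -> ${price:,.0f} per chip")
```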

Note that this is way too many cores for 1 bus, since by Amdahl's law, more than about 4-8 cores per bus typically results in the remaining cores getting wasted. Real-world performance will be bandwidth-limited, so I would expect a blade to perform about the same as a 16-64 core computer. But that depends on mesh topology, so maybe I'm wrong (AI thinks I might be):

  Intel Xeon Scalable: Switched from a Ring to a Mesh Architecture starting with Skylake-SP to handle higher core counts.
  
  Arm Neoverse V3 / AGI: Uses the Arm CMN-700 (Coherent Mesh Network), which is a high-bandwidth 2D mesh designed specifically to link over 100 cores and multiple memory controllers.

I find all of this to be somewhat exhausting. We're long overdue for modular transputers. I'm envisioning small boards with 4-16 cores running at 1-4 GHz and 1-16 GB of memory, costing $100 or less with economies of scale. They would be stackable horizontally and vertically, to easily create clusters with as many cores as one desires. The cluster could appear to the user as an array of separate computers, a single multicore computer running in a unified address space, or various custom configurations. Then libraries could provide APIs to run existing 3D, AI, tensor and similar SIMD code, since it's trivial to run SIMD on MIMD but very challenging to run MIMD on SIMD. This is similar to how we often see Lisp runtimes written in C/C++, but never C/C++ runtimes written in Lisp.
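To illustrate the SIMD-on-MIMD point, here is a minimal Python sketch. It assumes threads can stand in for independent cores, which is only a toy model (CPython threads won't give real parallel speedup here). The names `saxpy_chunk` and `simd_on_mimd` are hypothetical, not from any library.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_chunk(a, xs, ys):
    """The 'SIMD kernel': the same operation applied to every element."""
    return [a * x + y for x, y in zip(xs, ys)]


def simd_on_mimd(a, xs, ys, workers=4):
    """Run one data-parallel kernel across independent workers.

    Each worker is a stand-in for a separate core running its own
    instruction stream; here they all happen to run the same kernel
    on different slices of the data.
    """
    n = len(xs)
    size = max(1, -(-n // workers))  # ceiling division, at least 1
    ranges = [(i, min(i + size, n)) for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda r: saxpy_chunk(a, xs[r[0]:r[1]], ys[r[0]:r[1]]), ranges
        )
        # pool.map yields results in submission order, so slices line up.
        return [value for part in parts for value in part]


if __name__ == "__main__":
    print(simd_on_mimd(2, [1, 2, 3, 4, 5], [10, 20, 30, 40, 50], workers=2))
```

Going the other way is the hard part: workers that each want to run different code can't be expressed as one kernel applied in lockstep.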

It would have been unthinkable to design such a thing even a year ago, but with the arrival of AI, it now seems straightforward, even pedestrian. If this design ever manifests, I do wonder how hard it would be to get into a fab. It's a chicken-and-egg problem, because people can't imagine a world that isn't compute-bound, just like they couldn't imagine a world after the arrival of AI.

Edit: https://news.ycombinator.com/item?id=47506641 has Arm AGI specs. Looks like it has DDR5-8800 (12x DDR5 channels) so that's just under 12 cores per bus, which actually aligns well with Amdahl's law. Maybe Arm is building the transputer I always wanted. I just wish prices were an order of magnitude lower so that we could actually play around with this stuff.
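A rough sketch of the per-core bandwidth math in Python, using the 136 cores and 12x DDR5-8800 figures above. I'm assuming each channel is 64 bits (8 bytes) wide and using peak theoretical rates, so sustained numbers will be lower. The function names are just for illustration.

```python
# Peak memory bandwidth per core from the quoted specs.
# Assumptions (mine): 8 bytes per transfer per channel, peak rates only.


def channel_bandwidth_gbs(mega_transfers_per_s, bytes_per_transfer=8):
    """Peak bandwidth of one memory channel in GB/s."""
    return mega_transfers_per_s * bytes_per_transfer / 1000


def per_core_share(cores, channels, mega_transfers_per_s):
    """Cores per channel, total peak bandwidth, and peak bandwidth per core."""
    total = channels * channel_bandwidth_gbs(mega_transfers_per_s)
    return {
        "cores_per_channel": cores / channels,
        "total_gbs": total,
        "gbs_per_core": total / cores,
    }


if __name__ == "__main__":
    share = per_core_share(cores=136, channels=12, mega_transfers_per_s=8800)
    for name, value in share.items():
        print(f"{name}: {value:.1f}")
```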

1 comment

pixelpoet 4 days ago

Amdahl's law is about the maximum speedup obtainable from parallelism, not balancing memory bandwidth with compute.
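For reference, the formula is short enough to sketch in Python. With a parallel fraction p and n cores, the speedup is 1 / ((1 - p) + p / n). The function name and the example value p = 0.95 are made up for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Maximum speedup when only `parallel_fraction` of the work can be
    spread across `cores`; the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    p = 0.95  # illustrative value, not a measurement
    for n in (4, 8, 64, 136):
        print(f"{n:>3} cores: {amdahl_speedup(p, n):.2f}x")
    # As cores grow without bound, the speedup approaches 1 / (1 - p).
    print(f"limit: {1.0 / (1.0 - p):.1f}x")
```

Memory bandwidth doesn't appear anywhere in it; the only inputs are the serial fraction and the core count.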