
Comment by newpavlov

1 year ago

>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.

For "embarrassingly parallel" jobs, vector extensions are starting to eat small bits of the GPU pie.
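A minimal sketch of the kind of loop this covers: a plain element-wise `a*x + y` that LLVM can auto-vectorize into SSE/AVX (or NEON) instructions, so ordinary CPU code already handles small embarrassingly parallel jobs without a GPU.

```rust
// Element-wise a*x + y: the classic pattern the compiler turns into
// packed vector instructions when optimizations are enabled.
fn saxpy(a: f32, x: &[f32], y: &[f32]) -> Vec<f32> {
    // The iterator form avoids bounds checks in the hot loop, which
    // makes it easy for LLVM to emit vectorized code for the whole slice.
    x.iter().zip(y).map(|(&xi, &yi)| a * xi + yi).collect()
}

fn main() {
    let x = vec![1.0f32; 8];
    let y = vec![2.0f32; 8];
    println!("{:?}", saxpy(3.0, &x, &y)); // every element: 3*1 + 2 = 5
}
```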

Unfortunately, just slapping thousands of cores onto a chip works poorly in practice: you quickly hit a synchronization wall caused by the unified memory. GPUs cleverly work around this issue with numerous tricks, often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).
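A toy illustration of the pattern behind that wall (not any particular GPU's mechanism): many threads updating one shared location in unified memory. The result stays correct, but every update serializes on the single cache line holding the counter, which is exactly what stops scaling as core counts grow.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Every thread hammers the same counter. Correctness is preserved,
// but the shared cache line becomes the bottleneck as threads scale.
fn contended_count(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    println!("{}", contended_count(8, 100_000)); // 800000, but slowly
}
```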

The future may lie in more explicit NUMA, i.e. in a "network of cores". Such hardware would expose many cores, each with its own private memory (explicit caches, if you will), and you would have to transact explicitly with the larger global memory. Unfortunately, programming such hardware would be much harder (especially if code has to be generic enough to target different specs), so I don't have high hopes for this paradigm becoming massively popular.
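A sketch of what that programming model might feel like, assuming each core may only compute on a private scratchpad and data movement to/from global memory is an explicit, visible step. The `fetch_tile`/`commit_tile` names are illustrative, not a real API.

```rust
const TILE: usize = 4;

// Hypothetical explicit-transaction model: compute happens only on
// the private scratchpad; global-memory traffic is a separate step.
fn fetch_tile(global: &[u32], offset: usize, scratch: &mut [u32; TILE]) {
    scratch.copy_from_slice(&global[offset..offset + TILE]);
}

fn commit_tile(global: &mut [u32], offset: usize, scratch: &[u32; TILE]) {
    global[offset..offset + TILE].copy_from_slice(scratch);
}

fn process(global: &mut Vec<u32>) {
    let mut scratch = [0u32; TILE];
    for offset in (0..global.len()).step_by(TILE) {
        fetch_tile(global, offset, &mut scratch); // explicit load
        for v in scratch.iter_mut() {
            *v *= 2; // compute entirely in private memory
        }
        commit_tile(global, offset, &scratch); // explicit store
    }
}

fn main() {
    let mut mem = vec![1, 2, 3, 4, 5, 6, 7, 8];
    process(&mut mem);
    println!("{:?}", mem); // [2, 4, 6, 8, 10, 12, 14, 16]
}
```

The pain point the comment predicts is visible even here: tile size and transfer scheduling leak into the algorithm, so code tuned for one memory geometry doesn't port cleanly to another.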

It’s weird that no one mentioned Xeon Phi cards… that’s essentially what they were: up to 72 (IIRC?) x86 Atom-derived cores, fully generically programmable.

  • I consider Xeon Phi to be the shipping version of Larrabee. I've updated the post to mention it.

Seems to me there's a trend of applying explicitly distributed systems: a network of small-SRAM cores, each with some SIMD, explicit high-bandwidth message passing between them, and maybe some specialized ASIC blocks (tensor cores, FFT units...). Looking at Tenstorrent, Cerebras, even Kalray, accelerators outside the CUDA/GPU world seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.
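The message-passing half of that trend can be sketched with ordinary threads and channels: each "core" owns its state privately and the only coupling is explicit messages, rather than loads and stores to shared memory.

```rust
use std::sync::mpsc;
use std::thread;

// A tiny "network of cores": each worker reduces its own private
// chunk, then sends exactly one message back over a channel.
fn fan_out_sum(inputs: Vec<Vec<u64>>) -> u64 {
    let (tx, rx) = mpsc::channel();
    let n = inputs.len();
    for chunk in inputs {
        let tx = tx.clone();
        thread::spawn(move || {
            let partial: u64 = chunk.iter().sum(); // all-local compute
            tx.send(partial).unwrap(); // explicit communication
        });
    }
    drop(tx); // close the original sender so the channel can drain
    rx.iter().take(n).sum()
}

fn main() {
    let total = fan_out_sum(vec![vec![1, 2], vec![3, 4], vec![5, 6]]);
    println!("{total}"); // 21
}
```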

Networks of cores... congrats, you have just taken a network of computers and shrunk it onto a single chip. Just gonna say here: AWS does exactly this network-of-computers thing at datacenter scale. Might be profitable.