Comment by grg0
1 year ago
The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.
- You need to compile shader source/bytecode at runtime; you can't just "run" a program.
- On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.
- You need to synchronize data access between CPU-GPU and GPU workloads.
- You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.
- You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.
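The copy-and-deferred-sync complaint shows up even in high-level APIs. A minimal sketch in JAX (which falls back to CPU if no GPU is present) of the explicit host-to-device copy and the point where synchronization actually happens:

```python
import jax
import numpy as np

# Host-side data (CPU memory).
host = np.arange(8, dtype=np.float32)

# Explicit host -> device copy; on a discrete GPU this crosses the bus.
dev = jax.device_put(host)

# Dispatch is asynchronous; nothing forces CPU/GPU synchronization yet.
out = dev * 2.0

# Only this call blocks the CPU waiting on the device.
out.block_until_ready()
```

The design follows the point above: the algorithm runs best when you batch work on the device and synchronize as late as possible.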
What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
>What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does.
For "embarrassingly parallel" jobs, vector extensions are starting to eat tiny bits of the GPU pie.
Unfortunately, just slapping thousands of cores together works poorly in practice. You quickly hit the synchronization wall caused by unified memory. GPUs cleverly work around this issue with numerous tricks, often hidden behind extremely complex drivers (IIRC CUDA exposes some of this complexity).
The future may be in a more explicit NUMA, i.e. in a "network of cores". Such hardware would expose a lot of cores with their own private memory (explicit caches, if you will) and you would need to explicitly transact with the bigger global memory. But, unfortunately, programming such hardware would be much harder (especially if code has to be universal enough to target different specs), so I don't have high hopes for such a paradigm to become massively popular.
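A toy sketch of that explicit model, using OS processes as stand-ins for cores with private memory (an illustration of the paradigm, not real hardware):

```python
from multiprocessing import Process, Queue

def worker(inbox: Queue, outbox: Queue) -> None:
    # Each "core" owns private memory: data exists here only after an
    # explicit transfer from the global side (an explicit cache, if you will).
    local = inbox.get()               # explicit transaction: global -> private
    squared = [x * x for x in local]  # compute purely on private data
    outbox.put(squared)               # explicit transaction: private -> global

if __name__ == "__main__":
    # One process per "core"; all data movement is explicit message passing.
    inbox, outbox = Queue(), Queue()
    core = Process(target=worker, args=(inbox, outbox))
    core.start()
    inbox.put([1, 2, 3])
    print(outbox.get())  # [1, 4, 9]
    core.join()
```

The pain point is visible even in this toy: every transfer between private and global memory is code you have to write and get right.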
It’s weird that no one mentioned Xeon Phi cards… that’s essentially what they were. Up to 72 (iirc?) x86 Atom-derived cores, fully generically programmable.
I consider Xeon Phi to be the shipping version of Larrabee. I've updated the post to mention it.
Seems to me there's a trend of applying explicit distributed systems (a network of cores each with a small SRAM and some SIMD, explicit high-bandwidth message passing between them, maybe some specialized ASICs such as tensor cores, FFT blocks...). Looking at Tenstorrent, Cerebras, even Kalray... outside of the CUDA/GPU world, accelerators seem to be converging a bit. We're going to need a whole lot of tooling, hopefully relatively 'meta'.
Networks of cores... congrats, you have just taken a computer and shrunk it so there are many on a single chip... Just gonna say here, AWS does exactly this network-of-computers thing... might be profitable
What I want is a Linear Algebra interface - As Gilbert Strang taught it. I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.
I'm not willing to even know about the HW at all; the higher level my code, the more opportunities for the JIT to optimize it.
What I really want is something like Mathematica that can JIT to GPU.
As another commenter mentioned, all the APIs assume you're driving a discrete GPU off the end of a slow bus, without shared memory. I would kill for an APU that could freely allocate memory for GPU or CPU and change ownership with the speed of a page fault or kernel transition.
> What I really want is something like Mathematica that can JIT to GPU.
https://juliagpu.org/
https://github.com/jax-ml/jax
To expand on this link, this is probably the closest you're going to get to 'I'll "program" in LinAlg, and a JIT can compile it to whatever wonky way your HW requires.' right now. JAX implements a good portion of the Numpy interface - which is the most common interface for linear algebra-heavy code in Python - so you can often just write Numpy code, but with `jax.numpy` instead of `numpy`, then wrap it in a `jax.jit` to have it run on the GPU.
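A sketch of that workflow (the least-squares function here is illustrative): write the linear algebra in `jax.numpy`, wrap it in `jax.jit`, and XLA compiles it for whatever backend is available:

```python
import jax
import jax.numpy as jnp

@jax.jit
def least_squares(A, b):
    # Plain linear algebra; XLA compiles this for CPU, GPU, or TPU.
    # Solves the normal equations x = (A^T A)^{-1} A^T b.
    return jnp.linalg.solve(A.T @ A, A.T @ b)

A = jnp.array([[1.0, 0.0], [0.0, 2.0]])
b = jnp.array([1.0, 4.0])
print(least_squares(A, b))
```

You never say how the hardware should do it; the JIT decides, which is exactly the "I'll program in LinAlg" wish above.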
I was about to say that it is literally just JAX.
It genuinely deserves to exist alongside PyTorch. It's not just Google's latest framework that you're forced to use to target TPUs.
Like, PyTorch? And the new Mac Studio has 512 GB of unified memory
You can have that today. Just go out and buy more CPUs until they have enough cores to equal the number of SMs in your GPU (or memory bandwidth, or whatever). The problem is that the overhead of being general purpose -- prefetch, speculative execution, permissions, complex shared cache hierarchies, etc -- comes at a cost. I wish it was free, too. Everyone does. But it just isn't. If you have a workload that can jettison or amortize these costs due to being embarrassingly parallel, the winning strategy is to do so, and those workloads are common enough that we have hardware for column A and hardware for column B.
> The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths.
To me it feels somewhat like programming for the segmented memory model with its near and far pointers, back in the old days. What a nightmare.
Larrabee was something like that; it didn't take off.
IMHO, the real issue is cache coherence. GPUs are spared from doing a lot of extra work by relaxing coherence guarantees quite a bit.
Regarding the vendor situation - that's basically how most of computing hardware is, save for the PC platform. And this exception is due to Microsoft successfully commoditizing their complements (which caused quite some woe on the software side back then).
Is cache coherence a real issue, absent cache contention? AIUI, cache coherence protocols are sophisticated enough that they should readily adapt to workloads where the same physical memory locations are mostly not accessed concurrently except in pure "read only" mode. So even with a single global address space, it should be possible to make this work well enough if the programs are written as if they were running on separate memories.
It is because cache coherence requires extra communication to make sure that the caches stay coherent. There are cute strategies for reducing the traffic, but ultimately you need to broadcast reservations out to all of the other cache-coherent nodes, so there's an N^2 scaling at play.
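A back-of-the-envelope model of that N^2 scaling (a toy count, assuming naive write-broadcast snooping with no filtering or directories):

```python
def snoop_messages(nodes: int, writes_per_node: int) -> int:
    # Naive snooping coherence: every write must be broadcast to the
    # other (nodes - 1) caches, so total traffic scales as O(nodes^2).
    return nodes * writes_per_node * (nodes - 1)

# Doubling the node count roughly quadruples the coherence traffic:
print(snoop_messages(8, 100))   # 5600
print(snoop_messages(16, 100))  # 24000
```

Real protocols (directories, filters) shave the constant, but the quadratic pressure is why coherence across thousands of cores is hard.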
I miss, not exactly Larrabee, but what it could have become. I want just an insane number of very fast, very small cores with their own local memory.
In the field usually nothing takes off on the first attempt, so this is just a reason to ask "what's different this time" on the following attempts.
> What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory...
I too am very interested in this model. The Linux kernel supports up to 4,096 cores [1] on a single machine. In practice, you can rent a c7a.metal-48xl [2] instance on AWS EC2 with 192 vCPU cores. As for programming models, I personally find the Java Streams API [3] extremely versatile for many programming workloads. It effectively gives a linear speedup on serial streams for free (with some caveats). If you need something more sophisticated, you can look into OpenMP [4], an API for shared-memory parallelization.
I agree it is time for some new ideas in this space.
[1]: https://www.phoronix.com/news/Perf-Support-2048-To-4096-Core...
[2]: https://aws.amazon.com/ec2/instance-types/c7a/
[3]: https://docs.oracle.com/en/java/javase/24/docs/api/java.base...
[4]: https://docs.alliancecan.ca/wiki/OpenMP
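A rough Python analogue of the parallel-stream pattern described above (a sketch; Java's Streams API is the real thing being referenced):

```python
from concurrent.futures import ThreadPoolExecutor

def work(x: int) -> int:
    # Stand-in for per-element work; in real code this would release the
    # GIL (NumPy, I/O) or you'd use a process pool for pure-Python work.
    return x * x

def parallel_map(xs):
    # Fan the elements out across a pool of worker threads, analogous to
    # xs.parallelStream().map(work).toList() in Java.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(work, xs))

print(parallel_map(range(5)))  # [0, 1, 4, 9, 16]
```

Same caveat as the Java version: the speedup is only near-linear when the per-element work dominates the scheduling overhead.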
Yep, and those printers are proprietary and mutually incompatible, and there are buggy mutually incompatible serial drivers on all the platforms which results in unique code paths and debugging & workarounds for app breaking bugs for each (platform, printer brand, printer model year) tuple combo.
(That was idealized - actually there may be ~5 alternative driver APIs even on a single platform each with its own strengths)
I really would like you to sketch out the DX you are expecting here, purely for my understanding of what it is you are looking for.
I find needing to write separate code in a different language annoying, but the UX of it is very explicit about what is happening in memory, which is very useful. With really high-performance compute across multiple cores, ensuring you don't get arbitrary cache misses is a pain. If we could address CPUs like we address current GPUs (well, you can, but it's not generally done) it would make it much, much simpler.
Want to alter something in parallel? Copy it to memory allocated to a specific core which is guaranteed to only be addressed by that core, and then do the operations on it.
To do that currently you need to be pedantic about alignment and manually indicate thread affinity to the scheduler etc., which is entirely as annoying as GPU programming.
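For example, the manual-affinity step on Linux looks something like this (`os.sched_setaffinity` is Linux-only, hence the guard):

```python
import os

def pin_to_core(core: int) -> None:
    # Restrict the calling process to a single core so its working set
    # stays warm in that core's private cache.
    os.sched_setaffinity(0, {core})  # pid 0 == the calling process

if hasattr(os, "sched_setaffinity"):  # Linux-only API
    pin_to_core(0)
    print(os.sched_getaffinity(0))  # {0}
```

And that's only the affinity half; the alignment half (padding data to cache-line boundaries to avoid false sharing) is a separate chore on top.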
Your wish sounds to me a lot like Larrabee/Xeon Phi or manycore CPUs. Maybe I am misunderstanding something, but it sounds like a good idea to me and I don’t totally see why it inherently can’t compete with GPUs.
I think Intel should have made more of an effort to get cheap Larrabee boards to developers; they could have used chips with some broken cores, or ones unable to make the design speed.
RAM size seems to have been a problem: the lowest-end Phi had only 6 GB of GDDR5 for its 57 cores (228 threads).
Doesn't matter; the issues you raise are abstractable at the language level, or maybe even in the runtime. Unfortunately there are others, like which of the many kinds of parallelism to use (ILP, thread, vector/SIMD, distributed memory with much lower performance, etc.), that are harder to hide behind a compiler with acceptable performance.
Please explain how these "worker cores" should operate.
So GreenArrays' F18? :)
"want to protect their turd" - golden!