Comment by solaarphunk
1 month ago
I've been building something similar (GPU-native OS research project) and wanted to share a mental model shift that unlocked things for me.
The question "why run CPU code on GPU when GPU cores are slower?" assumes you're running ONE program. But GPUs execute in SIMD groups of 32 threads (warps on NVIDIA; AMD wavefronts are typically 64-wide) - and here's the trick: each of those lanes can run a DIFFERENT process. Same instruction, different data. Calculator on lane 0, text editor on lane 1, file indexer on lane 2. No divergence, legal SIMD, full utilization. Suddenly you're not running "slow CPU code on GPU" - you're running 32 independent programs in parallel on hardware designed for exactly this pattern.
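To make the lane-per-process idea concrete, here's a plain-Python CPU simulation of it (everything here - the register names, the toy opcodes - is invented for illustration; on a real GPU the inner loop is the SIMD hardware itself):

```python
# CPU simulation of the lane-per-process idea: 32 lanes execute the SAME
# instruction stream in lockstep, but each lane carries its own data.

NUM_LANES = 32

# One register file per lane: lane i starts with a different operand,
# standing in for 32 distinct processes (calculator, editor, indexer, ...).
lanes = [{"acc": 0, "x": i} for i in range(NUM_LANES)]

# A single shared instruction stream: every lane executes every instruction.
program = [
    ("load_x",),      # acc = x
    ("mul_imm", 3),   # acc *= 3
    ("add_imm", 1),   # acc += 1
]

def step(instr, regs):
    op = instr[0]
    if op == "load_x":
        regs["acc"] = regs["x"]
    elif op == "mul_imm":
        regs["acc"] *= instr[1]
    elif op == "add_imm":
        regs["acc"] += instr[1]

# Lockstep execution: one instruction at a time, applied to all 32 lanes.
for instr in program:
    for regs in lanes:  # on a GPU this inner loop IS the SIMD width
        step(instr, regs)

results = [regs["acc"] for regs in lanes]
print(results[:4])  # each lane computed 3*x + 1 on its own x: [1, 4, 7, 10]
```

Same instruction, different data: all 32 "processes" made progress on every step, and because they share one instruction stream there is nothing to diverge.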
The win isn't throughput for compute-heavy code. It's eliminating CPU roundtrips for interactive stuff. Every kernel launch, every synchronization, every "GPU done, back to CPU, dispatch next thing" adds latency. A persistent kernel that polls for input, updates state, and renders - all without returning to CPU - changes the responsiveness equation entirely.
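The latency argument is just arithmetic. Here's a toy model (the overhead and work numbers are invented, not measurements) of why a persistent loop wins for interactive workloads:

```python
# Toy cost model: each CPU->GPU roundtrip pays a fixed dispatch overhead;
# a persistent kernel pays the launch cost once. All numbers are assumed.
LAUNCH_OVERHEAD_US = 10.0   # per-kernel-launch + sync cost (assumption)
WORK_US = 1.0               # actual work per input event (assumption)
EVENTS = 100

# Orchestrated from the CPU: every event pays the full roundtrip.
roundtrip_total = EVENTS * (LAUNCH_OVERHEAD_US + WORK_US)

# Persistent kernel: one launch, then the GPU handles events itself.
persistent_total = LAUNCH_OVERHEAD_US + EVENTS * WORK_US

print(roundtrip_total, persistent_total)  # 1100.0 vs 110.0
```

The per-event latency matters even more than the totals: with CPU orchestration every single input pays the dispatch overhead before any pixel can change.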
A few things to try at home if you're curious:
1. Write a Metal/CUDA kernel with while(true) and an atomic shutdown flag. See how long it runs. (Spoiler: indefinitely, if you do it right)
2. Put 32 different "process states" in a buffer and have each SIMD lane execute instructions for its own process. Watch all 32 make progress simultaneously.
3. Measure the latency from "input event" to "pixel on screen" with CPU orchestration vs GPU polling an input queue directly. The difference surprised me.
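Experiment 1 can be previewed on the CPU before touching a GPU API. This sketch uses a worker thread as the "persistent kernel" and a `threading.Event` standing in for the atomic shutdown flag; the structure (spin, poll an input queue, update state, never return to the host between items) is the same:

```python
# CPU sketch of the persistent-kernel pattern: spin on an input queue and
# an atomic shutdown flag, never returning to the "host" between items.
import queue
import threading
import time

shutdown = threading.Event()   # stands in for the atomic shutdown flag
inbox = queue.Queue()          # stands in for the input queue the GPU polls
results = []

def persistent_kernel():
    # The while(true) loop: poll for input, update state, repeat,
    # until the host flips the flag.
    while not shutdown.is_set():
        try:
            item = inbox.get(timeout=0.01)
        except queue.Empty:
            continue               # nothing to do; keep spinning
        results.append(item * 2)   # "update state and render" stand-in

t = threading.Thread(target=persistent_kernel)
t.start()

for i in range(5):     # the host enqueues work WITHOUT relaunching anything
    inbox.put(i)

time.sleep(0.2)        # let the kernel drain the queue
shutdown.set()         # flip the flag; the loop exits cleanly
t.join()
print(sorted(results))  # [0, 2, 4, 6, 8]
```

The real GPU version replaces the queue with a ring buffer in device-visible memory and the `Event` with an atomic flag the host writes, but the shutdown discipline - loop on a flag, never just `return` - carries over directly.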
The persistent kernel thing has a nasty gotcha though - ALL 32 threads must participate in the while loop. If you do `if (tid != 0) return;` and then `while(true)`, it'll work for a few million iterations and then hard-lock. Ask me how I know.
If you're running vastly different processes in different ALU lanes, the single master "program" that multiplexes them is effectively an interpreter. And then it's hard to have the exact same control flow produce vastly different effects in different processes, especially once you account for branches. This works well for inference batches, since those are essentially straight-line processing, but not much else.
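The branch objection can be made concrete with the same lockstep model: when lanes disagree about a branch, SIMD hardware executes BOTH sides and masks lanes off, so divergent work serializes. A minimal sketch (lane counts and the 2x figure follow directly from the two-sided branch; nothing here is measured on real hardware):

```python
# Why divergence hurts the one-interpreter-per-lane model: when lanes want
# different branches, lockstep execution runs BOTH paths with lanes masked.
NUM_LANES = 32
x = list(range(NUM_LANES))
out = [0] * NUM_LANES

cond = [v % 2 == 0 for v in x]   # half the lanes take each side
work = 0                         # count instruction-slots actually issued

# "then" side: ALL lanes step through it; the odd lanes are masked off.
for i in range(NUM_LANES):
    work += 1
    if cond[i]:
        out[i] = x[i] + 100

# "else" side: ALL lanes step through it again; the evens are masked off.
for i in range(NUM_LANES):
    work += 1
    if not cond[i]:
        out[i] = x[i] - 100

print(work)  # 64 slots issued for 32 lanes' worth of useful work: 2x cost
```

With a two-way branch the cost doubles; an interpreter dispatching on 32 different opcodes can in the worst case serialize all 32 lanes.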
It'll go much faster if you give each process a warp instead of a thread. That way each process has its own instruction pointer and its own set of vector registers, and when your editor takes a different branch than your browser, there's no cost.
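The warp-per-process layout can be sketched the same way - the key structural change is that each process now owns its own instruction pointer, so two processes sitting at different points in different programs never interfere (toy opcodes and process names are invented for illustration):

```python
# Warp-per-process: each process owns its own instruction pointer, so
# different processes follow different control flow at no cost to each other.
programs = {
    "editor":  [("add", 1), ("add", 1), ("halt",)],
    "browser": [("add", 5), ("halt",)],   # shorter program, different path
}

# Per-process state: own IP, own accumulator (its "vector registers").
procs = {name: {"ip": 0, "acc": 0, "prog": prog}
         for name, prog in programs.items()}

def tick(p):
    # One scheduler step for one process; warps advance independently.
    instr = p["prog"][p["ip"]]
    if instr[0] == "add":
        p["acc"] += instr[1]
        p["ip"] += 1
    elif instr[0] == "halt":
        pass  # parked at halt; nothing to reconverge with

for _ in range(3):            # three ticks; each process follows its own IP
    for p in procs.values():
        tick(p)

print(procs["editor"]["acc"], procs["browser"]["acc"])  # 2 5
```

Divergence still exists WITHIN a warp, but that's now divergence within one process's own code, which is the normal cost any CPU-style program already pays.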