Comment by fngjdflmdflg
3 months ago
Fascinating project. Based on section 3.9, it seems the output is in the form of a bitmap. So I assume you have to do a full memory copy to the GPU to display the image in the end. With Skia moving to WebGPU[0] and with WebGPU supporting compute shaders, I feel that 2D graphics is slowly becoming a solved problem in terms of portability and performance. Of course there are cases where you would want a CPU renderer. Interestingly the web is sort of one of them because you have to compile shaders at runtime on page load. I wonder if it could make sense in theory to have multiple stages to this, sort of like how JS JITs work, where you would start with a CPU renderer while the GPU compiles its shaders. Another benefit, as the author mentions, is binary size. WebGPU (via Dawn at least) is rather large.
[0] https://blog.chromium.org/2025/07/introducing-skia-graphite-...
The output of this renderer is a bitmap, so you have to do an upload to GPU if that's what your environment is. As part of the larger work, we also have Vello Hybrid which does the geometry on CPU but the pixel painting on GPU.
We have definitely thought about having the CPU renderer while the shaders are being compiled (shader compilation is a problem) but haven't implemented it.
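To make the "CPU renderer while the shaders compile" idea concrete, here is a minimal sketch in Rust. It is not Vello's code: `TieredRenderer`, `Backend`, and the `compile` closure are hypothetical names, and the closure only stands in for real pipeline compilation. The point is just the hand-off: frames use the CPU path until a background thread flips a flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Which backend should draw a frame (hypothetical names).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

/// Hypothetical tiered renderer: reports `Cpu` until a background
/// "shader compile" job finishes, then reports `Gpu`.
struct TieredRenderer {
    gpu_ready: Arc<AtomicBool>,
}

impl TieredRenderer {
    /// Runs `compile` on a background thread. The returned handle lets
    /// the caller wait for compilation if it ever needs to.
    fn start<F>(compile: F) -> (Self, thread::JoinHandle<()>)
    where
        F: FnOnce() + Send + 'static,
    {
        let gpu_ready = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&gpu_ready);
        let handle = thread::spawn(move || {
            compile(); // stand-in for real shader/pipeline compilation
            flag.store(true, Ordering::Release);
        });
        (Self { gpu_ready }, handle)
    }

    /// Picks the backend for the next frame without blocking.
    fn backend_for_next_frame(&self) -> Backend {
        if self.gpu_ready.load(Ordering::Acquire) {
            Backend::Gpu
        } else {
            Backend::Cpu
        }
    }
}
```

A frame loop would call `backend_for_next_frame()` once per frame, so the switch happens at a frame boundary and never mid-draw. A real implementation would also have to make both paths produce matching pixels, which this sketch ignores.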
In any interactive environment you have to upload to the GPU on each frame to output to a display, right? Or maybe integrated SoCs can skip that? Of course you only need to upload the dirty rects, but in the worst case the full image.
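For the dirty-rect case, a rough Rust sketch of what "only upload the dirty rects" means on the CPU side. `Rect` and `extract_dirty` are made-up names, and the frame is assumed to be tightly packed RGBA8 (4 bytes per pixel, no row padding). The actual upload call depends on the graphics API and is left out.

```rust
/// Axis-aligned rectangle in pixel coordinates (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
}

/// Copies only the pixels inside `rect` out of a tightly packed RGBA8
/// frame, producing the bytes a partial texture upload would send.
/// Panics if `rect` extends past the frame.
fn extract_dirty(frame: &[u8], frame_width: usize, rect: Rect) -> Vec<u8> {
    const BYTES_PER_PIXEL: usize = 4;
    let mut out = Vec::with_capacity(rect.w * rect.h * BYTES_PER_PIXEL);
    for row in rect.y..rect.y + rect.h {
        // Byte offset of the first dirty pixel in this row.
        let start = (row * frame_width + rect.x) * BYTES_PER_PIXEL;
        let end = start + rect.w * BYTES_PER_PIXEL;
        out.extend_from_slice(&frame[start..end]);
    }
    out
}
```

The worst case mentioned above is simply a rect that covers the whole frame, in which case the output is as large as the frame itself.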
>geometry on CPU but the pixel painting on GPU
Wow. Is this akin to running just the vertex shader on the CPU?
It just depends on what architecture your computer has.
On a PC, the CPU typically has exclusive access to system RAM, while the GPU has its own dedicated VRAM. The graphics driver runs code on both the CPU and the GPU, since the GPU has its own embedded processor, so data is constantly being copied back and forth between the two memory pools.
Mobile platforms like the iPhone or macOS laptops are different: they use unified memory, meaning the CPU and GPU share the same physical RAM. That makes it possible to allocate a Metal surface that both can access, so the CPU can modify it and the GPU can display it directly.
However, you won’t get good frame rates on a MacBook if you try to draw a full-screen, pixel-perfect surface entirely on the CPU; it just can’t push pixels that fast. But you can write a software renderer where the CPU updates pixels and the GPU displays them, without copying the surface around.
Surely not if the CPU and video output device share common RAM?
Or with old VGA, the display RAM was mapped to known system RAM addresses and the CPU would write directly to it. (you could write to an off-screen buffer and flip for double/triple buffering)
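The off-screen buffer and flip can be sketched in Rust as below. This is an illustration only: `DoubleBuffer` is a hypothetical type, and real VGA page flipping changed which region of display memory was scanned out rather than swapping heap allocations. The shape is the same, though: draw into the back buffer, then swap without copying pixels.

```rust
/// Minimal double buffer: draw into `back`, then flip so it becomes `front`.
struct DoubleBuffer {
    front: Vec<u32>,
    back: Vec<u32>,
}

impl DoubleBuffer {
    /// Allocates two zeroed buffers of `pixels` 32-bit pixels each.
    fn new(pixels: usize) -> Self {
        Self {
            front: vec![0; pixels],
            back: vec![0; pixels],
        }
    }

    /// The buffer the renderer draws into; never shown directly.
    fn back_mut(&mut self) -> &mut [u32] {
        &mut self.back
    }

    /// The buffer currently being displayed.
    fn front(&self) -> &[u32] {
        &self.front
    }

    /// Swaps the two buffers. Only the Vec handles are exchanged,
    /// so no pixel data is copied.
    fn flip(&mut self) {
        std::mem::swap(&mut self.front, &mut self.back);
    }
}
```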
I regularly do remote VNC and X11 access on stuff like the Raspberry Pi Zero, and in these cases the GPU does not work; you won't be able to open a GL context at all. Also, whenever I update my kernel on Arch Linux I'm not able to open a GL context until I reboot, so I really need apps that don't need a GPU context just to show stuff.
It's analogous, but vertex shaders are just triangles, and in 2D graphics you have a lot of other stuff going on.
The actual process of fine rasterization happens in quads, so there's a simple vertex shader that runs on GPU, sampling from the geometry buffers that are produced on CPU and uploaded.
One place where a CPU renderer is particularly useful is in test runners (where the output of the test is an image/screenshot). Or I guess any other use cases where the output is an image. In that case, the output never needs to get to the GPU, and indeed if you render on the GPU then you have to copy the image back!
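As a sketch of the test-runner case: once the renderer hands back a bitmap in ordinary memory, a screenshot test is just a buffer comparison. The function name and the per-channel tolerance below are assumptions for illustration, not any particular test harness's API, and both images are assumed to be RGBA8.

```rust
/// Counts pixels that differ by more than `tolerance` in any channel
/// between two RGBA8 images. Returns `None` if the buffers have
/// different lengths or are not a whole number of pixels.
fn count_mismatched_pixels(actual: &[u8], expected: &[u8], tolerance: u8) -> Option<usize> {
    if actual.len() != expected.len() || actual.len() % 4 != 0 {
        return None;
    }
    let mismatched = actual
        .chunks_exact(4)
        .zip(expected.chunks_exact(4))
        .filter(|(a, e)| {
            // A pixel mismatches if any of its four channels is off by
            // more than the tolerance.
            a.iter().zip(e.iter()).any(|(x, y)| x.abs_diff(*y) > tolerance)
        })
        .count();
    Some(mismatched)
}
```

A test would typically assert that the count is zero, or below some small threshold, against a stored reference image. With a GPU renderer the same check first needs a readback into CPU memory.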
> "I assume you have to do a full memory copy to the GPU to display the image in the end."
On a unified memory architecture (e.g. Apple Silicon), that's not an expensive operation. No copy required.
Unfortunately graphics APIs suck pretty hard when it comes to actually sharing memory between CPU and GPU. A copy is definitely required when using WebGPU, and also on discrete cards (which is what these APIs were originally designed for). It's possible that using native APIs directly would let us avoid copies, but we haven't done that.