The paper defines this structure
which looks like it is 64 bits (8 bytes) in size,
and then says
> Since a single strip has a memory footprint of 64 bytes and a single alpha value is stored as u8, the necessary storage amounts to around 259 ∗ 64 + 7296 ≈ 24KB
am I missing something, or is it actually 259*8 + 7296 ≈ 9KB?
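To make the arithmetic concrete, here is a quick back-of-the-envelope check in Rust. The `Strip` fields below are placeholders for an 8-byte layout, not the paper's actual definition:

```rust
// Placeholder 8-byte strip layout (hypothetical field names), used only to
// sanity-check the storage estimate quoted above.
#[allow(dead_code)]
struct Strip {
    x: u16,
    y: u16,
    alpha_idx: u32,
}

fn main() {
    let strip_bytes = 259 * std::mem::size_of::<Strip>(); // 259 * 8 = 2072
    let alpha_bytes = 7296;                               // one u8 per alpha value
    let total = strip_bytes + alpha_bytes;                // 9368 bytes
    println!("{total} bytes ≈ {} KB", total / 1024);      // ≈ 9 KB
}
```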
Admittedly I won't have time to go through the code. However, from a quick look at the thesis, there's a section on multi-threading.
Whilst it's still very possible this was a simple mistake, an alternate explanation could be that each strip is allocated to a unique cache line. On modern x86_64 systems, a cache line is 64 bytes. If the renderer is attempting to mitigate false sharing, then it may be allocating each strip in its own cache line, instead of contiguously in memory.
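A minimal sketch of what that would look like, again with a hypothetical strip type; whether the implementation actually does this is not confirmed here:

```rust
// Hypothetical illustration of the cache-line theory above: aligning each
// strip to 64 bytes pads its size to a full cache line, even though the
// fields themselves only need 8 bytes.
#[allow(dead_code)]
#[repr(align(64))]
struct PaddedStrip {
    x: u16,
    y: u16,
    alpha_idx: u32,
}

fn main() {
    assert_eq!(std::mem::align_of::<PaddedStrip>(), 64);
    assert_eq!(std::mem::size_of::<PaddedStrip>(), 64); // 259 * 64 would match the thesis figure
}
```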
i think you are correct: memory use of the implementation is overestimated in that paragraph and, as you suggest, it is lower. from a quick skim read, the benchmarks section focuses on comparing running time against other libraries; there isn't a comparison of storage.
Also check out blaze: https://gasiulis.name/parallel-rasterization-on-cpu/
Thanks for the pointer; we were not actually aware of this, and the claimed benchmark numbers look really impressive.
there were at least two renderers written for the CM2 that used strips. at least one of them used scans and general communication, most likely both.
1) for the given processor set, where each processor holds an object, 'spawn' a processor in a new set, one processor for each span.
(a) the spawn operation consists of the source processor setting the number of nodes in the new domain, then performing an add-scan, then sending the total allocation back to the front end (a serial sketch of this step follows below).
the front end then allocates a new power-of-2 shape that can hold those.
the object-set then uses general communication to send scan information to the first of these in the strip-set (the address is left over from the scan).
(b) in the strip-set, use a mask-copy-scan to get all the parameters to all the elements of the scan set.
(c) each of these elements of the strip set determines the pixel location of the leftmost element.
(d) use a general send to seed the strip with the parameters of the strip.
(e) scan those using a mask-copy-scan in the pixel-set.
(f) apply the shader or the interpolation in the pixel-set.
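A serial Rust sketch of the scan-allocate step in (a); the names and the sequential loop are illustrative only, since the CM2 performed this as a parallel add-scan:

```rust
// Illustrative, serial version of the add-scan allocation: each source
// processor contributes its span count, the running sum gives every object's
// seed address in the new strip set, and the total goes back to the front end.
fn main() {
    let spans_per_object = [3usize, 1, 4, 2]; // one entry per source processor
    let mut seed_addresses = Vec::new();
    let mut total = 0;
    for &n in &spans_per_object {
        seed_addresses.push(total); // "the address is left over from the scan"
        total += n;                 // add-scan running sum
    }
    // the front end would then allocate a power-of-2 shape holding at least `total` processors
    println!("allocate {} strip processors, seeds at {:?}", total.next_power_of_two(), seed_addresses);
}
```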
note that steps (d) and (e) also depend on encoding the depth information in the high bits and using a max combiner to perform z-buffering.
Edit: there must have been an additional span/scan in pixel space that is then sent to image space with z-buffering; otherwise strip seeds could collide and be sorted by z, which may miss pixels from the losing strip.
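A tiny sketch of the packing behind the depth-in-high-bits note above; the 16/16 split and the names are made up for illustration, not taken from the CM2 renderers:

```rust
// Pack depth above the payload so that an unsigned max over the packed word
// resolves visibility (the "max combiner"), with the low bits carrying the
// winning strip's data.
fn pack(depth: u16, payload: u16) -> u32 {
    ((depth as u32) << 16) | payload as u32
}

fn main() {
    let mut pixel_word: u32 = 0;
    for word in [pack(100, 0xAAAA), pack(200, 0xBBBB)] { // two seeds collide on one pixel
        pixel_word = pixel_word.max(word);               // max combiner keeps the larger depth key
    }
    println!("winning payload: 0x{:04X}", pixel_word & 0xFFFF); // 0xBBBB
}
```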
the demo is astonishing
Fascinating project. Based on section 3.9, it seems the output is in the form of a bitmap. So I assume you have to do a full memory copy to the GPU to display the image in the end. With skia moving to WebGPU[0] and with WebGPU supporting compute shaders, I feel that 2D graphics is slowly becoming a solved problem in terms of portability and performance. Of course there are cases where you would want a CPU renderer. Interestingly the web is sort of one of them because you have to compile shaders at runtime on page load. I wonder if it could make sense in theory to have multiple stages to this, sort of like how JS JITs work, where you would start with a CPU renderer while the GPU compiles its shaders. Another benefit, as the author mentions, is binary size. WebGPU (via dawn at least) is rather large.
[0] https://blog.chromium.org/2025/07/introducing-skia-graphite-...
The output of this renderer is a bitmap, so you have to do an upload to GPU if that's what your environment is. As part of the larger work, we also have Vello Hybrid which does the geometry on CPU but the pixel painting on GPU.
We have definitely thought about having the CPU renderer while the shaders are being compiled (shader compilation is a problem) but haven't implemented it.
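For what it's worth, a minimal sketch of how such a staged fallback could be wired; the types here are entirely hypothetical placeholders, not an existing API:

```rust
// Hypothetical staged setup: render on the CPU until the GPU pipeline
// (including shader compilation) is ready, then swap backends.
use std::sync::mpsc::{channel, Receiver};

struct CpuRenderer;
struct GpuRenderer;

enum Backend {
    // CPU path, plus a channel that will deliver the GPU path once compiled.
    Cpu(CpuRenderer, Receiver<GpuRenderer>),
    Gpu(GpuRenderer),
}

impl Backend {
    fn render_frame(&mut self) {
        // Promote to the GPU backend as soon as compilation finishes.
        let ready = match self {
            Backend::Cpu(_, rx) => rx.try_recv().ok(),
            Backend::Gpu(_) => None,
        };
        if let Some(gpu) = ready {
            *self = Backend::Gpu(gpu);
        }
        match self {
            Backend::Cpu(..) => { /* rasterize on the CPU and upload the bitmap */ }
            Backend::Gpu(_) => { /* record GPU commands directly */ }
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    std::thread::spawn(move || {
        // stand-in for asynchronous shader compilation
        let _ = tx.send(GpuRenderer);
    });
    let mut backend = Backend::Cpu(CpuRenderer, rx);
    for _ in 0..3 {
        backend.render_frame();
    }
}
```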
In any interactive environment you have to upload to the GPU on each frame to output to a display, right? Or maybe integrated SoCs can skip that? Of course you only need to upload the dirty rects, but in the worst case the full image.
> geometry on CPU but the pixel painting on GPU
Wow. Is this akin to running just the vertex shader on the CPU?
One place where a CPU renderer is particularly useful is in test runners (where the output of the test is an image/screenshot). Or I guess any other use cases where the output is an image. In that case, the output never needs to get to the GPU, and indeed if you render on the GPU then you have to copy the image back!
> "I assume you have to do a full memory copy to the GPU to display the image in the end."
On a unified memory architecture (eg: Apple Silicon), that's not an expensive operation. No copy required.
Unfortunately graphics APIs suck pretty hard when it comes to actually sharing memory between CPU and GPU. A copy is definitely required when using WebGPU, and also on discrete cards (which is what these APIs were originally designed for). It's possible that using native APIs directly would let us avoid copies, but we haven't done that.
This looks interesting; recently I wrote some code for rendering high-precision N-body paths with millions of vertices[0]. I wonder if a GPU implementation of this RLE representation would work well and maintain simplicity.
[0] https://www.youtube.com/watch?v=rmyA9AE3hzM
Side question. Is there some kind of benchmark to test the correctness of renderers?
This was the original goal of the Cornell box (https://en.wikipedia.org/wiki/Cornell_box): carefully measure the radiosity of a simple, real-world scene and then see how closely you can come to simulating it.
For realtime rendering a common thing to do is to benchmark against a known-good offline renderer (e.g. Arnold, Octane)
Correctness of what exactly? It's a "render" of a reality-like environment, so all of them make some tradeoff somewhere and won't be 100% "correct", at least compared to reality :)
Bezier curves can generate degenerate geometry when flattened and stroke geometry has to handle edge cases. See for instance the illustration on the last page of the Polar Stroking paper: https://arxiv.org/pdf/2007.00308
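As a small illustration of the degenerate-geometry point (sketch only; uniform-t flattening is used here for brevity, where real renderers use error-bounded flattening):

```rust
// Flattening a cubic whose control points all coincide yields zero-length
// segments, so the stroke direction at those segments is undefined and the
// stroker has to special-case them.
fn cubic(p: [(f64, f64); 4], t: f64) -> (f64, f64) {
    let u = 1.0 - t;
    let b = [u * u * u, 3.0 * u * u * t, 3.0 * u * t * t, t * t * t];
    (
        b.iter().zip(&p).map(|(w, q)| w * q.0).sum(),
        b.iter().zip(&p).map(|(w, q)| w * q.1).sum(),
    )
}

fn main() {
    let degenerate = [(1.0, 1.0); 4]; // all four control points coincide
    let flattened: Vec<_> = (0..=4).map(|i| cubic(degenerate, f64::from(i) / 4.0)).collect();
    println!("{flattened:?}"); // every flattened segment has zero length
}
```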
There are also things like interpreting (conflating) coverage as alpha in analytical antialiasing methods, which can lead to visible hairline cracks.
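To make the conflation artifact concrete, a one-pixel worked example (illustrative numbers only):

```rust
// Two shapes share an edge and each covers half of a boundary pixel. If
// coverage is conflated with alpha and composited with the "over" operator,
// the pixel ends up only 75% opaque instead of fully covered, which reads
// as a hairline crack along the seam.
fn main() {
    let coverage_left = 0.5_f64;  // left shape covers half the pixel
    let coverage_right = 0.5_f64; // right shape covers the other half
    let over = coverage_right + coverage_left * (1.0 - coverage_right);
    println!("combined opacity = {over}"); // 0.75, not 1.0
}
```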
Correctness with respect to the benchmark. A slow reference renderer could produce the target image, and renderers need to achieve either exact or close reproduction of the reference. Otherwise, you could just make substantial approximations and claim a performance victory.
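A minimal sketch of what such a reference comparison can look like in practice; the tolerance and helper here are arbitrary, not from any particular test suite:

```rust
// Compare a renderer's output against a slow reference image, allowing a
// small per-channel tolerance for "close" (rather than bit-exact) reproduction.
fn max_channel_diff(output: &[u8], reference: &[u8]) -> u8 {
    output
        .iter()
        .zip(reference)
        .map(|(a, b)| a.abs_diff(*b))
        .max()
        .unwrap_or(0)
}

fn main() {
    let reference = [255u8, 128, 0, 255]; // one RGBA pixel from the reference renderer
    let output = [254u8, 129, 0, 255];    // the same pixel from the renderer under test
    assert!(max_channel_diff(&output, &reference) <= 2, "output drifted from the reference");
    println!("within tolerance");
}
```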