Comment by 1zael

15 hours ago

The Vulkan compute shader implementations are cool, particularly for FFV1 and ProRes RAW. Given that these bypass fixed-function hardware decoders entirely, I'm curious about the memory bandwidth implications. FFV1's context-adaptive arithmetic coding seems inherently sequential, yet they're achieving "very significant speedups."

Are they using wavefront/subgroup operations to parallelize the range decoder across multiple symbols simultaneously? Or are they exploiting slice-level parallelism, with each workgroup handling an independent slice? The arithmetic coding dependency chain has traditionally been the bottleneck for GPU acceleration of these codecs.
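To make the second option concrete, here's a rough CPU-side sketch of what I mean by slice-level parallelism. It is not taken from the actual shaders, and I don't know that this is what they do. `decode_slice` and `decode_frame` are hypothetical names, and the running sum is only a toy stand-in for the range decoder's serial state. The point is just the shape: each slice is decoded strictly in order, but slices share no state, so each one can go to its own worker (a workgroup, on the GPU).

```python
from concurrent.futures import ThreadPoolExecutor


def decode_slice(residuals):
    """Toy stand-in for a sequential entropy decoder (hypothetical).

    Each output depends on the previous one, which mimics the serial
    dependency chain inside a slice. This is NOT a real range decoder.
    """
    out = []
    prev = 0
    for r in residuals:
        prev = (prev + r) & 0xFF  # 8-bit wraparound, like a sample value
        out.append(prev)
    return out


def decode_frame(slices, workers=4):
    """Decode independent slices concurrently, one slice per worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves slice order in the result
        return list(pool.map(decode_slice, slices))


slices = [[1, 2, 3], [10, 250, 5], [0, 0, 7]]
decoded = decode_frame(slices)
print(decoded)
```

Python threads won't give a real speedup here because of the GIL; the sketch only shows the structure. With this scheme the available parallelism is capped by the slice count, which is why I'm wondering whether they also do something within a slice.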

I'd love to hear from anyone who's profiled the compute shader implementation. I'm particularly interested in the occupancy vs. bandwidth tradeoff they've chosen for the entropy decoding stage.