There is absolutely no way an FPGA would make sense. The requirements for AV1 and H.265 far exceed the hardware resources of lower-budget FPGAs.
On the same process node, FPGA logic density is roughly 40x lower than an ASIC's, and lower-budget FPGAs use older processes anyway.
An H.265 or AV1 decoder requires millions of logic gates (and serious DRAM memory bandwidth). Only high-end FPGAs provide that.
There's mention that the decode could get a lot easier. Here's an H.264 core that runs on older Lattice chips and only takes 56K LUTs: https://www.latticesemi.com/products/designsoftwareandip/int... . Microchip's PolarFires have a soft H.264 core as well, taking under 20K. If AV2 will really be easier for hardware to implement, it might work out. Here's another example: H.264 decode on an Artix-7 that can do 1080p60: https://www.cast-inc.com/compression/avc-hevc-video-compress... . So with all due respect, what in the world are you talking about?
I didn't mention H.264 for a reason: it's a codec that was developed 25 years ago.
The complexity of video decoders has been going up exponentially, and AV2 is no exception. Throwing more tools (and thus resources) at it is the only way to increase the compression ratio.
Take AV1. It has CTBs that are 128x128 pixels. For intra prediction, you need to keep track of 256 neighboring pixels above the current CTB and 128 to the left. And you need to do this for all of Y, U, and V. For 4:2:0, that means you need to keep track of (256+128) + 2x(128+64) = 768 pixels. At 8 bits per component, that's 8x768 = 6144 flip-flops. And that's just neighboring-pixel tracking, which is only a tiny fraction of what you need to do: a few percent of the total resources.
These neighbor-tracking flip-flops feed a gigantic multiplexer, which is incredibly inefficient on FPGAs: it devours LUTs and routing resources.
A Lattice ECP5-85 has 85K LUTs. The FFs alone consume about 8% of the FPGA; the multiplexer probably another 20%, and that's conservative. You haven't even started to calculate anything and your FPGA is already almost 30% full.
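For concreteness, here's a sketch of the arithmetic above (the ECP5-85 LUT count is from the post; the 20% multiplexer figure is the rough estimate given there, not a synthesis result):

```python
# Neighbor-pixel tracking cost for a 128x128 AV1 CTB, 4:2:0, 8-bit.
luma = 2 * 128 + 128             # 256 above + 128 left = 384
chroma = 2 * (2 * 64 + 64)       # two 64x64 planes: 128 above + 64 left each
pixels = luma + chroma           # 768
ffs = pixels * 8                 # 8 bits per component -> 6144 flip-flops

# Rough ECP5-85 utilization (assuming roughly one FF per LUT site, ~85K).
ecp5_luts = 85_000
ff_pct = ffs / ecp5_luts * 100   # ~7%, i.e. the "8%" ballpark above
total_pct = ff_pct + 20          # plus the estimated multiplexer cost

print(pixels, ffs, round(ff_pct, 1))  # 768 6144 7.2
```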
FWIW, for H.264, the equivalent of that 128x128-pixel CTB is a 16x16-pixel macroblock. Instead of 768 neighboring pixels, you only need (32+16) + 2x(16+8) = 96. See the difference? AV2 retains the 128x128 CTB size of AV1, and if it adds something like the multiple reference lines (MRL) of H.266, the number of neighbors will more than double.
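The same back-of-the-envelope count for the H.264 macroblock (same 4:2:0 / 8-bit assumptions as the AV1 case above):

```python
# H.264: 16x16 macroblock; neighbors = 2*width above + height left, per plane.
h264 = (2 * 16 + 16) + 2 * (2 * 8 + 8)      # luma + two 8x8 chroma planes = 96
av1 = (2 * 128 + 128) + 2 * (2 * 64 + 64)   # 768, the AV1 figure from above

print(h264, av1 // h264)  # 96 8 -- AV1 tracks 8x as many neighbor pixels
```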
H.264 is child's play compared to later codecs. It has only a handful of angular prediction modes, barely any pre-angular filtering, no chroma-from-luma prediction, only a weak deblocking filter and no further in-loop filtering. It has only one DCT mode. The coding tree is trivial too. Its entropy decoder and syntax processing are low in complexity compared to later codecs. It doesn't have intra block copy. Etc., etc.
Working on a hardware video decoder is my day job. I know exactly what I'm talking about, and, with all due respect, you clearly do not.
It’s not possible on any but the largest $$$ FPGAs… and even then we often need to partition over multiple FPGAs to make it fit. And it will only run at a fraction of the target clock speed.
I think when they talk about AV2 being more hardware-friendly, they mean compared to AV1, not H.264.
Can a "lower budget" FPGA really outperform a consumer-grade CPU for this?
And what hobbyist is sending off decoding chips to be fabbed? If this exists, it sounds interesting, if incredibly impractical.