Comment by tverbeure
7 months ago
There is absolutely no way an FPGA would make sense. The requirements for AV1 and H.265 far exceed the hardware resources of lower-budget FPGAs. On the same process, FPGA logic density is about 40× lower than an ASIC's, and lower-budget FPGAs use older processes.
An H.265 or AV1 decoder requires millions of logic gates (and substantial DRAM bandwidth). Only high-end FPGAs provide that.
There's mention that decode could get a lot easier. Here's an H.264 core that runs on older Lattice chips and only takes 56K LUTs: https://www.latticesemi.com/products/designsoftwareandip/int... . Microchip's PolarFires have a soft H.264 core as well, taking under 20K. If AV2 will really be easier for hardware to implement, it might work out. Here's another example, H.264 decode on an Artix-7 that can do 1080p60: https://www.cast-inc.com/compression/avc-hevc-video-compress... . So with all due respect, what in the world are you talking about?
I didn't mention H.264 for a reason: it's a codec that was developed 25 years ago.
The complexity of video decoders has been going up exponentially, and AV2 is no exception. Throwing more tools (and thus resources) at the problem is the only way to keep increasing the compression ratio.
Take AV1. It has CTBs that are 128×128 pixels. For intra prediction, you need to keep track of 256 neighboring pixels above the current CTB and 128 to the left. And you need to do this for all of YUV. For 4:2:0, that means you need to keep track of 256+128 + 2×(128+64) = 768 pixels. At 8 bits per component, that's 8×768 = 6144 flip-flops. And that's just for neighboring-pixel tracking, which is only a tiny fraction, a few percent, of the total resources you need.
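The counting above is easy to sanity-check with a few lines of Python (a sketch; the helper name and its parameterization are mine, the CTB size and 4:2:0 subsampling come from the comment):

```python
# Neighboring pixels an intra predictor must track for one block:
# 2*width above plus height to the left, for luma and two 4:2:0 chroma planes.
def neighbor_pixels(size, chroma_shift=1):
    above_plus_left = lambda w: 2 * w + w
    luma = above_plus_left(size)                       # 256 + 128 = 384 for a 128x128 CTB
    chroma = 2 * above_plus_left(size >> chroma_shift) # 2 * (128 + 64) = 384
    return luma + chroma

pixels = neighbor_pixels(128)   # AV1 128x128 CTB -> 768 pixels
flops = 8 * pixels              # 8 bits per component -> 6144 flip-flops
```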
These neighbor-tracking flip-flops feed a gigantic multiplexer, which is incredibly inefficient on FPGAs: it devours LUTs and routing resources.
A Lattice ECP5-85 has 85K LUTs. The FFs alone consume 8% of the FPGA; the multiplexer, conservatively, another 20%. You haven't even started to calculate anything and your FPGA is already almost 30% full.
FWIW, for H.264, the equivalent of that 128×128 AV1 CTB is a 16×16 macroblock. Instead of 768 neighboring pixels, you only need 16+32 + 2×(8+16) = 96. See the difference? AV2 retains AV1's 128×128 CTB size, and if it adds something like H.266's MRL (multiple reference line intra prediction), the number of neighbors will more than double.
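Applying the same counting to an H.264 macroblock shows the gap in one line (a sketch; the function name is mine, the block sizes are from the thread):

```python
# 2*width above + height left, for luma plus two 4:2:0 chroma planes
def neighbor_pixels(size, chroma_shift=1):
    above_plus_left = lambda w: 2 * w + w
    return above_plus_left(size) + 2 * above_plus_left(size >> chroma_shift)

h264_mb = neighbor_pixels(16)    # 16+32 + 2*(8+16) = 96 neighbors
av1_ctb = neighbor_pixels(128)   # 768 neighbors
ratio = av1_ctb // h264_mb       # 8x more neighbor state per block
```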
H.264 is child's play compared to later codecs. It only has a handful of angular prediction modes, barely any reference filtering before angular prediction, no chroma-from-luma prediction, only a weak deblocking filter and no other in-loop filtering, and a single DCT mode. The coding tree is trivial too. Its entropy decoder and syntax processing are low in complexity compared to later codecs. It doesn't have intra-block copy. Etc., etc.
Working on a hardware video decoder is my day job. I know exactly what I'm talking about, and, with all due respect, you clearly do not.
Hmmm, so you're ignoring the crux of my argument because it's convenient for you: H.264 is comfortably small, AV1 is maybe too big, so something between them might work. Anything about why AV1 specifically won't fit is beside the point; they know that and are improving on it.
Your argument about the large number of flops is odd. You would only store data that way if you needed all of it in the same cycle. You say there's a multiplexer after it, but data storage plus a multiplexer is just a memory. You could use a BRAM or LUTRAM, which would cut that down dramatically, though that's a big "if" depending on later processing needs you haven't defined. And even then, that's AV1, which isn't AV2 and may change.
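The memory-vs-flops point is easy to put in numbers (a rough sketch; the 18 Kb figure is the ECP5's EBR block-RAM size, the 6144-bit count is from earlier in this thread, and whether the access pattern actually allows a RAM is exactly the undefined "if"):

```python
neighbor_bits = 6144                           # the flip-flop count cited above
ebr_bits = 18 * 1024                           # one ECP5 sysMEM EBR block (18 Kb)
ebrs_needed = -(-neighbor_bits // ebr_bits)    # ceiling division
# All of the neighbor state fits in a single block RAM,
# if later processing tolerates one (or a few) reads per cycle.
```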
I think when they talk about AV2 being more hardware-friendly, they mean compared to AV1, not H.264.
Yeah, so if H.264 fits comfortably and AV1 is maybe too big, then being better than AV1 could mean it's possible.