Comment by jonplackett
7 months ago
It’s pretty amazing people are still finding ways to make video smaller.
Is this just people being clever or is it also more processing power being thrown at the problem when decoding / encoding?
7 months ago
> It’s pretty amazing people are still finding ways to make video smaller.
> Is this just people being clever or is it also more processing power being thrown at the problem when decoding / encoding?
Both, and the format itself is changing to allow more cleverness and to apply more processing power.
For example, changes from one frame to the next are encoded in rectangular areas called "superblocks" (similar to a https://en.wikipedia.org/wiki/Macroblock). You can "move" the blocks (warp them), define their change in terms of other parts of the same frame (intra-frame prediction) or by referencing previous frames (inter-frame prediction), and so on... but you have to do it within a block, as that's the basic element of the encoding.
The more tightly you can fit blocks around the areas that are actually changing from frame to frame, the better. It also takes data to describe where these blocks are, so there are restrictions on how blocks can be defined, to minimise how many bits are needed to describe them.
AV2 lets you partition blocks more flexibly, which makes it easier to fit them around the areas of the frame that are changing. It has also doubled the size of the largest block, so if you have some really big movement on screen, it takes fewer blocks to encode it.
That's just one change; the headline improvement comes from the sum of many different changes, but this is an important one.
There is new cleverness in the encoders, but they need to be given the tools to express that cleverness -- new agreement about what types of transforms, predictions, etc. are allowed and can be encoded in the bitstream.
https://youtu.be/Se8E_SUlU3w?t=242
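To make the block idea concrete, here's a toy sketch of block-based inter prediction in Python. It is nothing like the real AV1/AV2 algorithms (the block size, search range, and SAD cost here are made up for illustration, and frame dimensions are assumed to be multiples of the block size), but it shows why sending per-block motion vectors plus residuals beats resending raw pixels:

    # Toy block-based inter prediction: for each block of the current frame,
    # find the shift into the previous frame that matches best, then store
    # only the motion vector plus the residual.
    import numpy as np

    BLOCK = 16   # toy block size; AV1 superblocks are 64x64 or 128x128
    SEARCH = 4   # +/- pixels to search; real encoders search much further

    def best_motion_vector(prev, cur, by, bx):
        """Exhaustive search for the shift minimising SAD for one block."""
        block = cur[by:by+BLOCK, bx:bx+BLOCK].astype(int)
        best, best_sad = (0, 0), float("inf")
        for dy in range(-SEARCH, SEARCH + 1):
            for dx in range(-SEARCH, SEARCH + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + BLOCK > prev.shape[0] or x + BLOCK > prev.shape[1]:
                    continue
                sad = np.abs(block - prev[y:y+BLOCK, x:x+BLOCK].astype(int)).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best

    def encode_frame(prev, cur):
        """Return a list of (block position, motion vector, residual)."""
        blocks = []
        for by in range(0, cur.shape[0], BLOCK):
            for bx in range(0, cur.shape[1], BLOCK):
                dy, dx = best_motion_vector(prev, cur, by, bx)
                pred = prev[by+dy:by+dy+BLOCK, bx+dx:bx+dx+BLOCK].astype(int)
                residual = cur[by:by+BLOCK, bx:bx+BLOCK].astype(int) - pred
                blocks.append(((by, bx), (dy, dx), residual))
        return blocks  # a real codec transforms, quantises and entropy-codes the residuals

    def decode_frame(prev, blocks):
        """Rebuild the frame from motion vectors and residuals alone."""
        rec = np.empty_like(prev)
        for (by, bx), (dy, dx), residual in blocks:
            pred = prev[by+dy:by+dy+BLOCK, bx+dx:bx+dx+BLOCK].astype(int)
            rec[by:by+BLOCK, bx:bx+BLOCK] = (pred + residual).clip(0, 255)
        return rec

In this picture, the flexible partitioning described above amounts to choosing BLOCK adaptively per region rather than using one fixed size everywhere.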
In general, with movement through scenes, rectangular update windows seem like a poor match.
Is there a reason codecs don't use the previous frame(s) as stored textures, and remap them onto the screen? I can move a camera through a room and a lot of the texture is just projectively transformed.
> I can move a camera through a room
That's what AV1 calls global motion and warped motion. Motion deltas (translation/rotation/scaling) can be applied to the whole frame, and blocks can be sheared vertically/horizontally as well as moved.
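As a very rough sketch of what a global motion model buys you (this is not AV1's actual parameterisation, just an illustrative rotation/scale/translation warp in Python), the whole previous frame becomes the prediction with only a handful of parameters:

    # Toy global-motion predictor: warp the previous frame with an affine
    # model (rotation+scale via [[a, -b], [b, a]] plus translation),
    # using nearest-neighbour sampling for simplicity.
    import numpy as np

    def affine_warp(frame, a, b, tx, ty):
        h, w = frame.shape
        ys, xs = np.mgrid[0:h, 0:w]
        # map each output pixel back to a source pixel in the previous frame
        src_x = (a * xs - b * ys + tx).round().astype(int).clip(0, w - 1)
        src_y = (b * xs + a * ys + ty).round().astype(int).clip(0, h - 1)
        return frame[src_y, src_x]

    # camera pans right 3px and zooms ~1%: four parameters instead of
    # thousands of per-block motion vectors
    prev = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    prediction = affine_warp(prev, a=1.01, b=0.0, tx=3.0, ty=0.0)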
> Is there a reason codecs don't use the previous frame(s) as stored textures, and remap them onto the screen? I can move a camera through a room and a lot of the texture is just projectively transformed.
I mean, that's more or less how it works already. But you still need a unit of granularity for the remapping. So the frame will store, e.g., "this block moves by this shift, that block by that shift", and so on.
I believe patents play a big role here as well. Anything new must be careful not to (accidentally) violate any active patent, so there might be some tricks that can't currently be used in AV1/AV2.
I think patents are quickly becoming less of a problem. A lot of the foundational encoding techniques have exited patent protection; H.264 and everything before it is essentially patent-free now.
It's true you could still accidentally violate a patent, but that minefield is clearing out, as the remaining patents have to be ever more esoteric in nature.
...till someone decides to patent one of the new techniques used
There are numerous patent trolls in this space, with active litigation against many of the participants in the consortium that brought us AV1. The EU was also threatening to investigate (likely to protect the royalty revenues of European companies).
It has always seemed very weird to me that compression algorithms were patentable.
1) it harms interoperability
2) I thought math wasn’t patentable?
A bit of both. Also, modern codecs have slightly different tradeoffs (image quality (PSNR, SSIM), computational complexity (CPU vs DSP vs memory), storage requirements, bit rate), so there isn't one that is best for every use case.
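Of the quality metrics mentioned, PSNR is simple enough to sketch in a few lines of Python (SSIM is considerably more involved); higher is better:

    # PSNR: log-scaled mean squared error against the reference image.
    import numpy as np

    def psnr(reference, reconstructed, max_val=255.0):
        mse = np.mean((reference.astype(float) - reconstructed.astype(float)) ** 2)
        if mse == 0:
            return float("inf")  # identical images
        return 10.0 * np.log10(max_val ** 2 / mse)

Values around 35-45 dB usually indicate decent lossy video quality, though PSNR famously correlates imperfectly with what people actually perceive, which is why SSIM and friends exist.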
I wonder when we will see generative AI codecs in production. The concept seems simple enough: the encoder knows the exact model the decoder will use to generate the final image starting from a handful of pixels, and it optimizes towards the lowest bitrate and minimum subjective quality loss, for example by letting the decoder generate a random human face in the crowd, or by giving it more data in that area to steer it towards the face of the team mascot, as the case may be.
At the absolute compression limit, it's no longer video, but a machine description of the scene conceptually equivalent to a textual script.
There was Nvidia's video upsampling, or whatever it's called. As far as I can remember, it put age spots on every face when the source was blurry, and it used too many resources.
And then that script gets processed on hundreds of GPUs in the cloud and the video gets streamed to the client. Wait.
New video codecs typically offer more options for how to represent the current frame in terms of other frames. That usually means more processing for the encoder, because it can check all the candidate similarities to see what works best; there's also harder math for arithmetic coding of the picture data. It will be more work for the decoder if it needs to keep more reference images, and especially if it needs to do harder transformations or if arithmetic decoding gets harder.
Cleverness matters a lot more for encoding. If you can determine good ways to figure out the motion information without trying them all, that gets you faster encoding. Decoding doesn't tend to have as much room for cleverness; the stream says to calculate the output from specific data, so you need to do that.
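A hand-wavy Python sketch of that encoder-side search, with made-up candidate modes and a crude bit estimate (real encoders use carefully tuned rate-distortion costs and prune aggressively instead of trying everything):

    # Rate-distortion mode decision: among the prediction modes the format
    # allows, pick the one minimising distortion + lambda * estimated bits.
    import numpy as np

    LAMBDA = 0.1  # distortion/bits trade-off; real encoders derive it from the quantiser

    def rd_cost(block, prediction, header_bits):
        diff = block.astype(int) - prediction.astype(int)
        distortion = int(np.sum(diff ** 2))
        rate = header_bits + int(np.count_nonzero(diff))  # crude proxy for residual bits
        return distortion + LAMBDA * rate

    def choose_mode(block, candidates):
        """candidates: iterable of (name, predicted_block, header_bits)."""
        return min(candidates, key=lambda c: rd_cost(block, c[1], c[2]))[0]

The decoder has no such choice to make: the chosen mode is written in the bitstream, and it just executes it.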
I don’t know the details of AV2, but going from h.265 to h.266, the number of angles for angular prediction doubled, they added a tool to predict chroma from luma, added the ability to do pixel block copies, and a bunch of other techniques… And that’s just for intra prediction. They also added tons of new inter prediction techniques.
All of this requires a significant amount of extra logic gates/silicon area for hardware decoders, but the bit rate reduction is worth it.
For CPU decoders, the additional computational load is not so bad.
The real additional cost is in encoding, because there are more prediction tools to choose from when searching for optimal compression. That’s why Google only does AV1 encoding for videos that are very popular: it doesn’t make sense to do it for videos that are seen by few.
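The chroma-from-luma tool mentioned above is a nice concrete example of a new prediction mode. This Python sketch shows the general cross-component idea, not the exact h.266/AV1 formulation: fit a linear model on already-reconstructed neighbouring pixels, then apply it inside the block so only the model-miss residual needs coding.

    # Cross-component prediction sketch: chroma ~= alpha * luma + beta,
    # with alpha/beta fitted on already-decoded neighbouring pixels, so
    # the decoder can derive the same model without extra signalling.
    import numpy as np

    def cfl_predict(luma_block, neighbor_luma, neighbor_chroma):
        alpha, beta = np.polyfit(neighbor_luma.ravel().astype(float),
                                 neighbor_chroma.ravel().astype(float), 1)
        return (alpha * luma_block.astype(float) + beta).clip(0, 255)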
IIRC Facebook did the selective encoding too, and it would predict which videos would become popular, so even the first streams would get the AV1 version.
It’s more money and more of the users’ compute being thrown at the problem to get the streaming service’s CDN bill down.
While funny, that's not really what I would call accurate. Users get reduced data consumption, potentially higher quality selection if the bandwidth now allows for a higher resolution to be streamed, and possibly lower disk usage should they decide to save videos for offline viewing.
Better codecs are an overall win for everyone involved.
> Better codecs are an overall win for everyone involved.
I don’t remember ever watching a movie in the last 10 years and wishing for a better codec.
> Users get reduced data consumption, potentially higher quality selection if the bandwidth now allows for a higher resolution to be streamed
They also get increased power usage, shorter battery life, higher energy bills, and potentially earlier device failures.
> Better codecs are an overall win for everyone involved.
Right.
Modern video codecs are what broke the telco monopoly on content and gave us streaming services in the first place. If the CDN bill is make or break, the service isn’t going to last.
And there’s no transfer of effort to the user. The compute complexity of video codecs is asymmetric: decoding is several orders of magnitude cheaper than encoding. And in every case, the principal barrier to codec adoption has been hardware acceleration. Pretty much every device on earth has a hardware-accelerated h264 decoder.
For those of us who back up media, this can be very appealing as well. I don’t disagree that what you said is a major driving force, but better formats have benefited me and my storage requirements multiple times in the past.
Soon we will have local AI processors that just make stuff up between scenes but adhere to a “close enough” guideline, where all narratively critical elements are maintained but other things (e.g. landscapes or trees) are generated locally. Movies will practically be long cutscenes with photorealistic graphics.
I'm sure models that replace characters in real time will also become popular. I can imagine some company thinking it would be cool if the main character looked slightly more like whatever main audience it's being shown to, with the swap done on their playback devices (so, of course, it can be customized or turned off).
I find the idea fun, kind of like using Snapchat filters on characters, but in practice I'm sure it'll be used to cut corners and prevent the actual creative vision from being shown, which saddens me.
At that point we aren’t even all watching the same movies. Which could be interesting, but very different. I mean, even stuff like talking with your friends about a movie you saw will change drastically. Maybe a service could be centered around sharing your movie prompts, so you have a shared movie experience to talk to your friends about.
Entertainment is becoming increasingly customizable and personalized. It’ll get to the point, like you said, that we’re not watching the same movie, playing the same game, etc.
It feels like we’re losing something, a shared experience, in favor of an increasingly narcissistic attitude that everything needs to be shapeable to individual preferences instead of accepting things as they are.