
Comment by perching_aix

2 months ago

I've had a lot of misconceptions that I had to contend with over the years myself as well. Maybe this thread is a good opportunity to air the biggest one of those. Additionally, I'll touch on subbing at the end, since the post specifically calls it out.

My biggest misconception, bar none, was around what a codec is exactly, and how well specified they are. I'd keep hearing downright mythical sounding claims, such as how different hardware and software encoders, and even decoders, produce different quality outputs.

This sounded absolutely mental to me. I thought that when someone said AVC / H.264, there was some specification somewhere that was then implemented, and that's it. I could not for the life of me even begin to fathom where differences in quality might seep in. Chief among these was when somebody claimed that using single-threaded encoding instead of multi-threaded encoding was superior. I legitimately considered that I was being messed with, or that the person I was talking to simply didn't know what they were talking about.

My initial thoughts on this were that okay, maybe there's a specification, and the various codec implementations just "creatively interpret" these. This made intuitive sense to me because "de jure" and "de facto" distinctions are immensely common in the real world, be it for laws, standards, what have you. So I'd start differentiating and going "okay so this is H.264 but <implementation name>". I was pretty happy with this, but eventually, something felt off enough to make me start digging again.

And then, not even a very long time ago, the mystery unraveled. What the various codec specifications actually describe, and what these codecs actually "are", is the on-disk bitstream format, and how to decode it. Just the decode. Never the encode. This applies to video, image, and sound formats; all lossy media formats. Except for telephony, all these codecs only ever specify the end result and how to decode that, but not the way to get there.

And so suddenly, the differences between implementations made sense. It isn't that they're flouting the standard: for the encoding step, there simply isn't one. The various codec implementations compete on finding the "best" way to compress information into the same cross-compatibly decodable bitstream. It is the individual encoder's responsibility to craft a so-called psychovisual or psychoacoustic model, and then build a compute-efficient encoder that gets you the most bang for the buck. This is how you get differences between different hardware and software encoders, and how you can get differences even between the single- and multi-threaded codepaths of the same encoder. Some of the approaches chosen might simply not work, or not work well, with multi-threading.
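A minimal sketch of this split, using a toy format that is nothing like any real codec: the entire "spec" below is the `spec_decode` function, and two made-up encoders produce different bitstreams that the same decoder happily accepts.

```python
# Toy illustration: the "spec" defines only the decoder. A bitstream is
# (keyframe_value, [deltas]); deltas are clamped to [-4, 4] to mimic a
# lossy, limited-precision format. Both encoders below are hypothetical.

def spec_decode(bitstream):
    """The only thing the 'spec' defines: bitstream in, frames out."""
    keyframe, deltas = bitstream
    frames = [keyframe]
    for d in deltas:
        frames.append(frames[-1] + d)
    return frames

def encode_greedy(frames):
    """One hypothetical encoder: always take the largest allowed step."""
    deltas, current = [], frames[0]
    for target in frames[1:]:
        d = max(-4, min(4, target - current))
        deltas.append(d)
        current += d
    return (frames[0], deltas)

def encode_lazy(frames):
    """Another hypothetical encoder: skip corrections it deems too small."""
    deltas, current = [], frames[0]
    for target in frames[1:]:
        d = target - current
        d = d if abs(d) >= 2 else 0      # ignore small errors
        d = max(-4, min(4, d))
        deltas.append(d)
        current += d
    return (frames[0], deltas)

source = [100, 101, 103, 110, 112, 111]
a = spec_decode(encode_greedy(source))   # [100, 101, 103, 107, 111, 111]
b = spec_decode(encode_lazy(source))     # [100, 100, 103, 107, 111, 111]
# Both bitstreams are valid input to the same decoder, yet they
# reconstruct differently: quality is decided entirely at encode time.
```

Both outputs are "legal" per the toy spec; which reconstruction is closer to the source is purely an encoder quality question, which is exactly where real encoders compete.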

One question that escaped me then was how, e.g., "HEVC / H.265" can be "more optimal" than "AVC / H.264" if all these standards define is the end result and how to decode it. The answer is actually kinda trivial: more features. Literally just more knobs to tweak. These of course introduce some overhead, so the question becomes: can you reliably beat this overhead to achieve parity, or gain efficiency? The OP claims this is not a foregone conclusion, but doesn't substantiate it. In my anecdotal experience, it is: parity or even an efficiency gain is pretty much guaranteed.

Finally, I mentioned differences between decoder output quality. That is a bit more boring. It is usually a matter of fault tolerance and, indeed, standards violations, such as supporting a 10-bit format in H.264 when the standard (supposedly, never checked) only specifies 8-bit. And of course, just basic incorrectness / bugs.

Regarding subbing then, unless you're burning in subs (called hard-subs), all this malarkey about encoding doesn't actually matter. The only thing you really need to know about is subtitle formats and media containers. OP's writing is not really for you.

I was a DVD programmer for 10 years. There was a defined DVD spec. The problem is that not every DVD device adhered to the spec. Specs contain words like shall/must and other words that can be misinterpreted, and then you have people who build an MVP as a product and don't worry about the more advanced portions of the spec.

As a specific example, the DVD software had a random feature that could be used. There was one brand of player that shipped with a preset list of random numbers, so every time you played a disc that used random, the sequence was exactly the same. This made designing DVD-Video games "interesting", as not all players behaved the same.

This was when I first became aware that just because there's a spec doesn't mean you can count on the spec being followed in the same way everywhere. As you mentioned, video decoders also play fast and loose with specs. That's why some players cannot decode 10-bit encodes, as that's an "advanced" feature. Some players could not decode all of the profiles/levels a codec could use according to the spec. Apple's QTPlayer could not decode the more advanced profiles/levels, just to show that it's not only "small" devs making limited decoders.

The issue is that encoding is an art, especially as it's lossy. You choose how much data to throw away (kind of like when you pick a quality in JPEG). Further, for video, you generally try to encode the differences between 2 frames. Again, because it's a lossy difference, it's up to the creator of the encoder to decide how to compute the difference. Different algorithms come up with different answers. The result still fits the spec.

Let's just say we were encoding a list of numbers. So we get a keyframe (an exact number), and then all frames after that until the next keyframe are just deltas: how much to add to the previous value.

    keyframe = 123
    nextFrame += 2   // result = 125
    nextFrame += 3   // result = 128
    nextFrame -= 1   // result = 127

etc... A different encoder might have different deltas. When it comes to video, those differences are likely relatively subtle, though some definitely look better than others.

The "spec" or "codec" only defines that each frame is encoded as a delta. It doesn't say what those deltas are or how they are computed, only how they are applied.

This is also why most video encoding software has quality settings, and those settings often mean that higher quality is slower. Some of those settings are about bitrate or bit depth or other things, but others are about how much time is spent looking for the perfect or better delta values to get closer to the original image, since searching for better matches takes time. Especially because it's lossy, there is no "correct" answer. There is just opinion.
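That "more time searching = better result" tradeoff can be sketched in a few lines of Python (a toy model, not any real encoder): deltas must come from a fixed codebook, a fast encoder picks them greedily frame by frame, and a slow encoder exhaustively searches for the sequence whose reconstruction lands closest to the source. The decoder is identical either way.

```python
from itertools import product

CODEBOOK = (-4, -2, 0, 2, 4)    # the only deltas this toy "format" can store

def decode(keyframe, deltas):
    """Same fixed decoder for every encoder: apply deltas in order."""
    frames = [keyframe]
    for d in deltas:
        frames.append(frames[-1] + d)
    return frames

def error(frames, source):
    """Sum of squared differences from the original."""
    return sum((f - s) ** 2 for f, s in zip(frames, source))

def encode_fast(source):
    """Greedy: pick the nearest code for each frame in isolation."""
    deltas, current = [], source[0]
    for target in source[1:]:
        d = min(CODEBOOK, key=lambda c: abs(target - (current + c)))
        deltas.append(d)
        current += d
    return deltas

def encode_slow(source):
    """Exhaustive: try every code sequence, keep the best overall."""
    n = len(source) - 1
    return min(product(CODEBOOK, repeat=n),
               key=lambda ds: error(decode(source[0], ds), source))

source = [50, 53, 59, 60, 58]
fast = encode_fast(source)          # [2, 4, 4, -2]
slow = list(encode_slow(source))
# Same bitstream format, same decoder; the slow encoder just spends more
# compute finding deltas that reconstruct closer to the source.
assert error(decode(50, slow), source) <= error(decode(50, fast), source)
```

Real encoders obviously use much smarter search than brute force, but the shape of the knob is the same: the "quality preset" largely controls how hard the encoder looks.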

> And then, not even a very long time ago, the mystery unraveled. What the various codec specifications actually describe, and what these codecs actually "are", is the on-disk bitstream format, and how to decode it. Just the decode. Never the encode.

Soooo with everyone getting used to creative names instead of descriptive names over the past decade or two, I guess "codec" just became a blob, and it never crosses people's minds that this is right there in the name: COding/DECoding. No ENCoding.

  • There's a term overload involved. In implementation terms, codec stands for coder/decoder, with "coder" referring exactly to an encoding capability: https://en.wikipedia.org/wiki/Codec

    So that's a swing and a miss, I'm afraid. But I'm very interested to hear what you think a "coder" library does in this context if not encode, and why it is juxtaposed with "decoder" if not for doing the exact opposite.

Thanks for bringing this up, since I'm realizing that I did not explicitly spell this out in the post. I'll add a paragraph making this even clearer.

what if I told you the same issue is true for lossless plain compression like .zip files

the compressor (encoder) decides exactly how to pack the data; the format doesn't mandate one particular packing, so you can do a better job at it or a worse one

which is why we have "better" zlib implementations which compress more tightly
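This is easy to demonstrate with Python's bundled `zlib` module: two compression levels (i.e. two amounts of encoder effort) produce different compressed bytes, and both round-trip to the identical original.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 200

fast = zlib.compress(data, level=1)   # quick, greedy matching
best = zlib.compress(data, level=9)   # slower, more exhaustive search

assert fast != best                   # the encoders made different choices
assert len(best) <= len(fast)         # more effort, tighter (or equal) output
assert zlib.decompress(fast) == zlib.decompress(best) == data
```

Both outputs are valid DEFLATE streams that any conforming decompressor accepts; the format only pins down how to unpack, not how to pack.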

  • Drives me crazy but I'm glad to learn of it :D

    Makes a lot of sense in retrospect, to the extent it bothers me I haven't figured it out myself earlier.

    • this is exactly what "higher" compression levels do (among other things like bigger dictionary) - they try harder, more iterations, to find the optimum combination of available knobs for a particular chunk of data.
