
Comment by londons_explore

18 hours ago

> have limited control over their encode pipeline.

Frustratingly, this seems common across video encoding technologies. The code is opaque, often relies on special kernel, GPU, and hardware interfaces that are closed source, and by the time you get to the user API (native or browser), all the knobs have been abstracted away; simple things like choosing which frame to use as a keyframe are impossible.

I had what I thought was a simple use case for a video codec: I needed to encode two 30-frame videos as small as possible, and I knew the first 15 frames were common between the videos, so I wouldn't need to encode those twice.

I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.

A 15-frame min and max GOP size would do the trick: you'd get two 15-frame GOPs. A GOP can be concatenated with another GOP that has the same properties (resolution, format, etc.) as if they were independent streams, so there is actually a way to do this. It's how video splitting and joining without re-encoding works: at GOP boundaries.
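As a sketch of the fixed-GOP approach above (assuming ffmpeg with libx264; the file names are hypothetical; `-g` and `-keyint_min` set the maximum and minimum keyframe interval, and `-sc_threshold 0` suppresses extra scene-cut keyframes):

```shell
# Encode the shared prefix and each suffix with rigid 15-frame GOPs.
ffmpeg -i prefix.mp4   -c:v libx264 -g 15 -keyint_min 15 -sc_threshold 0 prefix_enc.mp4
ffmpeg -i suffix_a.mp4 -c:v libx264 -g 15 -keyint_min 15 -sc_threshold 0 suffix_a_enc.mp4
ffmpeg -i suffix_b.mp4 -c:v libx264 -g 15 -keyint_min 15 -sc_threshold 0 suffix_b_enc.mp4

# Splice at the GOP boundary without re-encoding, via the concat demuxer.
printf "file 'prefix_enc.mp4'\nfile 'suffix_a_enc.mp4'\n" > list_a.txt
ffmpeg -f concat -i list_a.txt -c copy video_a.mp4
```

The shared prefix is encoded once and reused in both outputs; the trade-off, as noted below, is that each output now contains at least two GOPs instead of one.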

  • In my case, bandwidth really mattered, so I wanted all one GOP.

Ended up making a bunch of patches to libx264 to do it, but the compute cost of all that encoding on the CPU is crazy high. On the decode side (which runs on consumer devices), we just make the user decode the prefix many times.

I'm on a media engineering team and agree that applying the tech to a new use case often involves people with deep expertise spending a lot of time in the code.

I'd guess there are fewer media/codec engineers around today than there were web developers in 2006. In 2006, Gmail existed, but today's client- and server-side frameworks did not. It was a major bespoke lift to do many things which are "hello world" demos with a modern framework in 2025.

It'd be nice to have more flexible, orthogonal and adaptable interfaces to a lot of this tech, but I don't think the demand for it reaches critical mass.

> I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.

fork()? :-)

But most software, video codec or not, simply isn't written to serialize its state at arbitrary points. Why would it?
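For what it's worth, fork() can genuinely snapshot in-memory state on POSIX systems. A minimal sketch with a hypothetical toy "encoder" (a real encoder's state is vastly larger, and kernel, GPU, or hardware contexts would not survive the fork):

```python
import ast
import os

class ToyEncoder:
    """Hypothetical stand-in for a stateful encoder."""
    def __init__(self):
        self.state = 0

    def encode(self, frame):
        self.state += frame   # accumulate "reference" state, like prior frames
        return self.state

enc = ToyEncoder()
for f in range(15):           # encode the shared 15-frame prefix exactly once
    enc.encode(f)

r, w = os.pipe()
pid = os.fork()               # duplicate the process, encoder state and all
if pid == 0:                  # child: encode suffix B from the shared state
    os.close(r)
    suffix_b = [enc.encode(f) for f in range(100, 115)]
    os.write(w, repr(suffix_b).encode())
    os._exit(0)
else:                         # parent: encode suffix A from the same state
    os.close(w)
    suffix_a = [enc.encode(f) for f in range(200, 215)]
    suffix_b = ast.literal_eval(os.read(r, 65536).decode())
    os.waitpid(pid, 0)
```

Both branches resume from an identical post-prefix state, so the prefix is encoded only once.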

  • A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document-writing process!

    In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.

    However, this is not the case with video codecs - and it's just one of many examples of how the video codec landscape is limiting.

    Another example: on the internet, lots of videos have a 'poster frame' - often the first frame of the video. For nearly all use cases that frame ends up downloaded twice: once as a JPEG, and again inside the video content. There is no reasonable way to avoid that, yet doing so would reduce the latency to start playing videos by quite a lot!

    • > A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document-writing process!

      No, they generally can't save their whole internal state to be resumed later, and definitely not in the document you were editing. For example, when you save a document in vim it doesn't store the mode you were in, or the keyboard macro step that was executing, or the search buffer, or anything like that.

      > In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.

      Serializable in principle, maybe. Actually serializable in the sense that the code contains a way to dump to a file and back, absolutely not. It's extremely rare for programs to expose a way to save and restore from a mid-state in the algorithm they're implementing.

      > Another example: on the internet, lots of videos have a 'poster frame' - often the first frame of the video. For nearly all use cases that frame ends up downloaded twice: once as a JPEG, and again inside the video content.

      Actually, it's extremely common for a video thumbnail to contain extra edits, such as overlaid text and other graphics, that don't end up in the video itself. It's also very common for the thumbnail not to be the first frame of the video.


    • > A word processor can save its state at an arbitrary point...

      Because it saves the ENTIRE state as a self-contained snapshot. Video codecs instead operate on essentially a full frame plus a stream of differences. You might say that's similar to git, and you'd be incorrect again: with git you can take the current state and "go back" using diffs, but that is not the case for video; decoding always goes forward from the keyframe and resets at the next one.

      It's a fundamentally more complex problem to handle, by an order of magnitude.
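A toy model of that forward-only property (illustrative only: real codecs use motion-compensated prediction across reference frames, not scalar diffs):

```python
# A "GOP" here is one keyframe plus forward-only deltas. Unlike git, there
# are no reverse diffs: to show frame n, decoding restarts at the keyframe
# and rolls forward until it reaches n.
def decode_frame(gop, n):
    frame = gop["key"]                 # always restart from the keyframe
    for delta in gop["deltas"][:n]:
        frame += delta                 # apply differences strictly in order
    return frame

gop = {"key": 10, "deltas": [1, 2, 3]}
```

Seeking backward by one frame is as expensive as seeking forward from the keyframe, which is why random access is only cheap at GOP boundaries.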

I wonder if we could scan / test / dig up these hidden features somehow, like in a scraping / fuzzing fashion.