Comment by lewq

11 hours ago

Hi, author of the post here. Just fixed up some formatting issues from when we copied it into Substack, sorry about that. Yeah, I used Opus 4.5 to help me write it (and it actually made me laugh!). But the struggle was real. Something I didn't make clear enough in the post is that JPEG works because each screenshot is taken exactly when it's requested, whereas streaming video pushes frames at a fixed rate. The client driving the frame rate is exactly what stops frames from queuing. Yes, I wish we could use UDP in enterprise networks too, but we can't. The problem actually isn't opening the UDP port, it's hosting UDP on their Kubernetes cluster. "You want to what?? We have ingress. For HTTPS"
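
Roughly, the client loop is this (a simplified sketch, not our actual code; the endpoint name is made up):

```ts
// Client-driven screenshot loop: the next frame is only requested after the
// previous one has been decoded and painted, so frames can never queue up
// behind a slow link. The round trip itself is the pacing.
async function pullFrames(canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext("2d")!;
  for (;;) {
    const res = await fetch("/screenshot.jpg"); // server captures *now*
    const bitmap = await createImageBitmap(await res.blob());
    ctx.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
    bitmap.close();
  }
}
```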

Join our discord for private beta in January! https://discord.gg/VJftd844GE

(This post written by human)

Hi lewq, commenter on your post here.

Yeah, I used ChatGPT to help me write this answer ;) (Unlike JPEGs, it works at the right abstraction level for text.)

I think the core issue isn’t push vs pull or frame scheduling, but why you’re sending frames at all. Your use case reads much more like replicating textual/stateful UI than streaming video.

The fact that JPEG “works” because the client pulls frames on demand is kind of the tell — you’ve built a demand-driven protocol, then used it to fetch pixels. That avoids queuing, sure, but it’s also sidestepping video semantics you don’t actually need.

Most of what users care about here is text, cursor position, scroll state, and low interaction latency. JPEG succeeds not because it’s old and robust, but because it accidentally approximates an event-driven model.

Totally fair points about UDP + Kubernetes + enterprise ingress. But those same constraints apply just as well to structured state updates or terminal-style protocols over HTTPS — without dragging a framebuffer along.
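
To make that concrete: a structured state update over the same HTTPS/WebSocket plumbing could be as small as this (schema invented purely for illustration), a few dozen bytes per interaction instead of a re-encoded framebuffer:

```ts
// Hypothetical structured-state update: describe what changed, not pixels.
type UiUpdate =
  | { kind: "text"; region: string; content: string }
  | { kind: "cursor"; x: number; y: number }
  | { kind: "scroll"; region: string; offset: number };

function sendUpdate(ws: WebSocket, update: UiUpdate) {
  ws.send(JSON.stringify(update)); // rides the existing HTTPS ingress
}
```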

Pragmatic solution, real struggle — but it feels like a text/state problem being forced through a video abstraction, and JPEG is just the least bad escape hatch.

— a human (mostly)

Hey lewq, 40Mbps is an absolutely ridiculous bitrate. For context, Twitch maxes out around 8.5Mbps for 1440p60. Your encoder was poorly configured; that's it. Also, it sounds like your mostly static content would greatly benefit from VBR; you could get the bitrate down to 1Mbps or so for screen sharing.
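
For a sense of what that looks like in WebCodecs (numbers are illustrative, not tuned for your workload, and the endpoint is a stand-in):

```ts
// Illustrative WebCodecs config for mostly-static screen content.
// Variable bitrate lets quiet scenes drop well below the target.
const ws = new WebSocket("wss://example.invalid/stream"); // stand-in endpoint
const encoder = new VideoEncoder({
  output: (chunk) => {
    const buf = new ArrayBuffer(chunk.byteLength);
    chunk.copyTo(buf);
    ws.send(buf);
  },
  error: (e) => console.error(e),
});
encoder.configure({
  codec: "avc1.640028",    // H.264 High profile
  width: 1920,
  height: 1080,
  bitrate: 1_000_000,      // ~1Mbps target, not 40
  bitrateMode: "variable",
  latencyMode: "realtime",
});
```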

And yeah, the usual approach is to adapt your bitrate to network conditions, but it's also common to modify the frame rate. There's actually no requirement for a fixed frame rate with video codecs. You could even do the same "encode on demand" approach with a codec like H.264, provided you're okay with it being low FPS on high-RTT connections (poor Australians).
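
Something like this, continuing the sketch above (the ack message and capture helper are made up; it's the same pull model as your JPEGs, just with P-frames):

```ts
// "Encode on demand": only encode the next frame once the client has
// acknowledged the previous chunk. Same pacing as the JPEG pull loop,
// but deltas instead of full images.
declare function captureFrame(): VideoFrame; // assumed screen-capture helper

ws.onmessage = (msg) => {
  if (JSON.parse(msg.data).type === "ack") { // hypothetical ack message
    const frame = captureFrame();
    encoder.encode(frame);
    frame.close(); // the encoder keeps its own reference
  }
};
```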

Overall, using only keyframes is a very bad idea. It's how low-quality animated GIFs used to work before they were secretly replaced with video files. Video codecs are extremely efficient because of delta encoding.

But I totally agree with ditching WebRTC. WebSockets + WebCodecs is fine provided you have a plan for bufferbloat (e.g. adaptive bitrate (ABR), GoP skipping).

> Something I didn't make clear enough in the post is that jpeg works because each screenshot is taken exactly when it's requested. Whereas streaming video is pushing a certain frame rate. The client driving the frame rate is exactly what makes it not queue frames.

I understand that logic, but I don't really agree with it. Very aggressive bitrate controls can do a lot to keep that buffer tiny while still looking better than JPEG, and if it bloats beyond 1-2 seconds you can reset. A reset like that wouldn't look notably worse than JPEG mode always looks.

If you use a video encoder that gives you good insight into what it's doing, you could guarantee that the buffer never gets bigger than 1-2 JPEGs by dynamically deciding when to add frames. That would give you the huge benefits of P-frames with no downside.
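
A rough sketch of what I mean, using the WebSocket send buffer and the encoder queue as the signals (the byte cap is a made-up number):

```ts
// Only hand the encoder a new frame when the previous output has actually
// drained. Caps the buffer at roughly a couple of JPEG-sized frames.
const MAX_INFLIGHT_BYTES = 2 * 150_000; // ~two typical screenshot JPEGs (guess)

function maybeEncode(encoder: VideoEncoder, ws: WebSocket, frame: VideoFrame) {
  if (ws.bufferedAmount < MAX_INFLIGHT_BYTES && encoder.encodeQueueSize === 0) {
    encoder.encode(frame); // a P-frame: just the pixels that changed
  }
  // Otherwise drop this capture; the next one will carry the same changes.
  frame.close();
}
```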