Comment by andriamanitra
1 year ago
From reading the code it looks like it's just taking a screenshot (.jpg) and sending it once a second. Does doing it that way actually save on bandwidth compared to modern video compression (which reuses information from previous frames)?
I recorded a one minute video clip of me editing some code in VS Code (1440p 10fps, using AV1 encoding) and it was about half the size of 60 JPEG screenshots of the same screen. I would be curious to see your numbers if you've done any tests.
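For anyone who wants to reproduce the comparison, something along these lines works (not my exact commands, just a sketch starting from a raw screen recording):
# encode the recording with AV1
ffmpeg -i screen_recording.mkv -c:v libsvtav1 -preset 8 -crf 35 editing_av1.mp4
# extract one JPEG per second from the same recording
ffmpeg -i screen_recording.mkv -vf fps=1 -q:v 3 shot_%02d.jpg
# compare the sizes
du -h editing_av1.mp4
du -ch shot_*.jpg | tail -1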
Ten years ago I was experimenting with TimeSnapper Classic, a free utility for Windows that takes a screenshot every 5 seconds. The neat feature is it lets you view a timelapse of your day.
The screenshots were taking up a lot of disk space. I noticed that very little changed between pictures, so I started thinking of an algorithm that would make use of this characteristic: store only the changes between subsequent images. A few minutes in, I realized I was reinventing video compression!
So, I just used ffmpeg to turn the image sequence into an mp4. It was something like a 95% reduction in file size. (I think I used ImageMagick to embed the timestamp filename into the images themselves, thus recreating basically all the features of TimeSnapper Classic with 2 commands.)
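Roughly, the two commands looked something like this (a reconstruction from memory, not the exact invocations; assumes the screenshots are PNGs named by their timestamp):
# burn the timestamp (the filename without extension, %t) into the corner of every screenshot
mogrify -gravity SouthEast -pointsize 28 -fill yellow -undercolor black -annotate +10+10 '%t' *.png
# stitch the annotated screenshots into a timelapse at 12 frames per second
ffmpeg -framerate 12 -pattern_type glob -i '*.png' -c:v libx264 -pix_fmt yuv420p timelapse.mp4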
It's not only a matter of bandwidth, but a matter of CPU utilization. I've tried to feed screenshots to ffmpeg and other tools, and it's just... unusable. It works, but consumes way too many resources. At least on my computer (MacBook 13-inch, 2019).
So on one side you have CPU utilization, on the other, network. Network is cheap, but encoding is expensive. This is my thinking at least. I don't have proof, only local experiments, but it's a really good idea to start measuring this.
I also have other ideas in mind on how to scan the screen and send only the parts that have been updated. Probably if I send only half of the screen, it will beat the video encoding in terms of network. The diff algo should be very fast though, since (in the case of 1280x720) we're dealing with 1280x720 = 921,600 pixels, and at 4 bytes per pixel that's about 3.7 MB to process every second.
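Just to illustrate the idea, here's a throwaway sketch of a tile-based diff using ImageMagick (nothing like what the app would actually do; assumes prev.png and curr.png are consecutive 1280x720 captures):
# split both captures into a 4x4 grid of 320x180 tiles
convert prev.png -crop 320x180 +repage prev_tile_%02d.png
convert curr.png -crop 320x180 +repage curr_tile_%02d.png
# the AE metric is the count of differing pixels; non-zero means the tile changed
for i in $(seq -w 0 15); do
  if [ "$(compare -metric AE prev_tile_$i.png curr_tile_$i.png null: 2>&1)" != "0" ]; then
    echo "tile $i changed - would transmit curr_tile_$i.png"
  fi
done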
Also, curious to hear about video encoding efficiency vs 60x JPEG creation. Is it comparable?
> I've tried to feed screenshots to ffmpeg and other tools, and it's just... unusable. It works, but consumes way too much resources.
Did you try the hardware encoder? Modern computers have chips to accelerate/offload video encoding and decoding. Your 2019 Mac has an Intel GPU with H.264 and HEVC hardware encoders, and it also has a T2 co-processor that can encode HEVC video. If you don't pick a specific encoder (with the _videotoolbox suffix on Mac) via -c:v, then ffmpeg will default to a software encoder, which consumes CPU.
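You can check which hardware encoders your ffmpeg build exposes with something like:
ffmpeg -hide_banner -encoders | grep videotoolbox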
> how to scan the screen and send only parts of the screen that have been updated
You'll be reinventing video codecs with interframe compression.
> Also, curious to hear about video encoding efficiency vs 60x JPEG creation. Is it comparable?
I see that you are comparing pixel by pixel for each image to dedupe, and also resizing the image to 1280px. Also the image has to be encoded to JPEG. All of the above are done on the CPU. In essence you implemented Motion JPEG. Below is a command to let you evaluate a more efficient ffmpeg setup.
# -f avfoundation: screen/audio capture (specific to macOS)
# -an: no audio
# -c:v h264_videotoolbox: macOS H.264 hardware encoder
# -r 1: 1 fps
# -vf scale=1920:-1: scale to 1920px wide
# -b:v 2M: 2 Mbps bitrate for clear and legible text
ffmpeg \
-f avfoundation -i "<screen device index>:<audio device index>" \
-an \
-c:v h264_videotoolbox \
-r 1 \
-vf scale=1920:-1 \
-b:v 2M \
out.mp4
# You may want to set up an RTMP server that ffmpeg can transmit the encoded
# video stream to, so visitors can view the video.
Keep in mind most codecs can be tuned. Live encoding is a very different use case from encoding a video file you only need later. Most codecs have knobs you can turn to get lower latency and CPU use in exchange for somewhat larger file sizes.
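For example, with the software x264 encoder (purely illustrative, reusing the capture command from above; the exact knobs differ per encoder):
# ultrafast + zerolatency trade some bitrate for much lower CPU use and latency
ffmpeg \
-f avfoundation -i "<screen device index>:<audio device index>" \
-an \
-c:v libx264 -preset ultrafast -tune zerolatency \
-r 1 -vf scale=1920:-1 -b:v 2M \
out.mp4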
I co-built a similar screen sharing app (with a web server seeing traffic in the middle though) many years ago. MPEG isn't a good fit. We tried JPEG but it didn't look great. Version 1 decomposed the screen into regions, looked at which regions changed, and transmitted PNGs.
The second version tried to use the VNC approach developed years ago by AT&T and open sourced. The open source implementation glitched just a bit much for my liking. Various companies white labelled VNC in those days; not sure they fed back all their fixes. But the raw VNC protocol has a lot of good compression ideas specific to the screen sharing domain and documented in papers. People also tunneled VNC over SSH. I jerryrigged an HTTPS tunnel of sorts.
After a while I started to suspect that if I wanted higher frame rates I should use a more modern, screen-sharing-oriented, Microsoft-specific codec. But it wasn't my skill set so I never actually went down that route. But I'd recommend researching other screen-optimized lossless codecs, open or closed, so you don't reinvent the wheel on that side of things if you're serious about reducing bandwidth.
If you are serious about reducing bandwidth, why would you stick to lossless codecs?
This was my first thought too: there's no reason not to use a standard codec, just configured to run at 1 fps.
depends if you need low latency
Modern codecs do motion prediction and don't work well with low frame rates out of the box.
Seems like preventing data persistence (replace, delete) was chosen over minimizing bandwidth (no optimization).
But you could easily do both if you wanted to, though I'm not sure it's worth the hassle. I agree that this might struggle if used at scale on the same IP.
Not only that. JPEG works best on natural-looking images, with gradients, curves, continuous and wide color variation, etc. Computer screens very often show entirely different kinds of images, dominated by a few flat colors, small details (like text), and sharp edges. That is, exactly the kind of "high-frequency noise" JPEG is built to throw away.
JPEG produces either "smeared" screenshots or poorly compressed ones. PNG often works better.
A proper video codec mostly sends the small changes between frames (including shifts, like scrolling), plus relatively rare key frames. It can give both better visual quality and better bandwidth usage.
What's interesting in the "screenshot per second" solution is that it can be hacked together from common existing pieces, like ImageMagick, netcat, and bash; no need to install anything. (Imagine you've got privilege-limited access to a remote box, and maybe cannot even write to disk! Oh wait...)
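A minimal sketch of such a hack (assumes an X11 session on the remote box, a placeholder host receiver.example.com, and a listener like "nc -l 9000 > frames.bin" on the receiving end):
# grab the whole screen as a JPEG once a second and pipe it over netcat;
# purely illustrative: no framing or error handling between frames
while true; do
  import -window root -quality 70 jpg:-
  sleep 1
done | nc receiver.example.com 9000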
The problem with the JPEG vs. PNG debate for screenshots is that screenshots can contain anything from photos to text to UI elements to frames of video.
Just open any website and you'll see text right beside photos, or text against a photographic backdrop, often in the middle of being moved around with hardware-accelerated CSS animations.
I think we need an image container format that can use different compression algorithms for different regions or "layers" of the image, and an encoder that quickly detects how to slice up a screenshot into arbitrary layers. Both should be possible with modern tech. I just hope the resulting format isn't patent-encumbered.