Comment by IvanK_net
4 years ago
Fun fact: USB 2.0 webcams have existed for over 10 years. USB 2.0 is 60 MB/s.
A pixel of an image is 3 bytes. A 1920x1080 FullHD image is 6.2 MB. At 30 frames per second, one second of FullHD video is 186 MB. How did they do that?
Answer: frames are transferred as JPEG files. Even a cheap $15 webcam is a tiny computer (with a CPU, RAM, etc), which runs a JPEG encoder program.
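A quick Python check of that arithmetic (a sketch; decimal megabytes assumed):

    # Raw (uncompressed) FullHD RGB video bandwidth
    width, height, bytes_per_pixel, fps = 1920, 1080, 3, 30
    frame = width * height * bytes_per_pixel   # 6_220_800 bytes, ~6.2 MB
    per_second = frame * fps                   # 186_624_000 bytes, ~186 MB/s
    print(frame / 1e6, per_second / 1e6)       # 6.2208 186.624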
Most webcams, especially 10 years ago, were not 1080p, or even 60 fps. Many weren't even 720p. 1280 x 720 x 3 bytes x 30 fps = ~83 MB/s. 480p @ 30 fps = ~28 MB/s. That is how many webcams can get by without hardware JPEG/H.264 encoding.
4K @ 60 fps = ~1.4 GB/s. USB 3, even with two lanes, will have trouble with that.
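Extending the same arithmetic (a sketch; 480p taken as 640x480, 4K as 3840x2160):

    # Raw RGB bandwidth for common webcam modes, in decimal MB/s
    modes = [("480p", 640, 480, 30), ("720p", 1280, 720, 30),
             ("4K", 3840, 2160, 60)]
    for name, w, h, fps in modes:
        print(name, w * h * 3 * fps / 1e6, "MB/s")
    # 480p: 27.6 MB/s, 720p: 82.9 MB/s, 4K: 1493 MB/s (~1.4 GiB/s)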
The cheap ones are using hardware JPEG encoders. The associated micro isn't powerful enough to do it in firmware alone.
Surprised they don't use a hardware video encoder. Is it because the well-supported, efficient formats are all MPEG, and thus carry fairly high licensing costs on top of the hardware? Or because even efficient hardware video encoders use more resources than webcams can afford? Or because inter-frame coding requires more storage, which again means higher cost, which again eats into the margin, which cheap webcam manufacturers consider not worth the investment?
My older Logitech C920 has an on-board H.264 encoder. Newer revisions of the same model do not.
I haven't figured out why they chose to remove it, but your point about licensing costs, combined with them not advertising it much as a feature and most of their competitors not including "proper" video encoding, might explain it.
Edit: Found an official explanation here: https://www.logitech.com/en-us/video-collaboration/resources... TLDR, they figure most computers at that point had HW encoders.
MJPEG is just a very simple "video" format that needs very simple and cheap electronics to work. Video encoding blocks are mostly part of bigger SoCs and come with licensing costs.
The same goes, on the other hand, for the receiving end: decoding a stream of JPEGs is much simpler, in both CPU use and code complexity, than dealing with something like H.264.
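Illustratively (a naive sketch, assuming plain concatenated JPEGs; real UVC devices wrap MJPEG frames in payload headers): pulling frames out of such a stream is little more than scanning for the JPEG start/end markers.

    # Naive MJPEG splitter: scan a byte stream for JPEG SOI/EOI markers.
    # Assumes concatenated JPEGs with no embedded thumbnails.
    SOI, EOI = b"\xff\xd8", b"\xff\xd9"

    def split_mjpeg(data):
        pos = 0
        while True:
            start = data.find(SOI, pos)
            if start < 0:
                return
            end = data.find(EOI, start + 2)
            if end < 0:
                return
            yield data[start:end + 2]   # one complete JPEG frame
            pos = end + 2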
Hm. But then wouldn't it make more sense to just stream the raw sensor data, which is 1 byte per pixel (or up to 12 bits if you want to get fancy), and then demosaic it on the host? Full HD at 30 fps would be 59.33 MB/s, barely but still fitting into that limit.
But then also I think some webcams use H264? I remember reading that somewhere.
The pixel count doesn't generally refer to the density of the Bayer pattern, which can be even denser. Typically a cluster of four Bayer cells (RG/GB) makes up one pixel, but like most things in computing, the cognitive complexity is borderline fractal and this is a massive simplification.
> Full HD at 30 fps would be 59.33 MB/s, barely but still fitting into that limit.
It's not fitting into anything, I fear; best case, the effective bulk transfer rate of USB 2.0 is 53 MB/s.
60 MB/s is the signaling rate, but that doesn't account for framing or packet overhead.
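For reference, the back-of-the-envelope behind that 53 MB/s figure (assuming the USB 2.0 high-speed bulk maximum of 13 512-byte data packets per 125 µs microframe):

    # USB 2.0 high-speed bulk throughput ceiling
    per_microframe = 13 * 512            # 6656 bytes
    per_second = per_microframe * 8000   # 8000 microframes per second
    print(per_second / 1e6)              # 53.248 MB/s
    # vs. raw 1080p Bayer (1 byte/pixel) @ 30 fps:
    print(1920 * 1080 * 30 / 1e6)        # 62.2 MB/s (= 59.3 MiB/s) -- no fit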
It would need a funny driver, too. Since that stuff is big parallel image processing, it's easy in hardware, but if someone has a netbook or a cheap/old Celeron, doing the demosaic and color correction would peg their CPU.
> Full HD at 30 fps would be 59.33 MB/s, barely but still fitting into that limit.
That limit is overstated even as a theoretical max.
You could do raw 720p.
I don't know where you get "1 byte per pixel" from. At minimum, raw 4:2:0 video would be two bytes per pixel, and RGB would be three bytes per pixel with 8-bit color depth.
You're talking about processed color frames. The GP was suggesting that the camera stream the raw sensor data, which doesn't have individual color channels, just a monochrome grid with 10 or 12 bits of usable data per pixel. A Bayer filter[0] is placed in front of the sensor so that a given color of light falls on each cell. The USB host would be responsible for applying a demosaicing[1] algorithm to create the color channels from the raw sensor data (see the sketch after the references below).
If we take the AR0330 sensor used in the USB Camera C1[2] as an example, it has a native resolution of 2304H x 1296V and outputs 10 bits per native pixel after internal A-Law compression[3] for a total raw frame size of 3.56 MiB, assuming optimal packing. The corresponding image, demosaiced and downscaled to Full HD (1920x1080), in RGB with eight bits per channel would be 5.93 MiB.
[0] https://en.wikipedia.org/wiki/Bayer_filter
[1] https://en.wikipedia.org/wiki/Demosaicing
[2] https://www.kurokesu.com/shop/cameras/CAMUSB1
[3] https://www.onsemi.com/products/sensors/image-sensors/ar0330
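A rough illustration of host-side demosaicing (a minimal sketch, assuming an RGGB pattern and plain bilinear interpolation; real pipelines add white balance, color correction, etc.):

    # Minimal bilinear demosaic for an RGGB Bayer mosaic (NumPy/SciPy).
    import numpy as np
    from scipy.ndimage import convolve

    def demosaic_rggb(raw):
        """raw: (H, W) float mosaic -> (H, W, 3) RGB estimate."""
        h, w = raw.shape
        r = np.zeros((h, w)); g = np.zeros((h, w)); b = np.zeros((h, w))
        r[0::2, 0::2] = raw[0::2, 0::2]   # R sites
        g[0::2, 1::2] = raw[0::2, 1::2]   # G sites (two per 2x2 block)
        g[1::2, 0::2] = raw[1::2, 0::2]
        b[1::2, 1::2] = raw[1::2, 1::2]   # B sites
        k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
        k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
        return np.dstack([convolve(r, k_rb), convolve(g, k_g),
                          convolve(b, k_rb)])

    # Frame-size check from the comment above:
    # 2304 * 1296 * 10 / 8 = 3_732_480 bytes = 3.56 MiB (raw, packed)
    # 1920 * 1080 * 3      = 6_220_800 bytes = 5.93 MiB (RGB888)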
When talking about digital cameras, each "pixel" is a single color sensor. Blame marketing.
Also 4:2:0 is 6 values per 4 pixels. 1.5 bytes per pixel at 8-bit.
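Spelled out per 2x2 block of pixels (four Y samples plus one shared U and one shared V, at 8 bits each):

    # YUV 4:2:0: 6 bytes per 2x2 pixel block -> 1.5 bytes/pixel
    bytes_per_pixel = 6 / 4                 # 1.5
    frame_1080p = int(1920 * 1080 * 1.5)    # 3_110_400 bytes, ~2.97 MiB
    print(bytes_per_pixel, frame_1080p)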
It needs a uC with some special hardware anyway to do the demosaic; otherwise it would require special drivers, and the demosaic would peg some people's crappy laptop CPUs.
Also, raw YUV 4:2:0 is 1.5 bytes per pixel, so that's doing half of the "compression" work for you.