Comment by dzhang314
4 years ago
Hey everybody! I'm David, the creator of YouTubeDrive, and I never expected to see this old project pop up on HN. YouTubeDrive was created when I was a freshman in college with questionable programming abilities, absolutely no knowledge of coding theory, and way too much free time.
The encoding scheme that YouTubeDrive uses is brain-dead simple: pack three bits into each pixel of a sequence of 64x36 images (I only use RGB values 0 and 255, nothing in between), and then blow up these images by a factor of 20 to make a 1280x720 video. These 20x20 colored squares are big enough to reliably survive YouTube's compression algorithm (or at least they were in 2016 -- the algorithms have probably changed since). You really do need something around that size, because I discovered that YouTube's video compression would sometimes flip the average color of a 10x10 square from 0 to 255, or vice versa.
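The scheme as described can be sketched in a few lines of NumPy. This is a hypothetical reconstruction from the comment, not the actual Mathematica code; `encode_frame` and `decode_frame` are made-up names:

```python
import numpy as np

W, H, BLOCK = 64, 36, 20   # data resolution and upscale factor (64*20 x 36*20 = 1280x720)

def encode_frame(data: bytes) -> np.ndarray:
    """Pack 3 bits per data pixel (one per RGB channel, each 0 or 255)
    into a 64x36 image, then blow each pixel up to a 20x20 square."""
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    bits = np.resize(bits, W * H * 3)                  # pad/truncate to one frame
    small = bits.reshape(H, W, 3).astype(np.uint8) * 255
    return np.kron(small, np.ones((BLOCK, BLOCK, 1), dtype=np.uint8))  # -> 720x1280x3

def decode_frame(frame: np.ndarray) -> bytes:
    """Average each 20x20 square and threshold at mid-gray, so that
    compression noise within a square washes out."""
    small = frame.reshape(H, BLOCK, W, BLOCK, 3).mean(axis=(1, 3))
    return np.packbits((small > 127).astype(np.uint8).ravel()).tobytes()
```

The averaging-and-thresholding step on decode is what makes the big squares robust: YouTube's compression can mangle individual pixels freely as long as it doesn't push a whole square's average across the midpoint.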
Looking back now as a grad student, I realize that there are much cleverer approaches to this problem: a better encoding scheme (discrete Fourier/cosine/wavelet transforms) would let me pack bits in the frequency domain instead of the spatial domain, reducing the probability of bit-flip errors, and a good error-correcting code (Hamming, Reed-Solomon, etc.) would let me tolerate a few bit-flips here and there. In classic academic fashion, I'll leave it as an exercise to the reader to implement these extensions :)
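For anyone tempted by that exercise, the smallest classical starting point is a Hamming(7,4) code, which spends three parity bits per four data bits to correct any single bit-flip per codeword. This is a generic textbook sketch, nothing YouTubeDrive-specific:

```python
# Hamming(7,4): parity bits sit at (1-based) positions 1, 2, 4 of the
# codeword; data bits fill positions 3, 5, 6, 7.

def hamming74_encode(d):            # d = [d1, d2, d3, d4], each 0 or 1
    p1 = d[0] ^ d[1] ^ d[3]         # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]         # covers positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]         # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    c = list(c)
    # recompute the three parity checks; together they spell out the
    # (1-based) position of a flipped bit, or 0 if the word is clean
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1             # correct the single-bit error
    return [c[2], c[4], c[5], c[6]]
```

A production scheme would reach for Reed-Solomon instead, since video compression artifacts tend to come in bursts rather than as isolated single-bit flips.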
One more thing: the choice of Wolfram Mathematica as an implementation language was a deliberate decision on my part. Not for any technical reason -- YouTubeDrive doesn't use any of Mathematica's symbolic math capabilities -- but because I didn't want YouTubeDrive to be too easy for anybody on the internet to download and use, lest I attract unwanted attention from Google. In the eyes of my paranoid freshman self, the fact that YouTubeDrive is somewhat obtuse to install was a feature, not a bug.
So, feel free to have a look and have a laugh, but don't try to use YouTubeDrive for any serious purpose! This encoding scheme is so horrendously inefficient (on the order of 99% overhead) that the effective bandwidth to and from YouTube is something like one megabyte per minute.
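That throughput figure is consistent with the per-frame capacity, assuming a frame rate around 24 fps (the frame rate is a guess, not stated above):

```python
bytes_per_frame = 64 * 36 * 3 // 8        # 3 bits per colored square = 864 bytes
fps = 24                                   # assumed frame rate
print(bytes_per_frame * fps * 60 / 1e6)    # ~1.2 MB of payload per minute of video
```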
As far back as the late 1970s a surprisingly similar scheme was used to record digital audio to analog video tape. It mostly looks like kind of stripey static, but there was a clear correlation between what happened musically and what happened visually, so in college (late 1980s) one of my friends came into one of these and we'd keep it on the TV while listening to whole albums. We had a simultaneous epiphany about the encoding scheme during a Jethro Tull flute solo, when the static suddenly became just a few large squares.
You can see one in action here:
https://www.youtube.com/watch?v=TSpS_DiijxQ
Nice, thanks! This answered my biggest question, which was "will it survive compression/re-encoding?" (Yes, it will.) Very cool idea!
Do you have any idea how many more bits you'd be able to use if you applied any of the encoding transformations?
I'd estimate that there's an easy order-of-magnitude improvement (~10x) just from implementing a simple error-correction mechanism -- a Reed-Solomon code ought to be good enough that we can take the squares down to 10x10, maybe even 8x8 or 5x5. Then, if we really work at it, we might be able to find another order-of-magnitude win (~100x) by packing more bits into a frequency-domain encoding scheme. This would likely require us to do some statistical analysis on the types of compression artifacts that YouTube introduces, in order to find a particularly robust set of basis images.
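A quick sanity check on those numbers: raw capacity scales with the inverse square of the block size, so shrinking from 20x20 to 5x5 gives 16x the raw bits, and even after spending a chunk of that on Reed-Solomon parity you'd keep roughly the ~10x estimate (back-of-envelope only; sizes other than 20 are hypothetical):

```python
# raw capacity of a 1280x720 frame at 3 bits per colored square
for size in (20, 10, 8, 5):
    squares = (1280 // size) * (720 // size)
    print(f"{size:2}x{size:<2}: {squares:5} squares, {squares * 3 // 8:5} bytes/frame")
```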
> that we can take the squares down to 10x10, maybe even 8x8 or 5x5
16x16, 8x8, or 4x4 would be the way to go. You'd want each RGB block to map to a single H.264 macroblock.
Using sizes that aren't powers of two means individual blocks don't line up with macroblocks. Having a single macroblock represent 1, 4, or 16 RGB pixels would be ideal.
In fact, I bet modifying the original code to use a scaling factor of 16 instead of 20 would produce some significant improvements.
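A quick check of which square sizes both tile a 1280x720 frame exactly and pack evenly into 16x16 macroblocks supports the sizes suggested above (the alignment criterion here is a simplification of real H.264 behavior):

```python
# which square sizes tile 1280x720 exactly AND fit a whole number of
# squares into a 16x16 macroblock (or a whole number of macroblocks)?
for size in (20, 16, 10, 8, 5, 4):
    tiles = 1280 % size == 0 and 720 % size == 0
    aligned = 16 % size == 0 or size % 16 == 0
    print(f"{size:2}x{size:<2}: tiles frame = {tiles}, macroblock-aligned = {aligned}")
```

Note that the original 20x20 tiles the frame but straddles macroblock boundaries, and 10x10 and 5x5 have the same problem; 16, 8, and 4 satisfy both criteria.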
I'm not sure if your examples are sticking to RGB values of 0 or 255. If so, you might get a win by using HSL to pick your colors. If you change the lightness dramatically every frame, maybe colors won't bleed across frames. Then perhaps you can encode 2+ bits in hue and another 2+ in saturation, plus a minor win of 1+ bit in lightness (i.e. the first frame can be 0% or 25%, the next 75% or 100%). I'm not too familiar with video encodings, though, or how much this would interfere with the other transforms.
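That mapping could be sketched with the standard-library colorsys module. This is a made-up illustration of the idea: `symbol_to_rgb` and the specific hue/saturation/lightness levels are arbitrary choices, and whether these colors actually survive chroma subsampling is exactly the open question:

```python
import colorsys

def symbol_to_rgb(sym: int) -> tuple:
    """Map a 5-bit symbol to an RGB color: 2 bits of hue,
    2 bits of saturation, 1 bit of lightness."""
    hue = (sym & 0b11) / 4                        # 4 hue levels
    sat = 0.4 + ((sym >> 2) & 0b11) * 0.2         # 4 saturation levels
    lig = 0.25 if (sym >> 4) & 1 == 0 else 0.75   # 2 lightness levels
    r, g, b = colorsys.hls_to_rgb(hue, lig, sat)  # note the HLS argument order
    return round(r * 255), round(g * 255), round(b * 255)
```

Decoding would have to invert this by nearest-color matching, and the jump from 3 to 5 bits per square is where the claimed win would come from.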
YouTube's 1080p60 is already at a decimation ratio of about 200:1, and then you have to consider how efficient P and B frames are with motion/differences. If your data looks like noise, you're going to be completely screwed, since the P and B frames will absolutely destroy the quality.
There are a bunch of other things too, like YUV420p chroma subsampling and the TV colour range of 16-235, which leaves you only about 7.8 bits per sample.
If anything, you would want to encode your data in some way that abuses the P and B frames and the 16x16 macroblock size.
Coding theory for the data output at your end is only one side of the coin; the VP9 codec's stupidly good compression is a completely different game to wrangle.
And I kinda doubt you'll get much better than your estimate of ~1% efficiency for the original scheme.
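Checking those numbers (the ~12 Mbit/s delivered bitrate below is an assumption; actual YouTube bitrates vary by video and codec):

```python
import math

# studio-swing ("TV range") video: only codes 16..235 are legal,
# so each 8-bit sample carries log2(220) ~ 7.8 bits, not 8
levels = 235 - 16 + 1
print(levels, round(math.log2(levels), 2))

# raw 8-bit RGB 1080p60 vs an assumed ~12 Mbit/s delivered stream
raw_bps = 1920 * 1080 * 60 * 24
print(round(raw_bps / 12e6))   # compression on the order of 200:1
```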
https://www.youtube.com/watch?v=r6Rp-uo6HmI