Comment by sebastiennight

1 year ago

He's diffing the frames, so the only pixels that stay the same are the UI. The diff doesn't give him the UI directly (see the example, it's illegible), but it does let him extract the POSITION of the UI on the screen by finding all the non-red pixels.

And then he does a good ol' regular crop on the original image to get the UI excerpt to feed the vision model.
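
For anyone who wants to try the idea, here is a rough sketch of the three steps (diff, locate, crop) in plain Python. It is not his actual code: the function names, the tiny synthetic frames, and the use of a boolean mask in place of the red diff overlay are all my own assumptions for illustration.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]            # frame[y][x] -> (r, g, b)
Box = Tuple[int, int, int, int]      # (top, left, bottom, right), end-exclusive


def static_mask(frames: List[Frame]) -> List[List[bool]]:
    """True where a pixel is identical in every frame (the 'non-red' pixels)."""
    first = frames[0]
    return [
        [all(f[y][x] == first[y][x] for f in frames[1:])
         for x in range(len(first[0]))]
        for y in range(len(first))
    ]


def bounding_box(mask: List[List[bool]]) -> Optional[Box]:
    """Smallest rectangle containing every static pixel, or None if there are none."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(frame: Frame, box: Box) -> Frame:
    """Plain crop of the ORIGINAL frame, not of the diff."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


# Hypothetical 6x4 frames: the background changes, a 2x2 "UI" patch does not.
UI = (255, 255, 255)


def make_frame(background: Pixel) -> Frame:
    frame = [[background for _ in range(6)] for _ in range(4)]
    for y in range(0, 2):
        for x in range(4, 6):
            frame[y][x] = UI
    return frame


frames = [make_frame((10, 20, 30)), make_frame((200, 50, 90))]
box = bounding_box(static_mask(frames))
ui_crop = crop(frames[0], box)
print(box)  # (0, 4, 2, 6)
```

With real footage you would probably need a tolerance instead of exact equality, since video compression makes even "static" pixels wobble a bit between frames, and you'd want enough frames that the background changes everywhere at least once.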