Comment by meindnoch

21 days ago

1. Create a point cloud from a scene (either via lidar, or via photogrammetry from multiple images)

2. Replace each point of the point cloud with a fuzzy ellipsoid that has a bunch of parameters for its position + size + orientation + view-dependent color (via spherical harmonics up to some low order)

3. If you render these ellipsoids using a differentiable renderer, then you can subtract the resulting image from the ground truth (i.e. your original photos), and calculate the partial derivatives of the error with respect to each of the millions of ellipsoid parameters that you fed into the renderer.

4. Now you can run gradient descent using the differentiable renderer, which makes your fuzzy ellipsoids converge to something that closely reproduces the ground truth images (from multiple angles). A toy version of this loop is sketched below, after step 5.

5. Since the ellipsoids started at the 3D point cloud's positions, the 3D structure of the scene will likely be preserved during gradient descent, thus the resulting scene will support novel camera angles with plausible-looking results.
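To make steps 2-4 concrete, here is a minimal 2D toy of the same loop in PyTorch. Isotropic Gaussian blobs stand in for the oriented 3D ellipsoids, a random image stands in for a ground-truth photo, and the blobs start at random positions instead of at a point cloud. Every name here is illustrative; none of this comes from a real splatting codebase.

```python
import torch

# Toy 2D analogue of steps 2-4: each "splat" is an isotropic Gaussian
# blob with a learnable position, size, and color. The renderer is just
# a differentiable function, so autograd supplies every partial
# derivative for free.

H, W, N = 64, 64, 100

pos   = torch.rand(N, 2, requires_grad=True)        # blob centers in [0, 1]^2
log_s = torch.full((N,), -3.0, requires_grad=True)  # log of blob radius
color = torch.rand(N, 3, requires_grad=True)        # RGB per blob

ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                        torch.linspace(0, 1, W), indexing="ij")
grid = torch.stack([xs, ys], dim=-1)                # (H, W, 2) pixel coords

def render(pos, log_s, color):
    # Squared distance from every pixel to every blob center: (N, H, W).
    d2 = ((grid[None] - pos[:, None, None]) ** 2).sum(-1)
    sigma = torch.exp(log_s)[:, None, None]
    weight = torch.exp(-d2 / (2 * sigma ** 2))      # Gaussian falloff
    # Additive blending: accumulate each blob's weighted color per pixel.
    return torch.einsum("nhw,nc->hwc", weight, color)

target = torch.rand(H, W, 3)  # stand-in for one ground-truth photo

opt = torch.optim.Adam([pos, log_s, color], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = ((render(pos, log_s, color) - target) ** 2).mean()
    loss.backward()  # d(error)/d(parameter) for every blob at once
    opt.step()       # nudge the blobs to reduce the error
```

The real method adds orientation, view-dependent color via spherical harmonics, many camera views, and the point-cloud initialization from step 1, but the optimization loop has the same shape.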

You... you must have been quite some 5-year-old.

  • ELI5 has meant friendly simplified explanations (not responses aimed at literal five-year-olds) since forever, at least on the subreddit where the concept originated.

    Now, perhaps referring to differentiability isn't layperson-accessible, but this is HN after all. I found it to be the perfect degree of simplification personally.

  • Some things would be literally impossible to properly explain to a 5-year-old.

    • If one actually tried to explain this to a five-year-old, they could use analogy, simile, metaphor, and other rhetorical devices. This was just a straight-up technical explanation.

  • Lol. Def not for 5-year-olds, but it's about exactly what I needed.

    How about this:

    Take a lot of pictures of a scene from different angles, do some crazy math, and then you can later pretend to zoom and pan the camera around however you want

Thanks.

How hard is it to handle cases where the starting 3D positions of the ellipsoids are incorrect (too far off)? How common is such a scenario with the state of the art? E.g., with only a stereoscopic image pair, the correspondences are often inaccurate.

Thanks.

I assume that the differentiable renderer is only given the camera position and viewing angle at any one time (in order to be able to generalize to new viewing angles)?

Is it a fully connected NN?

  • No. There are no neural networks here. The renderer is just a function that takes a bunch of ellipsoid parameters and outputs a bunch of pixels. You render the scene, then subtract the ground truth pixels from the result, and sum the squared differences to get the total error. Then you ask the question "how would the error change if the X position of ellipsoid #1 was changed slightly?" (then repeat for all ellipsoid parameters, not just the X position, and all ellipsoids, not just ellipsoid #1). In other words, you compute the partial derivative of the error with respect to each ellipsoid parameter. This gives you a gradient that you can use to adjust the ellipsoids to decrease the error (i.e. get closer to the ground truth image). A numeric sketch of this procedure follows below.
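Here is a fully runnable toy version of that "nudge one parameter" question, with a made-up three-parameter renderer standing in for the real one. Real systems get the gradient via automatic differentiation rather than by re-rendering once per parameter, but the quantity computed is the same.

```python
# Hypothetical stand-in renderer: 3 parameters in, 4 "pixels" out.
def render(params):
    x, y, z = params
    return [x * y, y + z, x - z, x * z]

# Total error: sum of squared differences against the ground truth.
def error(params, target):
    return sum((p - t) ** 2 for p, t in zip(render(params), target))

# "How would the error change if parameter i was changed slightly?"
# asked once per parameter: a finite-difference gradient.
def gradient(params, target, eps=1e-6):
    base = error(params, target)
    grad = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += eps
        grad.append((error(nudged, target) - base) / eps)
    return grad

params = [1.0, 2.0, 3.0]       # initial guess
target = [2.0, 4.0, 0.0, 3.0]  # ground truth "pixels"
for _ in range(200):           # plain gradient descent
    g = gradient(params, target)
    params = [p - 0.01 * gi for p, gi in zip(params, g)]
```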