Comment by pleurotus

22 days ago

Super cool to read, but can someone eli5 what Gaussian splatting is (and/or radiance fields?), specifically in relation to how the article talks about it finally being "mature enough"? What's changed that this is now possible?

1. Create a point cloud from a scene (either via lidar, or via photogrammetry from multiple images)

2. Replace each point of the point cloud with a fuzzy ellipsoid that has a bunch of parameters for its position + size + orientation + view-dependent color (via spherical harmonics up to some low order)

3. If you render these ellipsoids using a differentiable renderer, then you can subtract the resulting image from the ground truth (i.e. your original photos), and calculate the partial derivatives of the error with respect to each of the millions of ellipsoid parameters that you fed into the renderer.

4. Now you can run gradient descent using the differentiable renderer, which makes your fuzzy ellipsoids converge to something closely reproducing the ground truth images (from multiple angles). A rough code sketch of this loop follows after the list.

5. Since the ellipsoids started at the 3D point cloud's positions, the 3D structure of the scene will likely be preserved during gradient descent, and thus the resulting scene will support novel camera angles with plausible-looking results.
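
For the curious, here is roughly what that optimization loop looks like in PyTorch-style pseudocode. The parameter shapes and the render / sample_training_view / initial_point_cloud helpers are placeholders of mine, not the actual reference implementation (which uses a custom GPU rasterizer), and real implementations also periodically split, clone, and prune splats during training, which I've left out:

    import torch

    def make_splats(point_cloud):                # point_cloud: (N, 3) tensor
        n = point_cloud.shape[0]
        return {
            "position": point_cloud.clone().requires_grad_(True),    # seeded from step 1
            "log_scale": torch.zeros(n, 3, requires_grad=True),      # ellipsoid size
            "rotation": torch.randn(n, 4, requires_grad=True),       # quaternion orientation
            "sh_coeffs": torch.zeros(n, 16, 3, requires_grad=True),  # view-dependent color
            "opacity": torch.zeros(n, 1, requires_grad=True),
        }

    splats = make_splats(initial_point_cloud)          # step 1: start at the point cloud
    opt = torch.optim.Adam(list(splats.values()), lr=1e-3)

    for step in range(30_000):
        camera, ground_truth = sample_training_view()  # one of the original photos
        rendered = render(splats, camera)              # step 3: differentiable render
        loss = (rendered - ground_truth).pow(2).mean() # squared error vs. the photo
        loss.backward()                                # d(error)/d(every parameter)
        opt.step()                                     # step 4: gradient descent
        opt.zero_grad()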

  • You... you must have been quite some 5 year old.

    • ELI5 has meant friendly simplified explanations (not responses aimed at literal five-year-olds) since forever, at least on the subreddit where the concept originated.

      Now, perhaps referring to differentiability isn't layperson-accessible, but this is HN after all. I found it to be the perfect degree of simplification personally.

    • Lol. Def not for 5 year olds but it's about exactly what I needed

      How about this:

      Take a lot of pictures of a scene from different angles, do some crazy math, and then you can later pretend to zoom and pan the camera around however you want


  • Thanks.

    How hard is it to handle cases where the starting positions of the ellipsoids in 3D are not correct (i.e., too far off)? How common is such a scenario with the state of the art? E.g., if you only have a stereoscopic image pair, the correspondences are often not accurate.

    Thanks.

  • I assume that the differentiable renderer is only given the camera's position and viewing angle at any one time (in order to be able to generalize to new viewing angles)?

    Is it a fully connected NN?

    • No. There are no neural networks here. The renderer is just a function that takes a bunch of ellipsoid parameters and outputs a bunch of pixels. You render the scene, then subtract the ground truth pixels from the result, and sum the squared differences to get the total error. Then you ask the question "how would the error change if the X position of ellipsoid #1 was changed slightly?" (then repeat for all ellipsoid parameters, not just the X position, and all ellipsoids, not just ellipsoid #1). In other words, compute the partial derivative of the error with respect to each ellipsoid parameter. This gives you a gradient that you can use to adjust the ellipsoids to decrease the error (i.e. get closer to the ground truth image).
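
      If it helps, here's a toy numpy version of that "nudge one parameter and see how the error changes" idea, using finite differences. A real differentiable renderer gets the same gradient analytically, which is the only way this stays tractable for millions of parameters; render(), params, camera, and ground_truth are placeholders of mine:

          import numpy as np

          def total_error(params, camera, ground_truth):
              # render() is a hypothetical stand-in: ellipsoid params -> pixels
              return np.sum((render(params, camera) - ground_truth) ** 2)

          def numerical_gradient(params, camera, ground_truth, eps=1e-4):
              base = total_error(params, camera, ground_truth)
              grad = np.zeros_like(params)
              for i in range(params.size):      # every parameter of every ellipsoid
                  nudged = params.copy()
                  nudged.flat[i] += eps         # "change X of ellipsoid #1 slightly"
                  grad.flat[i] = (total_error(nudged, camera, ground_truth) - base) / eps
              return grad

          params = params - 0.01 * numerical_gradient(params, camera, ground_truth)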

Gaussian splatting is a way to record 3-dimensional video. You capture a scene from many angles simultaneously and then combine all of those into a single representation. Ideally, that representation is good enough that you can then, post-production, simulate camera angles you didn't originally record.

For example, the camera orbits around the performers in this music video would be hard to achieve in real space. Even if you could pull it off using robotic motion control arms, it would require that the entire choreography be fixed in place before filming. This video clearly takes advantage of being able to direct whatever camera motion the artist wanted in the 3d virtual space of the final composed scene.

To do this, the representation needs to estimate the radiance field, i.e. the amount and color of light visible at every point in your 3d volume, viewed from every angle. It's not possible to do this at high resolution by breaking that space up into voxels; those scale badly, O(n^3). You could attempt to guess at some mesh geometry and paint textures onto it that are compatible with the camera views, but that's difficult to automate.
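
To put rough numbers on that scaling (made-up but order-of-magnitude-reasonable figures, assuming 4 bytes per value and ~48 values per voxel for view-dependent color):

    bytes_per_voxel = 48 * 4
    for n in (128, 512, 2048):
        gib = n ** 3 * bytes_per_voxel / 2 ** 30
        print(f"{n}^3 grid: {gib:,.1f} GiB")
    # 128^3 grid: 0.4 GiB
    # 512^3 grid: 24.0 GiB
    # 2048^3 grid: 1,536.0 GiB

Every doubling of resolution costs 8x the storage, and most of it goes to empty space.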

Gaussian splatting estimates these radiance fields by assuming that the radiance is built from millions of fuzzy, colored balls positioned, stretched, and rotated in space. These are the Gaussian splats.

Once you have that representation, constructing a novel camera angle is as simple as positioning and angling your virtual camera and then recording the colors and positions of all the splats that are visible.
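
Very roughly, and ignoring the elliptical Gaussian footprints, careful sorting, and view-dependent color that a real renderer handles, it looks something like this toy numpy sketch (splats is assumed to be a dict of arrays, and each splat lands on a single pixel through an assumed pinhole camera):

    import numpy as np

    def render_view(splats, world_to_cam, focal, width, height):
        image = np.zeros((height, width, 3))
        alpha_acc = np.zeros((height, width))
        # transform splat centers into camera space
        xyz = splats["xyz"] @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
        for i in np.argsort(xyz[:, 2]):            # composite front-to-back
            x, y, z = xyz[i]
            if z <= 0:
                continue                           # behind the camera
            u = int(focal * x / z + width / 2)     # pinhole projection
            v = int(focal * y / z + height / 2)
            if 0 <= u < width and 0 <= v < height:
                a = splats["opacity"][i] * (1 - alpha_acc[v, u])
                image[v, u] += a * splats["color"][i]
                alpha_acc[v, u] += a
        return image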

It turns out that this approach is pretty amenable to techniques similar to modern deep learning. You basically train the positions/shapes/rotations of the splats via gradient descent. It's mostly been explored in research labs, but lately production-oriented tools have been built for popular 3d motion graphics packages like Houdini, making it more widely available.

  • Thanks for the explanation! It makes a lot of sense that voxels would scale as badly as they do, especially if you want to increase resolution. Am I right in assuming that the reason this scales a lot better is that the Gaussian splats, once there's enough "resolution" of them, can provide the estimates for how light works reasonably well at most distances? What I'm getting at is: can I think of Gaussian splats vs voxels similarly to pixels vs vector graphics in images?

    • I think yes: with greater splat density (and, critically, more and better inputs to train on; others have stated that these performances were captured with 56 RealSense D455fs), splats will more accurately estimate light at more angles and distances. I think it's likely that during capture they had to make some choices about lighting and bake those in, so you might still run into issues matching lighting to your shots, but still.

      https://www.realsenseai.com/products/real-sense-depth-camera...

      That said, I don't think splats are to voxels as pixels are to vector graphics. Maybe the closer analogy is that pixels:vectors is like voxels:3d mesh modeling. You might imagine a sophisticated animated character being created and then animated using motion capture techniques.

      But notice where these things fall apart, too. SVG shines when it's not just estimating the true form, but literally is it (fonts, simplified graphics made from simple strokes). If you try to estimate a photo using SVG it tends to get messy. Similar problems arise when reconstructing a 3d mesh from real-world data.

      I agree that splats are a bit like pixels, though. They're samples of color and light in 3d space (2d for pixels). They represent the source more faithfully when they're more densely sampled.

      The difference is that a splat is sampled irregularly, just where it's needed within the scene. That makes it more efficient at representing most useful 3d scenes (i.e., ones where there are a few subjects and objects in mostly empty space). It just uses data where that data has an impact.

  • Are meshes not used instead of gaussian splats only for robustness reasons? I.e., if there were a piece of software that could reliably turn a colored point cloud into a textured mesh, would that be preferable?

    • Photogrammetry has been around for a long time now. It uses pretty much the same inputs to create meshes from a collection of images of a scene.

      It works well for what it does. But, it's mostly only effective for opaque, diffuse, solid surfaces. It can't handle transparency, reflection or "fuzz". Capturing material response is possible, but requires expensive setups.

      A scene like this poodle https://superspl.at/view?id=6d4b84d3 or this bee https://superspl.at/view?id=cf6ac78e would be pretty much impossible with photogrammetry and very difficult with manual, traditional, polygon workflows. Those are not videos. Spin them around.

    • It’s not only for robustness. Splats are volumetric and don’t have topology constraints, and both of those things are desirable. The volume capability is sometimes used for volume effects like fog and clouds, but it also gives splats a very graceful way to handle high frequency geometry - higher frequency detail than the capture resolution - that mesh photogrammetry can’t handle (hair, fur, grass, foliage, cloth, etc.). It depends entirely on the resolution of the capture, of course. I’m not saying meshes can’t be used to model hair or other fine details, they can obviously, but in practice you will never get a decent mesh out of, say, iPhone headshots, while splats will work and capture hair pretty well. There are hair-specific capture methods that are decent, but no general mesh capture methods that’ll do hair and foliage and helicopters and buildings.

      BTW I believe there is software that can turn point clouds into textured meshes reliably; multiple techniques even, depending on what your goals are.

It’s a point cloud where each point is a semitransparent blob that can have a view dependent color: color changes depending on direction you look at them. Allowing to capture reflections, iridescence…

You generate the point cloud from multiple images of a scene or an object, plus some machine learning magic.
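
As a small illustration of the view-dependent part: each blob stores a few spherical-harmonic coefficients per color channel, and the color you get back is a function of the viewing direction. A degree-1 sketch (the constants are the standard real SH basis values; the exact coefficient layout and sign conventions vary between implementations):

    import numpy as np

    C0 = 0.2820947917738781   # sqrt(1 / (4*pi))
    C1 = 0.4886025119029199   # sqrt(3 / (4*pi))

    def splat_color(sh, d):
        # sh: (4, 3) RGB coefficients for one blob, d: unit view direction
        rgb = C0 * sh[0] - C1 * d[1] * sh[1] + C1 * d[2] * sh[2] - C1 * d[0] * sh[3]
        return np.clip(rgb, 0.0, 1.0)

    sh = np.random.rand(4, 3) * 0.5
    print(splat_color(sh, np.array([0.0, 0.0, 1.0])))   # seen from +z
    print(splat_color(sh, np.array([1.0, 0.0, 0.0])))   # seen from +x: different color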

For the ELI5, Gaussian splatting represents the scene as millions of tiny, blurry colored blobs in 3D space and renders by quickly "splatting" them onto the screen, making it much faster than computing an image by querying a neural net model the way neural radiance fields (NeRFs) do.

I'm not up on how things have changed recently

I found this VFX breakdown of the recent Superman movie to have a great explanation of what it is and what it makes possible: https://youtu.be/eyAVWH61R8E?t=232

tl;dr eli5: Instead of capturing spots of color as they would appear to a camera, they capture spots of color and where they exist in the world. By combining multiple cameras doing this, you can make a 3D world from footage that you can then zoom a virtual camera around.