Comment by mlsu

20 hours ago

We are just very sharp when it comes to seeing small differences in images.

I'm reminded of when the Air Force set out to design a pilot seat that worked for everyone. They took the average body dimensions of all their recruits and designed a seat to fit that average. It turned out the seat fit none of them. [1]

I think AI image generation is a lot like this. When you train on all images, you end up in this weird sort of average space. AI images look like that, and we recognize it immediately. You can prompt or fine-tune image models to get away from it, though; the features are there, it's a matter of getting them out. Lots of people are trying stuff like this: https://www.reddit.com/r/StableDiffusion/comments/1euqwhr/re... and the results are nearly impossible to distinguish from real images.

[1] https://www.thestar.com/news/insight/when-u-s-air-force-disc...
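
For what it's worth, the fine-tuning route is easy to try with off-the-shelf tools. A minimal sketch using the diffusers library; the checkpoint name is a real public model, but the LoRA file is a hypothetical placeholder for whatever realism LoRA you're experimenting with:

    # Sketch: pulling a model away from its "average" look with a LoRA.
    # Assumes the diffusers library; the LoRA path is a hypothetical placeholder.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights("realism_lora.safetensors")  # hypothetical file

    image = pipe("candid photo of a barista pouring coffee").images[0]
    image.save("out.png")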

What determines which “average” AI models latch onto? At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that. At a slightly higher level, the average of every image is the average of every subject ever photographed or drawn (human, tree, house, plate of food, ...) in concept space; but AI still doesn't generate a human with branches or a house with spaghetti on it. At a still higher level there are things we recognize as sensible scenes, e.g., a barista pouring a cup of coffee, an anime scene of a guy fighting a robot, a watercolor of a boat on a lake, which AI still does not (by default) average into, say, an equal-parts watercolor/anime/photorealistic image of a barista fighting a robot on a boat while pouring a cup of coffee.
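
You can see the pixel-level version of this directly. A quick sketch, assuming torchvision is installed (CIFAR-10 downloads on first run):

    # Sketch: the pixel-wise "average image" over a real dataset.
    # It comes out as a soft grayish blur, not a picture of anything.
    import numpy as np
    from torchvision.datasets import CIFAR10

    ds = CIFAR10(root="./data", train=True, download=True)
    images = ds.data.astype(np.float64)          # (50000, 32, 32, 3)
    mean_image = images.mean(axis=0)             # the pixel-level "average"
    print(mean_image.mean(axis=(0, 1)) / 255.0)  # per-channel means, ~0.49/0.48/0.45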

But it is undeniable that AI images do have an “average” feel to them. What causes this? What is the space over which the model is taking an average to produce its output? One possible answer is that finite model capacity means the model can only resolve image space at limited resolution; as models get bigger and better they average over a smaller and smaller neighborhood of that space, but the neighborhood never shrinks to a point.

But that raises the question of why models don't just naturally land on a point in image space. Is this just a limitation of training, which punishes big failures more strongly than it rewards perfection? Or is there something else at play here that's preventing models from landing directly on a “real” image?
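
One toy way to see the "limitation of training" possibility: a squared-error objective on a two-mode target lands exactly between the modes, on a value the data never contains. A minimal sketch in plain PyTorch, a toy rather than a claim about any particular image model:

    # Sketch: MSE training pulls the prediction to the mean of the modes.
    import torch

    targets = torch.tensor([0.0, 1.0] * 500)    # two sharp modes, nothing in between
    pred = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([pred], lr=0.1)

    for _ in range(200):
        opt.zero_grad()
        loss = ((pred - targets) ** 2).mean()   # big misses cost quadratically more
        loss.backward()
        opt.step()

    print(pred.item())  # ~0.5: a point between the real modes, not a real sample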

  • > At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that.

    That isn't quite right: images in the real world aren't uniformly distributed over [0, 255] per channel. Take, for example, the famous ImageNet normalization magic numbers:

        from torchvision import transforms
        # Channel means/stds computed over the ImageNet training set
        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
    

    If images were actually uniformly distributed, the mean for each channel would be 0.5 and the standard deviation would be 0.289 (1/sqrt(12)). Also, because of z-normalization, the "image" most image models actually see is not how humans typically see images.
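
    Those uniform-distribution numbers are easy to verify:

        # For U(0, 1): mean = 0.5, std = 1/sqrt(12) ~ 0.289,
        # noticeably different from the ImageNet stats above.
        import numpy as np

        samples = np.random.default_rng(0).uniform(size=10_000_000)
        print(samples.mean(), samples.std())  # ~0.5, ~0.2887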

  • The model "averages" in the latent space. That is in the space of packed image representations. I put "averages" into scare quotes, because I think it might be due to legal reasons. The model training might be organized in such a way as to push its default style away from styles of prominent artists. I might be wrong though.

  • Isn't the space you're talking about the input images that are close to the textual prompt?

    These models are trained on image+text pairs, so if you prompt something like "an apple", you get a conceptual average of all training images containing apples. Depending on the dataset, that's likely to be a photograph of a single apple centered in the frame.
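
    The text side of those pairs is what pins down the region being averaged. A sketch of the conditioning signal using the open CLIP text encoder from transformers (the checkpoint name is the standard OpenAI one; the rest is illustrative):

        # Sketch: a prompt becomes per-token embeddings that condition the
        # image model, narrowing the "average" to images labeled like this.
        from transformers import CLIPTextModel, CLIPTokenizer

        tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

        tokens = tokenizer("an apple", return_tensors="pt")
        embeddings = encoder(**tokens).last_hidden_state
        print(embeddings.shape)  # (1, num_tokens, 512)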