Comment by Bjorkbat
21 hours ago
Something I find weird about AI image generation models is that even though they no longer produce the weird "artifacts" that give away the fact that an image was AI generated, you can still recognize that it's AI due to stylistic choices.
Not all of the examples they gave were like this. The example they gave of the word "Typography" would have fooled me as human-made. The infographics stood out, though. I would have immediately noticed that the String of Turtles infographic was AI generated because of the stylistic choices. Same for the guide on how to make chai. I would be "suspicious" of the weather forecast example but wouldn't immediately flag it as AI generated.
On a similar note, with earlier models I was able to tell something was AI generated right off the bat by noticing that it had a "Deviant Art" quality to it. My immediate guess is that certain sources of training data are over-represented.
We are just very sharp when it comes to seeing small differences in images.
I'm reminded of when the air force decided to create a pilot seat that worked for everyone. They took the average body dimensions of all their recruits and designed a seat to fit the average. It turned out, the seat fit none of their recruits. [1]
I think AI image generation is a lot like this. When you train on all images, you get to this weird sort of average space. AI images look like that, and we recognize it immediately. You can prompt or fine-tune image models to get away from this, though -- the features are there; it's just a matter of getting them out. Lots of people are trying stuff like this: https://www.reddit.com/r/StableDiffusion/comments/1euqwhr/re... -- the results are nearly impossible to distinguish from real images.
[1] https://www.thestar.com/news/insight/when-u-s-air-force-disc...
What determines which “average” AI models latch onto? At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that. At a slightly higher level, the average of every image is the average of every subject ever photographed or drawn (human, tree, house, plate of food, ...) in concept space; but AI still doesn't generate a human with branches or a house with spaghetti on it. At a still higher level there are things we recognize as sensible scenes, e.g., a barista pouring a cup of coffee, an anime scene of a guy fighting a robot, a watercolor of a boat on a lake, which AI still does not (by default) average into, say, an equal-parts watercolor/anime/photorealistic image of a barista fighting a robot on a boat while pouring a cup of coffee.
But it is undeniable that AI images do have an “average” feel to them. What causes this? What is the space over which AI is taking an average to produce its output? One possible answer is that a finite model size means that the model can only explore image space with a limited resolution, and as models get bigger/better they can average over a smaller and smaller portion of this space, but it is always limited.
But that raises the question of why models don't just naturally land on a point in image space. Is this just a limitation of training, which punishes big failures more strongly than it rewards perfection? Or is there something else at play here that's preventing models from landing directly on a “real” image?
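A toy sketch of that "punishes big failures more strongly than it rewards perfection" intuition: a single prediction trained with squared error lands on the mean of the plausible outputs rather than on any one of them (the numbers below are made up purely for illustration):

    import numpy as np

    # Pretend these are the "real" images consistent with one prompt,
    # each collapsed to a single number just for illustration.
    plausible_outputs = np.array([0.1, 0.2, 0.8, 0.9])

    # Squared-error loss for a single fixed prediction c.
    candidates = np.linspace(0.0, 1.0, 1001)
    losses = [np.mean((plausible_outputs - c) ** 2) for c in candidates]
    best = candidates[np.argmin(losses)]

    # The loss-minimizing prediction is the mean (~0.5), a value that matches
    # none of the actual outputs -- the "seat that fits nobody".
    print(best, plausible_outputs.mean())

Real models predict per-pixel or per-latent and condition on far more than the prompt, so they don't literally output the mean, but the pull in that direction is there whenever the conditioning underdetermines the image.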
> At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that.
That isn't correct since images in the real world aren't uniformly distributed from [0, 255] color-wise. Take, for example, the famous ImageNet normalization magic numbers:
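For reference, these are the usual torchvision constants, plus a minimal sketch of how they're applied (the random tensor is just a stand-in for a real image):

    import torch
    from torchvision import transforms

    # Standard ImageNet channel statistics (RGB), estimated from the training set.
    IMAGENET_MEAN = [0.485, 0.456, 0.406]
    IMAGENET_STD = [0.229, 0.224, 0.225]

    normalize = transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    x = torch.rand(3, 224, 224)   # an "image" with values in [0, 1]
    z = normalize(x)              # z = (x - mean) / std, which is what the model sees
    print(z.mean(dim=(1, 2)), z.std(dim=(1, 2)))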
If it were actually uniformly distributed, the mean for each channel would be 0.5 and the standard deviation would be 0.289. Also due to z-normalization, the "image" most image models see is not how humans typically see images.
The model "averages" in the latent space. That is in the space of packed image representations. I put "averages" into scare quotes, because I think it might be due to legal reasons. The model training might be organized in such a way as to push its default style away from styles of prominent artists. I might be wrong though.
Isn't the space you're talking about the input images that are close to the textual prompt?
These models are trained on image+text pairs. So if you prompt something like "an apple" you get a conceptual average of all images containing apples. Depending on your dataset, it's likely going to be a photograph of an apple in the center.
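A rough sketch of how that prompt conditioning enters at sampling time in most diffusion models, via classifier-free guidance (the function and variable names here are illustrative, not any particular library's API):

    import numpy as np

    def guided_noise_prediction(denoiser, x_t, t, prompt_emb, empty_emb, guidance_scale=7.5):
        # Noise prediction given the prompt vs. given an empty prompt.
        eps_cond = denoiser(x_t, t, prompt_emb)
        eps_uncond = denoiser(x_t, t, empty_emb)
        # Extrapolate away from the unconditional prediction, pulling the sample
        # toward the "conceptual average" of images whose captions match the prompt.
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Dummy denoiser just to show the call shape.
    denoiser = lambda x, t, emb: 0.1 * x + emb.mean()
    x_t = np.random.randn(4, 64, 64)
    eps = guided_noise_prediction(denoiser, x_t, t=10,
                                  prompt_emb=np.ones(768), empty_emb=np.zeros(768))
    print(eps.shape)

A higher guidance scale pushes the sample harder toward that prompt-conditioned average, which may be part of why heavily guided outputs share a similar over-polished look.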
Tragedy of the aggregate.
We can also pick up on hints of discordant production value. This is quite noticeable on websites such as Amazon/Alibaba/Etsy/Ebay/etc, where there are a lot of scam listings that use AI images for cheap or basic items.
So even though the image shown doesn't present obvious flaws, the fact that the image is high quality is the tell-tale sign that it's AI generated.
This also isn't something that can be easily fixed: even if we produce convincing low-production-value imagery using AI, the scam listing doesn't achieve its goal, because it looks like junky crap.
I think it's because they're all trained on the same data (everything they could possibly scrape from the open web). The models tend to learn some kind of distribution of what is most likely for a given prompt. It tends to produce things that are very average looking, very "likely", but as a result also predictable and unoriginal.
If you want something that looks original, you have to come up with a more original prompt. Or we have to find a way to train these models to sample less likely things from their distribution -- to find a way to mathematically describe what it means to be original.
A more original prompt won't fix things. Modern base models want to eliminate everything that puts their creators at risk, which is anything that is clearly made by someone else and more or less accurately reproducible. If you avoid any decent representation of an artist's style, or of anything/anyone that is likely to go to court, you won't get the chance of a creative synthesis either.
Do you know of any tools with a parameter that asks the model to be "weird" and increases the diversity of outputs?
If you want a chance at real creativity and flexibility, and you have a decent GPU, go local. Check out ComfyUI, download models, and play around. The mainstream services have zero knobs to play with; local is infinite.
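For a taste of the knobs a local setup exposes (here via the diffusers library rather than ComfyUI; the model id and settings below are just placeholder assumptions):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # placeholder model id; swap in whatever you run locally
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe(
        "a watercolor of a boat on a lake",
        negative_prompt="oversaturated, plastic skin",
        guidance_scale=4.0,           # lower = less pull toward the prompt's "average", more variety
        num_inference_steps=30,
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]
    image.save("boat.png")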
If you've ever had a Pinterest account and a DeviantArt account, it all becomes clear.
It still has artifacts more often than not; they are a lot subtler in nature, but they still come out, whether it's texture, proportion, lighting, or perspective. Some things are easier to fix on second-pass edits, some are not. I guess that's why they consider image editing to be the next challenge.
The problem is how they are fine-tuned with human feedback that isn't opinionated, so they produce an "average taste" that is very recognizable. Early models didn't have this issue, which is a paradox... lower quality / broken images, but often more interesting. Krea & Black Forest did a blog post about that some time ago.
Oh yeah, funny enough, even though I'm a bit of an AI art hater I actually thought very early Midjourney looked good because it all had an impressionistic, dreamy quality.
I wonder if we'll get to the point where we train different personalities into an image model that we can bring out in the prompt and these personalities have distinct art/picture styles they produce.
I don't think it's solely a data issue. Flux models, for example, are quite stylized, which is very noticeable with photorealism. But I think it was a deliberate choice to have outputs that are absent of likeness and distinct style. I think it's a side effect that this washes away fine details and makes outputs feel artificial. The problem is that closed models can't be fixed easily, while models like Flux or even older architectures can add back details and style with fine-tuning and LoRAs.
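For what it's worth, adding a style LoRA back onto an open model is usually a one-liner locally; a rough sketch with diffusers (the model id and file path are placeholders):

    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "black-forest-labs/FLUX.1-dev",           # placeholder model id
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    pipe.load_lora_weights("path/to/style_lora.safetensors")   # placeholder LoRA file

    image = pipe("portrait of a barista pouring coffee, film grain").images[0]
    image.save("portrait.png")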
It's a bit odd to say, but another big clue identifying something as AI-generated is that it simply looks "too good" for what it is being used for. If I see a little infographic demonstrating something relatively mundane, and it has nice 3D-rendered characters or graphical elements, at this point it's basically guaranteed to be AI, because you just sort of intuitively know when something would've justified the human labor necessary to produce it.
Funnily enough, that had crossed my mind with the woodchuck example, because at a glance I couldn't see any weird artifacts, but I felt confident I could tell it was AI generated immediately if I saw it in the wild, and I couldn't really explain why. My immediate guess was "well, who the hell would actually bother to make something like this?"
It's not odd to say. It was one of the first telling signs to identify AI artists[0] on Twitter: overly detailed backgrounds.
Of course now a lot of them have learned the lesson and it's much harder to tell.
[0]: I know, I know...
Maybe the AI feeling is an illusion because you already know it's AI-generated -- just confirmation bias, like wine tasting better after you learn it's expensive. In the real world, AI-generated images have passed the Turing test. Only with a double-blind test can you really be sure.