Comment by philipkglass
1 day ago
The differently styled images of "astronaut riding a horse" are great, but that has been a go-to example for image generation models for a while now. The introduction says that they train on 37 million real and synthetic images. Are astronauts riding horses now represented in the training data more than would have been possible 5 years ago?
If it's possible to get good, generalizable results from such (relatively) small data sets, I'd like to see what this approach can do if trained exclusively on non-synthetic permissively licensed inputs. It might be possible to make a good "free of any possible future legal challenges" image generator just from public domain content.
Yes.
Though I'm a bit confused why this became the go-to. If I remember correctly the claim was about it being "out of distribution", but I'm highly confident that astronauts riding horses were in training datasets prior to DALL-E. The big reason everyone should believe this is that astronauts have always been compared to cowboys. And... what do we stereotypically associate with cowboys?
The second reason is that it's the main poster image for the 2006 movie The Astronaut Farmer: https://en.wikipedia.org/wiki/The_Astronaut_Farmer
But here are some others I found that are timestamped. It's kinda hard to find random digital art with a timestamp. Looks like even Shutterstock doesn't date things... And places like DeviantArt don't have great search. Hell, even Google will just flat out ignore advanced search terms (what's even the point of having them?). Search results for the term are so littered now that this is difficult, but I found two relatively quickly.
2014: https://www.behance.net/gallery/18695387/Space-Cowboy#
2016: https://drawception.com/game/DZgKzhbrhq/badass-space-cowboy-...
But even if those samples did not exist, I do not think this is a significantly out-of-distribution image, if it's out of distribution at all. Does anyone doubt there are images of astronauts riding rockets? I think "astronaut riding a horse" certainly exists somewhere along the interpolation between "person riding a horse" and "astronaut riding <insert any term>". Mind you, generating samples that are in distribution but not in the training (or test) set is still a great feat and an impressive accomplishment, and it should in no way be underplayed! But that is different from claiming the result is out of distribution.
One minor point: the term "synthetically generated" is a bit ambiguous. It may include digital art; it does not necessarily mean generated by a machine learning model. TBH, I find the ambiguity frustrating, as there are some important distinctions here.
The original meme about the limitations of diffusion was the text to image prompt, “a horse riding an astronaut.”
It’s in all sorts of papers. This guy Gary Marcus used to be a big crank about AI limitations and was the “being wrong on the Internet” guy who got a lot of mainstream attention to the problem - https://garymarcus.substack.com/p/horse-rides-astronaut. Not sure how much we hear from him nowadays.
The astronaut-riding-a-horse thing comes from the fact that 10-1,000x more people are doing this stuff now, and they kind of process the whole zeitgeist from before their arrival through fuzzy glasses. The irony is that it was humans, not the generator, who got confused about the purposefully out-of-sample "horse riding an astronaut" prompt and changed it to "astronaut riding a horse".
I was under the impression that the astronaut riding a horse was in use prior to Marcus's tweet. Even that Substack post has him complaining about how Technology Review was acting as a PR firm for OpenAI, and that article shows an astronaut riding a horse. I mean, that image was in the announcement blog post.[0]
Certainly Marcus's tweets played a role in the popularity of the image, but I'm not convinced this is the single causal root.
[0] https://openai.com/index/dall-e-2/
This whole "horse riding an astronaut" thing was a bit dumb in the first place, because AFAIK CLIP (the text encoder used in first-generation diffusion models) doesn't really distinguish the two prompts. (So on that point Marcus was right: the tech employed was fundamentally unable to do what he asked of it.)
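A quick way to sanity-check this claim yourself (my own sketch, not anything from the announcement; it assumes the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint) is to compare the CLIP text embeddings of the two prompts directly. I'd expect the cosine similarity to come out very high:

  # Sketch: how similar are CLIP's text embeddings for the two prompts?
  # Assumes `pip install torch transformers` and the public OpenAI CLIP checkpoint.
  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  prompts = ["an astronaut riding a horse", "a horse riding an astronaut"]
  inputs = processor(text=prompts, return_tensors="pt", padding=True)

  with torch.no_grad():
      emb = model.get_text_features(**inputs)

  emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize each embedding
  print("cosine similarity:", (emb[0] @ emb[1]).item())

If the two prompts land on nearly identical embeddings, the diffusion model downstream has almost no signal about which noun is doing the riding, which is the point above.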
> The irony is it is the human, not the generator, that got confused about the purposefully out of sample horse riding an astronaut prompt, and changed it to astronaut riding a horse.
You're mixing things up: "astronaut riding a horse" was used by OpenAI in their DALL-E 2 announcement blog post; "horse riding an astronaut" only came after, and reached a much more niche audience anyway. So it's absolutely not an instance of "humans got caught by an out-of-sample instance and misremembered it".
It wasn't because it was "out of distribution" (although that's a reasonable assumption and it is at least _somewhat_ out of distribution, given the scarcity of your examples).
Like the avocado armchair before it, the real reason was simply that it "looked cool". It scratched some particular itch for people.
For me, indeed, it's correlated with "imagination". An avocado armchair output had a particular blending of concepts that matched (in my mind) the way humans blend concepts. With "astronaut riding a horse on the moon" you get a little of that, but you're also effectively addressing criticism of text-to-image models with a prompt that serves as an evaluation for several things:
1.) t2i is bad at people (astronaut)
2.) t2i struggles with animal legs (horse)
3.) t2i struggles with costumes, commonly putting the spacesuit on both the astronaut _and_ the horse - and mangling that in the process (and usually ruining any sense of good artistic aesthetics).
4.) t2i commonly gets confused with the moon specifically, frequently creating a moon _landscape_ but also doing something silly like putting "another" moon in the "night sky" as well.
There are probably other things. And of course this is subjective. But as someone who followed these things as they happened, which was I believe the release of DALL-E 2 and the first Stable Diffusion models, this is why I thought it was a good evaluation (at the time).
edit: I truly despise HN's comment formatting rules.
That isn't what "out of distribution" means. There could be ZERO such images and it still wouldn't mean the prompt is OOD. OOD means not within the underlying distribution, which is why I made the whole point about interpolation.
Is it scarce? Hard to tell. But I wouldn't make that assumption based on my examples. Search is poisoned now.
I think there are a lot of things that are assumed scarce which are not. There are entire datasets that are spoiled because people haven't bothered to check.