
Comment by Vachyas

17 hours ago

I'm honestly unsure what could be improved at this point.

Consistency? So it fails less often?

Based on the released images (especially the one "screenshot" of the Mac desktop), I feel like the best images from this model are so visually flawless that the only way to tell they're fake is by reasoning about the content of the image itself (e.g. "Apple never made a red iPhone 15, so this image is probably fake" or "Costco prices never end in .96, so this image is probably fake").

There is definitely room for improvement: https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a...

Especially when it comes to detailed outputs or non-standard prompts.

I do believe it will get even better. I'm not sure it will happen within a year, but I wouldn't be incredibly surprised if it did.

  • Yep. “Where’s Waldo” has been a classic challenge for generative models for a while because it requires understanding the entire concept (there’s only one Waldo), while also holding up to scrutiny when you examine any individual, ordinary figure.

    I experimented with procedurally generating Waldo-style scavenger-hunt images with Flux models, with rather disappointing (if unsurprising) results.

  • That's a good example, actually.

    If you asked me what I expected, given that this one has "thinking", it's that it would have thought to generate the image without Waldo first, then insert Waldo into that image as an "edit" (see the sketch after this list).

  • I wonder if at this point you could just ask the agent to iteratively refine the image in smaller portions.
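
The generate-then-edit idea above is easy to sketch. Here is a minimal version using the OpenAI Python SDK; the model name "gpt-image-1" and the assumption that responses carry base64 data are mine, not something verified in this thread:

  import base64
  from openai import OpenAI

  client = OpenAI()

  # Step 1: generate the crowd scene with no Waldo in it.
  base = client.images.generate(
      model="gpt-image-1",  # assumed model name
      prompt="A dense, Where's-Waldo-style crowd of ordinary people; "
             "do NOT include Waldo himself",
      size="1024x1024",
  )
  with open("scene.png", "wb") as f:
      f.write(base64.b64decode(base.data[0].b64_json))

  # Step 2: edit exactly one Waldo into the finished scene.
  edited = client.images.edit(
      model="gpt-image-1",
      image=open("scene.png", "rb"),
      prompt="Add exactly one Waldo (red-and-white striped shirt, "
             "bobble hat, glasses) somewhere unobtrusive; change nothing else",
  )
  with open("waldo.png", "wb") as f:
      f.write(base64.b64decode(edited.data[0].b64_json))

This guarantees "only one Waldo" by construction, at the cost of the edit step possibly clashing with the base image's style.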

I've been impressed when testing this model today, but it still can't consistently adhere to the following prompt: make me an image of a pizza split into 10 equal slices with space in between them, to help teach fractions to a child.

It doesn't reliably give you 10 slices, even if you ask it to number them. None of the frontier models seem to be able to get this right.
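
For contrast, the geometry that prompt asks for is deterministic and trivial: ten slices of 36 degrees each, with every wedge pushed outward along its angular bisector to create the gap. A matplotlib sketch (colors and sizes are just illustrative):

  import numpy as np
  import matplotlib.pyplot as plt
  from matplotlib.patches import Wedge

  n, radius, gap = 10, 1.0, 0.08  # ten equal slices, small explode gap

  fig, ax = plt.subplots(figsize=(6, 6))
  for i in range(n):
      theta1, theta2 = i * 360 / n, (i + 1) * 360 / n
      mid = np.deg2rad((theta1 + theta2) / 2)
      # Offset each wedge along its bisector so the slices don't touch.
      cx, cy = gap * np.cos(mid), gap * np.sin(mid)
      ax.add_patch(Wedge((cx, cy), radius, theta1, theta2,
                         facecolor="navajowhite", edgecolor="peru"))
      # Number the slices so a child can count them.
      ax.text(cx + 0.6 * radius * np.cos(mid),
              cy + 0.6 * radius * np.sin(mid),
              str(i + 1), ha="center", va="center")

  ax.set_xlim(-1.3, 1.3)
  ax.set_ylim(-1.3, 1.3)
  ax.set_aspect("equal")
  ax.axis("off")
  plt.savefig("pizza_tenths.png", dpi=150)

That a short loop nails what diffusion models keep botching is the point: exact counts and equal partitions are symbolic constraints, not texture.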

> I'm honestly unsure what could be improved at this point.

That's because you're focusing a little bit too much on visual fidelity. It's still relatively trivial to create a moderately complex prompt and have it fail miserably.

Even SOTA models only scored 12 out of 15 on my benchmarks, and that was without me deliberately trying to "flex" and break the model.

Here's one I just came up with:

  A Mercator projection of Earth where the land and oceans are inverted (i.e. land = ocean and ocean = land)
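
For what it's worth, the inversion itself is mechanically trivial given a binary land mask; what trips the models up is keeping the whole projection globally consistent. A PIL sketch, assuming a hypothetical land_mask.png with land in white and ocean in black:

  from PIL import Image, ImageOps

  # Hypothetical input: a Mercator-projected mask, land=white, ocean=black.
  mask = Image.open("land_mask.png").convert("L")

  # Swap the roles: former ocean becomes "land", former land becomes "ocean".
  inverted = ImageOps.invert(mask)

  # Colorize: black pixels (former land) -> ocean blue,
  # white pixels (former ocean) -> land green.
  result = ImageOps.colorize(inverted, black="#1b4f8a", white="#3a7d44")
  result.save("inverted_earth.png")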