Comment by what
12 hours ago
Shouldn’t every model get the same prompt? Seems a bit weird, especially when you can’t see the prompts that were used.
The goal isn’t the prompt itself. The test is whether a prompt can be phrased so that the model still realizes the author's intent, without the phrasing becoming unnatural.
Despite their variation, the prompts are all expressed in natural language.
The idea is that if you can rephrase the prompt and still get the desired outcome, the model demonstrates a kind of understanding. More rephrasing attempts are correspondingly penalized, but that is treated as a failure of steering, not of raw capability.
An example might help - take the Alexander the Great on a Hippity-Hop test case.
The starter prompt is this: "A historical oil painting of Alexander the Great riding a hippity-hop toy into battle."
If a model fails this a couple of times (across multiple seeds), we might use a synonym for hippity-hop: it was also known as a space hopper.
Still failing? We might try to describe the basic physical appearance of a hippity-hop.
Thus, something like GPT-Image-2 scored much higher on the compliance component of the test, requiring only a single attempt, whereas Z-Image Turbo required 14 attempts.
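To make the penalty concrete, here is a minimal sketch of how an attempt-penalized compliance score could work. The function name and the linear decay formula are my own assumptions for illustration, not the benchmark's actual scoring code:

```python
# Hypothetical sketch: compliance decays with the number of rephrasing
# attempts needed before the model produced the intended image.
# The linear formula and max_attempts cap are assumptions, not the
# benchmark's real implementation.

def compliance_score(attempts_needed: int, max_attempts: int = 14) -> float:
    """Return 1.0 for a first-attempt success, decaying linearly
    toward 0 as more rephrasings are required."""
    if attempts_needed < 1 or attempts_needed > max_attempts:
        return 0.0
    return 1.0 - (attempts_needed - 1) / max_attempts

# A first-attempt pass scores full marks; needing 14 attempts scores
# close to zero, even though the model did eventually comply.
print(compliance_score(1))
print(compliance_score(14))
```

Under a scheme like this, the two models in the example would both "pass" eventually, but the steering cost of 14 rephrasings is what separates their scores.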