← Back to context

Comment by harpiaharpyja

8 days ago

Not all models can actually do that if your prompt is particular

Most designers can't, either. Defining a spec is a skill.

It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.

  • … sure … but also no. For example, say I have an image. 3 people in it; there is a speech bubble above the person on the right that reads "I'A'T AY RO HERT YOU THE SAP!"¹

    I give it,

      Reposition the text bubble to be coming from the middle character.
    
      DO NOT modify the poses or features of the actual characters. 
    

    Now sure, specs are hard. Gemini removed the text bubble entirely. Whatever, let's just try again:

      Place a speech bubble on the image. The "tail" of the bubble should make it appear that the middle (red-headed) girl is talking. The speech bubble should read "Hide the vodka." Use a Comic Sans like font. DO NOT place the bubble on the right.
    
      DO NOT modify the characters in the image.
    

    There's only one red-head in the image; she's the middle character. We get a speech bubble, correctly positioned, but with a sans-serif, Arial-ish font, not Comic Sans. It reads "Hide the vokda" (sic). The facial expression of the middle character has changed.

    Yes, specs are hard. Defining a spec is hard. But Gemini struggles to follow the specification given. Whole sessions are like this, and absolute struggle to get basic directions followed.

    You can even see here that I & the author have started to learn the SHOUT AT IT rule. I suppose I should try more bulleted lists. Someone might learn, through experimentation "okay, the AI has these hidden idiosyncrasies that I can abuse to get what I want" but … that's not a good thing, that's just an undocumented API with a terrible UX.

    (¹because that is what the AI on a previous step generated. No, that's not what was asked for. I am astounded TFA generated an NYT logo for this reason.)

    • You're right, of course. These models have deficiencies in their understanding related to the sophistication of the text encoder and it's relationship to the underlying tokenizer.

      Which is exactly why the current discourse is about 'who does it best' (IMO, the flux series is top dog here. No one else currently strikes the proper balance between following style / composition / text rendering quite as well). That said, even flux is pretty tricky to prompt - it's really, really easy to step on your own toes here - for example, by giving conflicting(ish) prompts "The scene is shot from a high angle. We see the bottom of a passenger jet".

      Talking to designers has the same problem. "I want a nice, clean logo of a distressed dog head. It should be sharp with a gritty feel". For the person defining the spec, they actually do have a vision that fits each criteria in some way, but it's unclear which parts apply to what.

    • The NYT logo being rendered well makes sense because it's a logo, not a textual concept.

  • Yep, knowing how and what to ask is a skill.

    For anything, even back in the "classical" search days.

    • at least then, we had hard overrides that were actually hard.

      "This got searched verbatim, every time"

      W*ldcards were handy

      and so on...

      Now, you get a 'system prompt' which is a vague promise that no really this bit of text is special you can totally trust us (which inevitably dies, crushed under the weight of an extended context window).

      Unfortunately(?), I think this bug/feature has gotta be there. It's the price for the enormous flexibility. Frankly, I'd not be mad if we had less control - my guess is that in not too many years we're going to look back on RLHF and grimace at our draconian methods. Yeah, if you're only trying to build a "get the thing I intend done" machine I guess it's useful, but I think the real power in these models is in their propensity to expose you to new ideas and provide a tireless foil for all the half-baked concepts that would otherwise not get room to grow.