Comment by pksebben

3 months ago

You're right, of course. These models have deficiencies in their understanding related to the sophistication of the text encoder and its relationship to the underlying tokenizer.

Which is exactly why the current discourse is about 'who does it best'. (IMO, the Flux series is top dog here; no one else currently strikes the balance between style, composition, and text rendering quite as well.) That said, even Flux is pretty tricky to prompt. It's really, really easy to step on your own toes, for example by giving conflicting(ish) instructions like "The scene is shot from a high angle. We see the bottom of a passenger jet".

Talking to designers has the same problem. "I want a nice, clean logo of a distressed dog head. It should be sharp with a gritty feel". The person defining the spec actually does have a vision that fits each criterion in some way, but it's unclear which parts apply to what.
