Comment by spaceman_2020

3 days ago

Clearly Google is winning this by some margin

Seedream is also very good and makes me think the next version will challenge Google for SOTA image gen

Increasingly feels like image gen is a solved problem

I think the margin isn't that large, to be honest. Relative to Google's available resources and data, it is quite small, and arguably it should be larger.

Also, it doesn't feel solved to me at all. There is no general model, and perhaps one cannot reasonably exist. I think these benchmarks are smart, but they don't show the whole picture.

Domain-specific image generation tasks still require domain-specific models. For art purposes, SD 1.5 with specialized, finely tuned checkpoints will still provide the best results by far. It is also limited, but I think it has dampened the hype for new image generators significantly.

  • Does SD1.5 suffer from resolution / coherence / complexity issues?

    I understand checkpoints can be fine-tuned for most domains, but SD 1.5 still felt like it had a resolution ceiling and a complexity ceiling, no matter how good the fine-tuning was.

    • Yeah, SD 1.5 is mostly trained on 512x512 images. That's why you'd get crazy multi-limb goro abominations if you pushed checkpoints much above 768x768 without using either a Hires Fix or img2img.

      There's not much of a reason to use SD 1.5 over SDXL if image quality is paramount.

      A lot of people (myself included) use a pipeline that involves using Flux to get the basic action / image correct, then SDXL as a refiner and finally a decent NMKD-based upscaler.

    • Yes, the toolchains around it can alleviate this, but only to a degree. You're more or less dependent on a fine-tune trained specifically for the things you want. But if you have that, the image quality is usually far better than from any generic model, in my opinion, aside from resolution.

      Merging multiple concepts is mostly beyond it, but I haven't seen any model that's truly good at that yet. Some are significantly better, but they often come with other disadvantages.

      Overall, what these models can do is quite impressive. But if you want a really high-quality image, finding the right model is as difficult as finding the right prompt. And the general models tend to fall back to some mean "AI standard" image.
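The Hires Fix workflow mentioned above is essentially two passes: generate near the model's native 512px resolution first, then img2img-refine at the target size. A minimal sketch of the resolution planning involved (the function name and the multiple-of-64 rounding convention are my own illustration, not any tool's actual API):

```python
# Sketch of Hires Fix resolution planning for SD 1.5.
# Assumption: first pass stays near the 512px native resolution to avoid
# duplicated-limb artifacts; dimensions are rounded to multiples of 64,
# a common convention for latent-space models.

def hires_fix_plan(target_w, target_h, base=512, multiple=64):
    """Return (first_pass, second_pass) sizes.

    The first pass generates at roughly the model's native resolution
    while keeping the target aspect ratio; the second pass is an
    img2img re-denoise at the full target size.
    """
    aspect = target_w / target_h
    if aspect >= 1:
        w1 = base
        h1 = round(base / aspect / multiple) * multiple
    else:
        h1 = base
        w1 = round(base * aspect / multiple) * multiple
    return (w1, h1), (target_w, target_h)

first, second = hires_fix_plan(1024, 768)
# first is (512, 384): native-resolution pass, same 4:3 aspect ratio
# second is (1024, 768): the img2img upscale pass
```

The point of the first pass is that the model only ever composes the scene at a resolution it was trained on; the second pass adds detail without re-inventing the layout.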

Prompt understanding will only ever be as good as the language embeddings fed into the model. Google's hardware can host massive text encoders that will never run on your desktop GPU. By contrast, Flux and its kin have to make do with relatively small LLMs (Qwen Image uses a 7B-parameter LLM as its text encoder).
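To make the "embeddings fed into the model" point concrete: in SD-style models the text encoder's output conditions the image model through cross-attention, where latent image tokens (queries) attend over the text tokens (keys/values). A toy numpy sketch with illustrative shapes (the 77x768 text shape matches SD 1.5's CLIP text encoder; the projections here are random, standing in for learned weights):

```python
import numpy as np

def cross_attention(latents, text_emb, d=64, seed=0):
    """Toy cross-attention: image-latent tokens attend over text embeddings.

    Random matrices stand in for the learned Q/K/V projections; this only
    illustrates the shapes and data flow, not a real trained layer.
    """
    rng = np.random.default_rng(seed)
    wq = rng.standard_normal((latents.shape[-1], d))
    wk = rng.standard_normal((text_emb.shape[-1], d))
    wv = rng.standard_normal((text_emb.shape[-1], d))
    q, k, v = latents @ wq, text_emb @ wk, text_emb @ wv
    scores = q @ k.T / np.sqrt(d)                       # (latents, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over tokens
    return weights @ v                                  # one text-conditioned
                                                        # vector per latent

latents = np.random.default_rng(1).standard_normal((16, 320))  # 16 latent tokens
text_emb = np.random.default_rng(2).standard_normal((77, 768)) # CLIP: 77 x 768
out = cross_attention(latents, text_emb)
# out.shape is (16, 64)
```

This is why the encoder matters so much: everything the image model "knows" about the prompt arrives through `text_emb`. A 77x768 CLIP embedding simply carries less linguistic structure than the hidden states of a multi-billion-parameter LLM.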