Comment by cannoneyed
15 days ago
In my experience image models are very "thirsty" and can often learn the overall style of an image from far fewer examples than you'd expect. Even Qwen is a HUGE model, relatively speaking.
Interestingly enough, the model could NOT learn how to reliably generate trees or water, no matter how much data I threw at it or which strategies I tried...
This to me is the big failure mode of fine-tuning - it's practically impossible to predict what will work well, what won't, and why
I see, yeah, I can see how if it's matching some parts of the style 100% but failing completely on others, that's a huge pain to deal with. I wonder if a bigger model could close the loop here - like, have GPT 5.2 compare the fine-tune output and the Nano Banana output, notice that trees + water are bad, select more examples to fine-tune on, and then retry. Perhaps noticing that the trees and water are missing or bad is a more human judgement, though.
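Roughly the loop I'm imagining, sketched in Python; judge_weak_subjects, finetune, and both generate() calls are placeholders for whatever judge model and training stack you'd actually plug in:

```python
# Sketch of the judge-in-the-loop idea above. Everything here is a
# placeholder: judge_weak_subjects, finetune, and the .generate() calls
# stand in for whatever judge and training stack you'd wire up.
from dataclasses import dataclass, field


@dataclass
class Example:
    prompt: str
    image_path: str
    tags: set[str] = field(default_factory=set)  # e.g. {"trees", "water"}


def judge_weak_subjects(ours, reference, prompts) -> set[str]:
    """Show both image sets to a large multimodal model and ask which
    subjects the fine-tune renders badly. Placeholder only."""
    raise NotImplementedError


def finetune(model, examples):
    """One fine-tuning run on the given examples. Placeholder only."""
    raise NotImplementedError


def judge_loop(model, reference_model, pool, prompts, max_rounds=3):
    for _ in range(max_rounds):
        ours = [model.generate(p) for p in prompts]
        ref = [reference_model.generate(p) for p in prompts]
        weak = judge_weak_subjects(ours, ref, prompts)
        if not weak:  # judge can't find a weak subject; stop iterating
            return model
        # Oversample training examples featuring the weak subjects, retry.
        extra = [ex for ex in pool if ex.tags & weak]
        model = finetune(model, extra)
    return model
```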
Interestingly enough, even the big guns couldn't reliably act as judges. I think there are a few reasons for that:
- the way they represent image tokens isn't conducive to this kind of task
- text-to-image space is actually quite finicky; it's basically impossible to describe to a model what trees ought to look like and have it "get it"
- there's no reliable way to few-shot prompt these models for image tasks yet (!!), see the sketch below for what that would even look like
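For what it's worth, here's the shape a text-style few-shot prompt would take for an image task; the field names loosely mimic common chat-completions message formats, but treat it as pseudocode rather than any specific vendor's schema:

```python
# Interleaved example pairs, then the real query - the image-task analogue
# of text few-shot prompting. Field names loosely mimic common
# chat-completions APIs; this is pseudocode, not a real schema.
few_shot_messages = [
    {"role": "user", "content": [
        {"type": "text",  "text": "These trees are in the target style:"},
        {"type": "image", "path": "style_trees_1.png"},
        {"type": "text",  "text": "These trees are NOT:"},
        {"type": "image", "path": "off_style_trees_1.png"},
        {"type": "text",  "text": "Is this image's tree rendering on-style?"},
        {"type": "image", "path": "candidate.png"},
    ]},
]
```

Models will happily accept a structure like this, but nothing about it reliably steers their image judgements the way text few-shot steers text output.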