Comment by blintz
1 day ago
I was most surprised by the fact that it only took 40 examples for a Qwen finetune to match the style and quality of (interactively tuned) Nano Banana. Certainly the end result does not look like the stock output of open-source image generation models.
I wonder if, for almost any bulk inference / generation task, it will generally be dramatically cheaper to (use a fancy expensive model to generate examples, perhaps interactively with refinements) -> (fine-tune a smaller open-source model) -> (run the bulk task).
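Concretely, the pipeline I have in mind, as a rough Python sketch - generate_with_teacher and finetune_lora are placeholders I made up, not real APIs, and the model name is just illustrative:

    import json
    from pathlib import Path

    # Stubs, not a real API: swap in whatever expensive teacher model and
    # LoRA training script you actually use.
    def generate_with_teacher(prompt: str) -> str:
        raise NotImplementedError   # call the pricey model, save the image, return its path

    def finetune_lora(base_model: str, data: str) -> str:
        raise NotImplementedError   # run the trainer, return an adapter id/path

    prompts = Path("style_prompts.txt").read_text().splitlines()

    # 1. Pay the expensive model once per training example (~40 sufficed above).
    rows = [{"prompt": p, "image": generate_with_teacher(p)} for p in prompts[:40]]
    Path("train.jsonl").write_text("\n".join(json.dumps(r) for r in rows))

    # 2. Fine-tune the small open model on those pairs; LoRA keeps it cheap.
    adapter = finetune_lora(base_model="Qwen-Image", data="train.jsonl")

    # 3. The bulk task then runs entirely on the cheap fine-tuned model.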
In my experience image models are very "thirsty" and can often learn the overall style of an image from far fewer examples than you'd expect. Even Qwen is a HUGE model, relatively speaking.
Interestingly enough, the model could NOT learn how to reliably generate trees or water, no matter how much data I threw at it or what strategies I tried...
This, to me, is the big failure mode of fine-tuning - it's practically impossible to predict what will work well, what won't, and why.
I see, yeah - if it's matching some parts of the style nearly 100% but failing completely on other parts, I can see how that's a huge pain to deal with. I wonder if a bigger model could close the loop here - like, have GPT 5.2 compare the fine-tune output and the Nano Banana output, notice that trees + water are bad, select more examples to fine-tune on, and then retry. Perhaps noticing that the trees and water are missing or bad is a more human judgement, though.
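Sketching that loop, reusing my stubs from above - judge_compare and pick_examples are also made up, the 5-round budget is arbitrary, and the 0.7 score threshold is a guess:

    from pathlib import Path

    def finetune_lora(base_model: str, data: str) -> str:
        raise NotImplementedError   # same stub as in my sketch above

    def run_student(adapter: str, prompt: str) -> str:
        raise NotImplementedError   # render one image with the fine-tuned model

    def judge_compare(student_imgs, teacher_imgs) -> dict:
        raise NotImplementedError   # hypothetical judge: per-category scores,
                                    # e.g. {"trees": 0.3, "water": 0.4, "style": 0.9}

    def pick_examples(categories, n: int) -> str:
        raise NotImplementedError   # pull n more teacher examples for these
                                    # categories, return augmented train.jsonl path

    eval_prompts = Path("eval_prompts.txt").read_text().splitlines()
    teacher_imgs = [f"teacher/{i}.png" for i in range(len(eval_prompts))]

    adapter = finetune_lora(base_model="Qwen-Image", data="train.jsonl")
    for _ in range(5):                                  # arbitrary retry budget
        scores = judge_compare(
            [run_student(adapter, p) for p in eval_prompts], teacher_imgs)
        weak = [cat for cat, s in scores.items() if s < 0.7]
        if not weak:
            break                                       # judge is satisfied
        # Re-tune on extra teacher examples that exercise the weak categories.
        adapter = finetune_lora(base_model="Qwen-Image",
                                data=pick_examples(weak, n=20))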
Interestingly enough, even the big guns couldn't reliably act as judges here. I think there are a few reasons for that:
- the way they represent image tokens isn't conducive to this kind of task
- text-to-image space is actually quite finicky; it's basically impossible to describe to the model what trees ought to look like and have it "get it"
- there's no reliable way to few-shot prompt these models for image tasks yet (!!)
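On that last point, to make it concrete: the closest thing today is interleaving labeled example images in a single request, roughly like the below (standard OpenAI vision message format; gpt-4o is just a stand-in for whichever judge you try). The request itself goes through fine - the unreliable part is that the grades you get back barely track the examples you showed:

    import base64
    from openai import OpenAI

    def img(path: str) -> dict:
        # Inline a local image as a base64 data URL.
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",   # stand-in for whatever judge model you're testing
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Example of GOOD trees in the target style:"},
            img("good_trees.png"),
            {"type": "text", "text": "Example of BAD trees (mushy shapes, wrong palette):"},
            img("bad_trees.png"),
            {"type": "text", "text": "Grade the trees in this image from 1-10 against those examples:"},
            img("candidate.png"),
        ]}],
    )
    print(resp.choices[0].message.content)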