Comment by blintz
1 day ago
I was most surprised by the fact that it only took 40 examples for a Qwen finetune to match the style and quality of (interactively tuned) Nano Banana. Certainly the end result does not look like the stock output of open-source image generation models.
I wonder if, for almost any bulk inference / generation task, it will generally be dramatically cheaper to (use a fancy expensive model to generate examples, perhaps interactively with refinements) -> (fine-tune a smaller open-source model) -> (run the bulk task).
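Concretely, the pipeline I have in mind, as a rough Python sketch - generate_with_teacher and finetune_lora are placeholders I made up, not real APIs, and the model name is just illustrative:

    import json
    from pathlib import Path

    # Stubs, not a real API: swap in whatever expensive teacher model and
    # LoRA training script you actually use.
    def generate_with_teacher(prompt: str) -> str:
        raise NotImplementedError   # call the pricey model, save the image, return its path

    def finetune_lora(base_model: str, data: str) -> str:
        raise NotImplementedError   # run the trainer, return an adapter id/path

    prompts = Path("style_prompts.txt").read_text().splitlines()

    # 1. Pay the expensive model once per training example (~40 sufficed above).
    rows = [{"prompt": p, "image": generate_with_teacher(p)} for p in prompts[:40]]
    Path("train.jsonl").write_text("\n".join(json.dumps(r) for r in rows))

    # 2. Fine-tune the small open model on those pairs; LoRA keeps it cheap.
    adapter = finetune_lora(base_model="Qwen-Image", data="train.jsonl")

    # 3. The bulk task then runs entirely on the cheap fine-tuned model.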
In my experience image models are very "thirsty" and can often learn the overall style of an image from far fewer examples than you'd expect. Even Qwen is a HUGE model, relatively speaking.
Interestingly enough, the model could NOT learn how to reliably generate trees or water, no matter how much data I threw at it or what strategies I tried...
This, to me, is the big failure mode of fine-tuning - it's practically impossible to predict what will work well, what won't, and why.
I see, yeah - if it's matching some parts of the style nearly 100% but failing completely on other parts, I can see how that's a huge pain to deal with. I wonder if a bigger model could close the loop here - like, have GPT 5.2 compare the fine-tune output and the Nano Banana output, notice that trees + water are bad, select more examples to fine-tune on, and then retry. Perhaps noticing that the trees and water are missing or bad is a more human judgement, though.
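Sketching that loop, reusing my stubs from above - judge_compare and pick_examples are also made up, the 5-round budget is arbitrary, and the 0.7 score threshold is a guess:

    from pathlib import Path

    def finetune_lora(base_model: str, data: str) -> str:
        raise NotImplementedError   # same stub as in my sketch above

    def run_student(adapter: str, prompt: str) -> str:
        raise NotImplementedError   # render one image with the fine-tuned model

    def judge_compare(student_imgs, teacher_imgs) -> dict:
        raise NotImplementedError   # hypothetical judge: per-category scores,
                                    # e.g. {"trees": 0.3, "water": 0.4, "style": 0.9}

    def pick_examples(categories, n: int) -> str:
        raise NotImplementedError   # pull n more teacher examples for these
                                    # categories, return augmented train.jsonl path

    eval_prompts = Path("eval_prompts.txt").read_text().splitlines()
    teacher_imgs = [f"teacher/{i}.png" for i in range(len(eval_prompts))]

    adapter = finetune_lora(base_model="Qwen-Image", data="train.jsonl")
    for _ in range(5):                                  # arbitrary retry budget
        scores = judge_compare(
            [run_student(adapter, p) for p in eval_prompts], teacher_imgs)
        weak = [cat for cat, s in scores.items() if s < 0.7]
        if not weak:
            break                                       # judge is satisfied
        # Re-tune on extra teacher examples that exercise the weak categories.
        adapter = finetune_lora(base_model="Qwen-Image",
                                data=pick_examples(weak, n=20))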
Interestingly enough, even the big guns couldn't reliably act as judges here. I think there are a few reasons for that:
- the way they represent image tokens isn't conducive to this kind of task
- text-to-image space is actually quite finicky; it's basically impossible to describe to the model what trees ought to look like and have it "get it"
- there's no reliable way to few-shot prompt these models for image tasks yet (!!)
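On that last point, to make it concrete: the closest thing today is interleaving labeled example images in a single request, roughly like the below (standard OpenAI vision message format; gpt-4o is just a stand-in for whichever judge you try). The request itself goes through fine - the unreliable part is that the grades you get back barely track the examples you showed:

    import base64
    from openai import OpenAI

    def img(path: str) -> dict:
        # Inline a local image as a base64 data URL.
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",   # stand-in for whatever judge model you're testing
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Example of GOOD trees in the target style:"},
            img("good_trees.png"),
            {"type": "text", "text": "Example of BAD trees (mushy shapes, wrong palette):"},
            img("bad_trees.png"),
            {"type": "text", "text": "Grade the trees in this image from 1-10 against those examples:"},
            img("candidate.png"),
        ]}],
    )
    print(resp.choices[0].message.content)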