Yeah, there are still imperfections. But it's surprising to us how much the quality can be improved without needing a whole pre-training (re-)run.
Is it possible (or do people already do this), to train a classifier to identify the AI look and use it as an adversary to try and maximise both 'quality' and 'not that sort of quality'?
I actually tried a few experiments like this in the early exploration stage! I trained a small classifier to judge AI vs non-AI images and used it as a reward model for small RL / post-training experiments. Sadly, it was not too successful. We found that directly finetuning the model on high-quality photorealistic images was most reliable.
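For concreteness, here's a minimal numpy sketch of the classifier-as-reward idea. This is my own toy illustration, not the actual pipeline: the "image features" are synthetic Gaussian clusters, and a real setup would run a classifier over embeddings of actual generated vs. real photos.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for image embeddings: the "AI look" vs real photos as two
# Gaussian clusters in feature space. Purely illustrative; a real setup
# would use CLIP/DINO-style features of actual images.
ai_feats = rng.normal(loc=+1.0, scale=1.0, size=(500, 16))
real_feats = rng.normal(loc=-1.0, scale=1.0, size=(500, 16))

X = np.vstack([ai_feats, real_feats])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = "AI look"

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Train a tiny logistic-regression classifier with full-batch gradient descent.
w, b = np.zeros(16), 0.0
for _ in range(300):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def reward(feats):
    # Reward = probability the classifier thinks an image does NOT have the
    # AI look; this scalar is what an RL / post-training loop would maximise.
    return 1.0 - sigmoid(feats @ w + b)
```

The catch (per the comment above) is that once the generator is optimised against a frozen classifier like this, it tends to find whatever shortcut fools the classifier rather than actually looking less "AI", which is why direct finetuning on good data won out.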
Another note about preference optimisation and RL: it has a really high quality ceiling but needs to be very carefully tuned. It's easy to get perfect anatomy and structure if you decide to completely "collapse" the model. For instance, ChatGPT images are collapsed toward a slight yellow colour palette, and FLUX images always have that glossy, plastic texture with an overly blurry background. It's similar to the reward-hacking behaviour you see in LLMs, where they sound overly nice and chatty.
I had to make a few compromises to balance between "stable, collapsed, boring model" and "unstable, diverse, explorative" model.
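To make that tradeoff concrete, here's a toy numpy sketch (again my own illustration, nothing from the actual training setup): a policy over five discrete "styles", a reward model that over-prefers one of them, and a KL penalty against the reference model whose weight `beta` decides how far the policy may drift. Weak KL collapses onto the reward-hacking style; strong KL keeps the policy diverse but forgoes reward.

```python
import numpy as np

# Five discrete "styles"; the reward model strongly prefers style 0
# (think: the glossy look that always scores well on anatomy checks).
reward = np.array([1.0, 0.6, 0.55, 0.5, 0.45])
ref_logits = np.zeros(5)  # reference model: uniform over styles

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def optimise(beta, steps=2000, lr=0.1):
    # Gradient ascent on E[reward] - beta * KL(policy || reference).
    # (A closed form exists; the loop just mirrors what RL finetuning does.)
    logits = np.zeros(5)
    ref = softmax(ref_logits)
    for _ in range(steps):
        p = softmax(logits)
        adv = reward - beta * np.log(np.clip(p, 1e-12, 1.0) / ref)
        logits += lr * p * (adv - p @ adv)  # policy gradient of the objective
    return softmax(logits)

collapsed = optimise(beta=0.01)  # weak KL: nearly all mass on style 0
diverse = optimise(beta=1.0)     # strong KL: stays close to uniform
```

Picking `beta` (and its equivalents in preference-optimisation setups) is essentially the "stable and boring" vs "unstable and explorative" dial described above.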