
Comment by minimaxir

4 days ago

Text encoder is Mistral-Small-3.2-24B-Instruct-2506 (which is multimodal) as opposed to the weird choice to use CLIP and T5 in the original FLUX, so that's a good start albeit kinda big for a model intended to be open weight. BFL likely should have held off the release until their Apache 2.0 distilled model was released in order to better differentiate from Nano Banana/Nano Banana Pro.

The pricing structure on the Pro variant is...weird:

> Input: We charge $0.015 for each megapixel on the input (i.e. reference images for editing)

> Output: The first megapixel is charged $0.03 and then each subsequent MP will be charged $0.015
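To make the quoted pricing concrete, here is a minimal sketch of the cost formula it implies (rates are from the quote above; the handling of fractional megapixels is an assumption, since the quote doesn't specify rounding):

```python
def flux2_pro_cost(input_mp: float, output_mp: float) -> float:
    """Estimate cost in USD from the quoted FLUX.2 Pro rates.

    input_mp:  total megapixels across reference images ($0.015/MP)
    output_mp: output megapixels ($0.03 for the first MP,
               $0.015 for each MP after that)
    """
    input_cost = 0.015 * input_mp
    if output_mp <= 0:
        output_cost = 0.0
    elif output_mp <= 1:
        output_cost = 0.03
    else:
        output_cost = 0.03 + 0.015 * (output_mp - 1)
    return input_cost + output_cost

# One 1 MP reference image plus a 1 MP output: 0.015 + 0.03 = $0.045
cost = flux2_pro_cost(input_mp=1, output_mp=1)
```

So an edit with a single reference image roughly triples the per-image floor relative to output-only generation at 1 MP.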

> BFL likely should have held off the release until their Apache 2.0 distilled model was released in order to better differentiate from Nano Banana/Nano Banana Pro.

Qwen-Image-Edit-2511 is going to be released next week. And it will be Apache 2.0 licensed. I suspect that was one of the factors in the decision to release FLUX.2 this week.

> as opposed to the weird choice to use CLIP and T5 in the original FLUX

This method was used in tons of image generation models. Not saying it's superior or even a good idea, but it definitely wasn't "weird".

  • Considering how little (and sometimes negative) benefit it provided in most of them compared to just using the biggest encoder and feeding a null prompt to the rest (not just models using the specific combination Flux.1 did, but most multi-encoder models), it's actually pretty weird that people kept doing it.

> as opposed to the weird choice to use CLIP and T5 in the original FLUX

Yeah, CLIP here was essentially useless. You can even completely zero the weights through which the CLIP input is ingested by the model and it barely changes anything.
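As a toy illustration of that kind of ablation, here's a generic PyTorch sketch (this is not FLUX's actual architecture; the module names, shapes, and projection structure are all hypothetical):

```python
import torch
import torch.nn as nn

class ToyDualEncoderDiT(nn.Module):
    """Toy stand-in for a model that ingests a pooled CLIP embedding
    and T5 features through separate projections (all hypothetical)."""
    def __init__(self):
        super().__init__()
        self.clip_proj = nn.Linear(768, 3072)   # pooled CLIP -> hidden
        self.t5_proj = nn.Linear(4096, 3072)    # T5 features -> hidden

    def forward(self, clip_emb, t5_emb):
        return self.clip_proj(clip_emb) + self.t5_proj(t5_emb)

model = ToyDualEncoderDiT()

# "Zero the weights through which the CLIP input is ingested":
with torch.no_grad():
    model.clip_proj.weight.zero_()
    model.clip_proj.bias.zero_()

clip_emb = torch.randn(1, 768)
t5_emb = torch.randn(1, 4096)
out = model(clip_emb, t5_emb)
# The CLIP branch now contributes nothing; the output depends only on T5.
```

If the output barely changes after an ablation like this, the zeroed branch was doing little useful work.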

Nice catch. Looks like engineers tried to take care of the GTM part as well and (surprise!) messed it up. In any case, the biggest loser here is Europe once again.