
Comment by vunderba

4 days ago

Updating the GenAI comparison website is starting to feel a bit Sisyphean with all the new models coming out lately, but the results are in for the Flux 2 Pro Editing model!

https://genai-showdown.specr.net/image-editing

It scored slightly higher than BFL's Kontext model, coming in around the middle of the pack at 6 / 12 points.

I’ll also be introducing an additional numerical metric soon, so we can add more nuance to how we evaluate model quality as they continue to improve.

If you're solely interested in seeing how Flux 2 Pro stacks up against the Nano Banana Pro, and another Black Forest model (Kontext), see here:

https://genai-showdown.specr.net/image-editing?models=km,nbp...

Note: It should be called out that BFL seems to support a more formalized JSON structure for more granular edits so I'm wondering if accuracy would improve using it.
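For illustration, a structured edit request along those lines might look like the following. This is a hypothetical sketch in Python; the field names ("edits", "target", "operation", "preserve") are my own assumptions for illustration, not BFL's documented schema.

```python
# Hypothetical sketch of a structured JSON edit prompt of the kind the
# comment refers to. All field names are illustrative assumptions, not
# BFL's actual schema.
import json

edit_request = {
    "edits": [
        {
            "target": "the red car in the foreground",  # region described in text
            "operation": "recolor",
            "value": "matte black",
        },
        {
            "target": "background sky",
            "operation": "replace",
            "value": "overcast, late afternoon",
        },
    ],
    # Explicitly listing what must NOT change is what could make this more
    # granular than a free-text instruction.
    "preserve": ["composition", "lighting", "all other objects"],
}

prompt = json.dumps(edit_request, indent=2)
print(prompt)
```

The plausible upside of a schema like this is that "what to change" and "what to keep" are separated explicitly, which free-text prompts tend to blur.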

The comparisons are very useful but also quite limited in terms of styles. Models vary enormously in how well they follow a given style versus steering toward their own defaults.

It's pretty obvious that OpenAI is terrible at it -- its output is known for an unmissable house touch. For Flux, though, it really depends on the style. They posted at some point that they changed their training to avoid averaging different styles together, which produces the ultimate "AI look". But this is at odds with the goal of directly generating visually appealing images, so style matching is going to be a problem for a while, at least.

  • The site is broken up into "Editing Comparison" and "Generative Comparison" sections.

    Generative: https://genai-showdown.specr.net

    Editing: https://genai-showdown.specr.net/image-editing

    Style is mostly irrelevant for editing, since the goal is to integrate seamlessly with the existing image. The focus is on performing relatively surgical edits or modifications to existing imagery while minimizing changes to the rest of the image. It is also primarily concerned with realism, though there are some illustrative examples (the JAWS poster, Great Wave off Kanagawa).

    This contrasts with the generative section though even then the emphasis is on prompt adherence, and style/fidelity take a backseat (which is honestly what 99% of existing generative benchmarks already focus on).

    • Oh, thank you for your reply. We may have different definitions of style and of what editing means.

      If you look, for example, at "Mermaid Disciplinary Committee", every single image is in a very different style, each of which you can consider the default the model assumes for that specific prompt. It's quite obvious that these styles were baked into the models, and it's not clear how much you can steer them toward a specific style. If you look at "The Yarrctic Circle", a lot more models default to a kind of "generic concept art" style (the "by greg rutkowski" meme), but even then I would classify the results as at least 5 distinct styles. So for me this benchmark is not checking style at all, unless you consider style to be just around 4 categories (cartoon, anime, realistic, painterly).

      So regarding image editing, I did my own tests at the first release of the Flux tools, and found it was almost impossible to get decent results in some specific styles, particularly cartoon and concept-art styles. I think the tools focus on what imaginary marketing people would want (like "put this can of sugary beverage into an idyllic scene") rather than such use cases. So edits like "color this" would just be terrible, and certainly unusable.

    • I didn't go very far with my own benchmarks because my results were just so bad. But for example, here's a line art with the instruction to color it (I can't remember the prompt, I didn't take notes).

      https://woolion.art/assets/img/ai/ai_editing.webp

      The images are, in order: the original, ChatGPT, Flux.

      Still, you can see that ChatGPT just throws everything out and makes no attempt at respecting the style. Flux is quite bad, but it follows the design much more closely (although it gets completely confused by it), so it seems that with a whole lot of work you could get something out of it.


How much energy does BFL have to keep playing this game against Google and ByteDance (SeeDream)?

If their new fancy model is only middle of the pack, and they're not as open source as the Chinese Qwen image models (or ByteDance / Alibaba / Lightricks video models), what's the point?

It's not just prompt adherence; the image quality of Flux models has been pretty bad: plastic skin, inhumanely chiseled chins, that general faux "AI" aura.

Indeed, the Flux samples in your test suite that "pass" look God-awful. It might "pass" from a technical standpoint, but there's no way I'd choose Flux to solve my workflows. It looks bad.

(I wonder if they lack people on their data team with good aesthetic taste. It may be as simple as that.)

I think this company is struggling. They're pinned between Google and the Chinese. It's a tough, unenviable spot to be in.

I think a lot of the foundation model companies in media are having a really hard time: RunwayML, PikaLabs, LumaLabs. Some of them have pivoted hard away from solving media for everyone. I don't think they can beat the deep-pocketed hyperscalers or the Chinese ecosystem.

BFL just raised a massive round, so what do I know? I just can't help but feel that even though Runway raised similar money, they're struggling really hard now. And I would really not want to be fighting against Google who is already ahead in the game.

  • i may be wrong, but it doesn't seem like BFL is struggling to me. they were apparently founded in august 2024, and have already signed $100M+ revenue deals with customers like meta (https://www.bloomberg.com/news/articles/2025-09-09/meta-to-p...)

    in fact, it seems like BFL has benefited a lot by becoming the go-to alternative for big enterprise customers who don't want to be dependent on google

    • Wow, I didn't hear about this. That's impressive, and kudos to the team.

      That's why they raised the massive round, then.

      But this just leads to more questions - I have to wonder if, and for how long, this is just going to plug a gap in Meta's own AI product offering. At some point they'll want to build their own in-house models or perhaps just acquire BFL. Zuckerberg would not be printing AI data centers if that wasn't the case.

      From a PLG standpoint, Flux isn't really what graphics designers are choosing for their work. The generations look worse than OpenAI's "piss filter". But aesthetics might not be the play the team is going after.

      Hopefully they don't just burn all of this dry powder trying to race Google. They should start listening to designers and get in their good graces if their intent is to build tools for art and graphic design work.

      A good press release would consist of lots of good looking images and a video of workflows that save artists time. This press release doesn't connect with graphics designers at all and it reads as if they aren't even the audience.

      If it's something else, more "enterprise", that BFL is after, then maybe I don't know the strategy or game plan.


  • Sadly, I tend to agree. I'm rooting for BFL, but the results from this latest model (the Pro version, of all things) have just been a bit disappointing. Google’s release of NB Pro last week certainly didn’t help either, since it set the bar so incredibly high.

    Flux 2 Pro only scored a single point higher than the Kontext models they released over half a year ago.

    The text-to-image side was even more frustrating. It often felt like it was actively fighting me, as evidenced by the high number of re-rolls required before it passed some of the tests (Cubed⁵, for example).

Clearly Google is winning this by some margin

Seedream is also very good and makes me think the next version will challenge Google for SOTA image gen

Increasingly feels like image gen is a solved problem

  • I think the margin isn't that large, to be honest. Given the resources and data Google has available, it is quite tiny and perhaps should be larger.

    Also, it doesn't feel solved to me at all. There is no general model; perhaps one cannot reasonably exist. I think these benchmarks are smart, but they don't show the whole picture.

    Domain-specific image generation tasks still require domain-specific models. For art purposes, SD1.5 with specialized, finely tuned checkpoints will still provide the best results by far. It is also limited, but I think it dampened the hype for new image generators significantly.

    • Does SD1.5 suffer from resolution / coherence / complexity issues?

      I understand most outputs could be fine-tuned for most domains, but I still felt SD1.5 had a resolution ceiling and a complexity ceiling no matter how good the fine-tuning.


  • Prompt understanding will only ever be as good as the language embeddings that are fed into the model’s input. Google’s hardware can host massive models that will never be run on your desktop GPU. By contrast, Flux and its kin have to make do with relatively tiny LLMs (Qwen Image uses a 7B-param LLM).

Hey, I hope you see this. The scoring needs to be 0-10 or something with a range, rather than pass or fail. Flux getting the same score for the surfer as Gemini 3 Pro reduces the quality of the benchmark.

  • Hi bn-l, yeah as mentioned above and in the Release Notes - we'll be adding a more nuanced numerical score in the next week.

    I don't know if I'll get as granular as 1-10, only because the finer the scoring, the more potential for subjectivity. That's why it was initially set up as a "Minimum Passing Criteria" rule set along with a Pass/Fail grade.

    A suggestion from a previous HN post was something along the lines of (0 Fail, 0.5 Technical Pass, 1.0 Proficient Pass).
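As a minimal sketch of how that suggested three-tier rubric could be tallied into an aggregate score (model names and per-prompt grades below are invented placeholders, not real benchmark data):

```python
# Three-tier rubric suggested in the thread:
# 0.0 = Fail, 0.5 = Technical Pass, 1.0 = Proficient Pass.
# Per-prompt grades here are invented placeholders.

FAIL, TECHNICAL_PASS, PROFICIENT_PASS = 0.0, 0.5, 1.0

grades = {
    "model-a": [PROFICIENT_PASS, TECHNICAL_PASS, FAIL, PROFICIENT_PASS],
    "model-b": [TECHNICAL_PASS, TECHNICAL_PASS, TECHNICAL_PASS, FAIL],
}

def total(scores):
    """Sum of per-prompt grades, out of len(scores) possible points."""
    return sum(scores)

for name, scores in grades.items():
    print(f"{name}: {total(scores)} / {len(scores)}")
```

This keeps the existing pass/fail intuition (a model still "passes" a prompt at 0.5 or above) while letting the leaderboard distinguish a barely-acceptable edit from a proficient one.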