FLUX.2 [Klein]: Towards Interactive Visual Intelligence

1 day ago (bfl.ai)

I haven’t gotten around to adding Klein to my GenAI Showdown site yet, but if it’s anything like Z-Image Turbo, it should perform extremely well.

For reference, Z-Image Turbo scored 4 out of 15 points on GenAI Showdown. I'm aware that doesn't sound like much, but one of the largest models, Flux.2 (32b), only managed to outscore ZiT (a 6b model) by a single point despite being significantly heavier-weight, so that's still damn impressive.

Local model comparisons only:

https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt

  • Can you fix the information bubble on mobile please? When pressing one, it vanishes instantly...

    • Hey Bombthecat, sorry about that! I can't repro this issue on any of the devices I have (Android Pixel 7, an iPad, etc).

      If you get a chance, could you list your mobile device specs? That way I can at least try it on Browserstack and see if I can figure out a fix.

      2 replies →

  • I think it shows problems with your tests tbh. The bigger models are way more capable than you make them out to be. They are also better trained at understanding CGI render outputs used as references, like normal maps or ID masks. Your testing suite is the perfect example that structured data implies false confidence. Pure t2i is not a good benchmark anymore.

    • Thanks for the feedback.

      > The bigger models are way more capable than you make them out to be.

      No test suite is ever going to be perfect. GenAI Showdown was started with the goal of focusing on a very narrow spectrum of testing (prompt adherence) because, as a creator, that's the aspect of most interest to me.

      > Pure t2i is not a good benchmark anymore

      Just FYI Image Editing is already a separate benchmark (see the navbar at the top).

      > Your testing suite is the perfect example that structured data implies false confidence

      Again - the headline is "Specific prompts and challenges with a strong emphasis placed on adherence". If I tried to capture every possible aspect of GenAI models (multimodal, texture maps, periodic motion, tiling, etc) - I'd be at it until the heat death of the universe.

      Incidentally - which model (specifically) do you think is ranked unfairly? While Flux.2 [dev] did only score a single point above ZiT, its weighted score is much higher (1442 points vs 911 points).

I am amazed, though not entirely surprised, that these models keep getting smaller while the quality and effectiveness increase. Z-Image Turbo is wild; I'm looking forward to trying this one out.

An older thread on this has a lot of comments: https://news.ycombinator.com/item?id=46046916

  • There are probably some more subtle tipping points that small models hit too. One of the challenges of a 100GB model is that there is non-trivial difficulty in downloading and running the thing that a 4GB model doesn't face. At 4GB I think it might be reasonable to assume that most devs can just try it and see what it does.

  • Quality is increasing, but these small models have very little knowledge compared to their big brothers (Qwen Image/Full size Flux 2). As in characters, artists, specific items, etc.

    • Agreed - given what Tongyi-MAI Lab was able to accomplish with a 6b model, I would love to see what they could do with something larger - somewhere in the range of 15-20b, between these smaller models (ZiT, Klein) and the significantly larger ones like Flux.2 dev.

    • I smell the bias-variance tradeoff. By underfitting more, they get closer to the degenerate case of a model that only knows one perfect photo.

  • Is there a theoretical minimum number of params for a given output? I saw news about GPT-3.5, then DeepSeek training models at a fraction of that cost, then laptops running a model that beats 3.5. When does it stop?

It cannot create an image of a pogo stick.

I was trying to get it to create an image of a tiger jumping on a pogo stick, which is way beyond its capabilities, but it cannot create an image of a pogo stick in isolation.

  • When given an image of an empty wine glass, it can't fill it to the brim with wine. The pogo stick drawers and wine glass fillers can enjoy their job security for months to come!

    • You can still taste wine in the metaverse with the mouth adapter and can get a buzz by gently electrifying your neuralink (time travel required)

  • It's a tough test for local models (gpt-image and NB had zero problems) - the only one that came reasonably close was Qwen-Image.

    Z-Image / Flux 2 / Hidream / Omnigen2 / Qwen Samples:

    https://imgur.com/a/tB6YUSu

    This is where smaller models are just going to be more constrained and will require additional prompting to coax out the physical description of a "pogo stick". I had similar issues when generating Alexander the Great leading a charge on a hippity-hop / space hopper.

  • You are right - I just tried it, and even with reference images it can't do it for me. Maybe with some good prompting.

    Because in theory I would say that knowledge is something that does not have to be baked into the model, but could be added via reference images if the model is capable enough to reason about them.

> FLUX.2 [klein] 4B The fastest variant in the Klein family. Built for interactive applications, real-time previews, and latency-critical production use cases.

I wonder what kind of use cases could be "latency-critical production use cases"?

If we think of GenAI models as a form of compression: generally, text compresses extremely well. Images and video do not. Yet state-of-the-art text-to-image and text-to-video models are often much smaller (in parameter count) than large language models like Llama-3. Maybe vision models are small because we're not actually compressing very much of the visual world. The training data covers a narrow, human-biased manifold of common scenes, objects, and styles; the combinatorial space of visual reality remains largely unexplored. I am curious about what else is out there, outside the human-biased manifold.

  • > Generally, text compresses extremely well. Images and video do not.

    Is that actually true? I'm not sure it's fair to compare lossless compression ratios of text (abstract, noiseless) to images and video that innately have random sampling noise. If you look at humanly indistinguishable compression, I'd expect that you'd see far better compression ratios for lossy image and video compression than lossless text.

    • The comparison makes sense in what I am charitably assuming is the case the GP is referring to: we know how to build a tight embedding space from a text corpus and get outputs from it tolerably similar to the inputs, for the purposes they're put to. That is lossy compression, just not in the sense anyone talking about conventional lossless text compression algorithms would use the words. I'm not sure we can say the same of image embeddings.

  • Images and video compress vastly better than text. You're lucky to get 4:1 to 6:1 compression of text (1), while the best perceptual codecs for static images are typically visually lossless at 10:1 and still look great at 20:1 or higher. Video compression is much better still due to temporal coherence.

    1: Although it looks like the current Hutter competition leader is closer to 9:1, which I didn't realize. Pretty awesome by historical standards.
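
    If you want to sanity-check those ratios yourself, here is a rough Python sketch (illustrative only: the file names are placeholders, and zlib / JPEG merely stand in for "a decent lossless text codec" and "a typical lossy image codec" - the exact numbers depend heavily on what you feed it):

      import zlib
      from io import BytesIO
      from pathlib import Path

      from PIL import Image  # pip install Pillow

      def text_ratio(path):
          # Lossless DEFLATE compression of the raw text bytes.
          raw = Path(path).read_bytes()
          return len(raw) / len(zlib.compress(raw, level=9))

      def image_ratio(path, quality=85):
          # Lossy JPEG re-encode, measured against uncompressed 24-bit RGB.
          img = Image.open(path).convert("RGB")
          buf = BytesIO()
          img.save(buf, format="JPEG", quality=quality)
          return (img.width * img.height * 3) / buf.tell()

      print(f"text  (lossless zlib):  {text_ratio('corpus.txt'):.1f}:1")
      print(f"image (lossy JPEG q85): {image_ratio('photo.png'):.1f}:1")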

  • I find it likely that we are still missing a few major efficiency tricks with LLMs. But I would also not underestimate the amount of implicit knowledge and skill an LLM is expected to carry on a meta level.

I appreciate that they released a smaller version that is actually open source. It creates a lot more opportunities when you do not need a massive budget just to run the software. The speed improvements look pretty significant as well.

Flux2 Klein isn't some generational leap or anything. It's good, but let's be honest, this is an ad.

What will be really interesting to me is the release of Z-Image: if that goes the way it's looking, it'll be a natural-language SDXL 2.0, which seems to be what people really want.

Releasing the Turbo/Distilled/Finetune months ago was a genius move really. It hurt Flux and Qwen releases on a possible future implication alone.

If this was intentional, I can’t think of the last time I saw such shrewd marketing.

  • The team behind Z-Image Turbo has told us multiple times in their paper that the output quality of the Turbo model is superior to that of the multi-step base model.

    I think that information still hasn't gotten through to most users.

    "Notably, the resulting distilled model not only matches the original multi-step teacher but even surpasses it in terms of photorealism and visual impact."

    "It achieves 8-step inference that is not only indistinguishable from the 100-step teacher but frequently surpasses it in perceived quality and aesthetic appeal"

    https://arxiv.org/abs/2511.22699
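
    For anyone who wants to see the step-count difference in practice, here is a minimal sketch, assuming the weights ship as a diffusers-compatible pipeline (the repo id below is a placeholder, not a confirmed model name, and the prompt is arbitrary):

      import torch
      from diffusers import DiffusionPipeline  # pip install diffusers

      # Placeholder repo id - substitute whatever the lab actually publishes.
      pipe = DiffusionPipeline.from_pretrained(
          "Tongyi-MAI/Z-Image-Turbo",
          torch_dtype=torch.bfloat16,
      ).to("cuda")

      # The distilled model is advertised at 8 steps; the teacher in the
      # paper samples for ~100 steps, which is the point of the quotes above.
      image = pipe(
          "a tiger jumping on a pogo stick, studio lighting",
          num_inference_steps=8,
      ).images[0]
      image.save("turbo_8_steps.png")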

  • I'm a bit confused: both you and another commenter mention something called Z-Image, presumably another Flux model?

    Your framing of it is speculative, i.e. it is forthcoming. Theirs is present tense. Could I trouble you to give us plebes some more context? :)

    e.g. Parsed as is, and setting aside the general confusion if you're unfamiliar, it is unclear how one can observe "the way it is looking", especially if Turbo was released months ago and there is some other model that is unreleased. Chose to bother you because the other's comment was less focused on lab-on-lab strategy.