Comment by 542458
4 days ago
> Run FLUX.2 [dev] on GeForce RTX GPUs for local experimentation with an optimized fp8 reference implementation of FLUX.2 [dev], created in collaboration with NVIDIA and ComfyUI.
Glad to see that they're sticking with open weights.
That said, Flux 1.x was 12B params, right? So this is about 3x as large, plus a 24B text encoder (unless I'm misunderstanding), which could make local use a significant challenge. I'll be looking forward to the distilled version.
Looking at the file sizes on the open-weights version (https://huggingface.co/black-forest-labs/FLUX.2-dev/tree/mai...), the 24B text encoder is 48GB and the generation model itself is 64GB, which roughly tracks with the 32B parameters mentioned.
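Those file sizes line up with plain parameter-count arithmetic. A back-of-the-envelope sketch (the parameter counts are the ones mentioned in the thread, and real checkpoints carry a little extra overhead for metadata):

```python
def checkpoint_size_gb(params_billion: float, bytes_per_param: int) -> float:
    """Rough on-disk size of a checkpoint: parameters x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# bf16/fp16 stores 2 bytes per parameter
print(checkpoint_size_gb(24, 2))  # text encoder: ~48 GB
print(checkpoint_size_gb(32, 2))  # generation model: ~64 GB
```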
Downloading over 100GB of model weights is a tough sell for the local-only hobbyists.
100 GB is less than a game download; it's actually running it that's the tough sell. That said, the linked blog post seems to say the optimized model is both smaller and greatly improves streaming weights from system RAM, so maybe it's actually reasonably usable on a single 4090/5090-type setup (I'm not at home to test).
Never mind the download size. Who has the VRAM to run it?
I do, 2x Strix Halo machines ready to go.
The download is a trivial one-time cost, and so is storing it on a direct-attached NVMe SSD. The expensive part is getting a GPU with 64GB of memory.
Not even a 5090 can handle that. You'd have to use multiple GPUs.
So the only option will be [klein] on a single GPU... maybe? Since we don't have much information.
As far as I know, no open-weights image-gen model supports multi-GPU inference except in the trivial sense that you can generate two images in parallel. The model either fits into the VRAM of a single card or it doesn't. A 5-ish-bit quantization of a 32B-parameter model would be usable by owners of 24GB cards, and very likely someone will create one.
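The 5-bit claim checks out on paper. A quick sketch (note this ignores the per-block scales and zero-points that real quantization formats add, which cost a few percent extra, and it leaves activations/latents out entirely):

```python
def quant_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint at a given quantization bit-width
    (ignores per-block scales/zero-points, which add a few percent)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quant_weights_gb(32, 5))   # ~20 GB: fits a 24GB card with some headroom
print(quant_weights_gb(32, 16))  # ~64 GB: the full-precision baseline
```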
> Not even a 5090 can handle that. You'd have to use multiple GPUs.
It takes about 40GB with the fp8 version fully loaded. But with enough system RAM available, ComfyUI can partially load models into VRAM during inference and swap weights in as needed (at reduced speed), so it can run on systems with too little VRAM to hold the full model. The NVIDIA page linked in the BFL announcement specifically highlights NVIDIA working with ComfyUI to improve this existing capability precisely to enable Flux.2.
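To get a feel for what that swapping costs, here's a rough sketch of how much of the model would have to sit in system RAM and stream over PCIe. The ~40GB fp8 working set comes from the comment above; the 2GB reserve for activations/latents and the per-card VRAM figures are my own assumptions:

```python
def vram_spillover_gb(model_gb: float, vram_gb: float, reserve_gb: float = 2.0) -> float:
    """How much of the model cannot stay resident in VRAM and must be
    streamed from system RAM, keeping a reserve for activations/latents."""
    return max(0.0, model_gb - (vram_gb - reserve_gb))

# Assumed ~40 GB fp8 working set, 32 GB on a 5090, 24 GB on a 4090
print(vram_spillover_gb(40, 32))  # 5090: ~10 GB held in system RAM
print(vram_spillover_gb(40, 24))  # 4090: ~18 GB held in system RAM
```

So on paper a 5090 only needs to stream a quarter of the weights, which is consistent with the "reduced speed but usable" framing above.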