Comment by minimaxir

4 days ago

Looking at the file sizes of the open-weights release (https://huggingface.co/black-forest-labs/FLUX.2-dev/tree/mai...), the 24B text encoder is 48GB and the generation model itself is 64GB, which roughly tracks with the 32B parameters mentioned.
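The sizes line up if you assume the checkpoints are stored in bf16 at 2 bytes per parameter (an assumption; the repo could mix dtypes):

```python
# Rough checkpoint-size check, assuming bf16 storage (2 bytes per parameter).
def checkpoint_gb(params_billion, bytes_per_param=2):
    """Approximate on-disk size in (decimal) GB for a dense model."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(checkpoint_gb(24))  # text encoder: 48.0 GB
print(checkpoint_gb(32))  # generation model: 64.0 GB
```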

Downloading over 100GB of model weights is a tough sell for the local-only hobbyists.

100 GB is less than a game download; actually running it is the tough sell. That said, the linked blog post seems to say the optimized model is both smaller and greatly improves streaming from system RAM, so maybe it's actually reasonably usable on a single 4090/5090-type setup (I'm not at home to test).

Never mind the download size. Who has the VRAM to run it?

  • I do, 2x Strix Halo machines ready to go.

    • (Fellow Strix Halo owner): I don't really like calling it VRAM, any more than I would when a dGPU dynamically maps a portion of system RAM. It's really just a system with quad-channel RAM speeds attached to a GPU without VRAM: roughly 2x the performance of using system RAM on my 2-channel desktop, whereas the actual VRAM on that desktop's dGPU is something like 20x.

      That's great, and I love the little laptop for the amount of x86 perf it packs into so little cooling, but my used Epyc box of roughly the same price is usually faster for AI (despite having no video card at all) and can load models 3x the size (well, before RAM prices doubled this past month), because it has modular 12-channel RAM, and memory speeds this low don't really need a GPU to keep up with the matrix math. Meanwhile, Flux is already slow even when it's sitting in actual high-bandwidth dedicated VRAM.
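The bandwidth tiers being compared can be roughed out from channel count and transfer rate. These are nominal peak figures under assumed configurations, not measurements:

```python
# Peak memory bandwidth in GB/s for a DDR-style system.
# All configurations below are illustrative assumptions.
def ddr_bandwidth_gbs(mt_per_s, channels, bus_width_bits=64):
    """transfers/s * channels * bytes-per-transfer, in decimal GB/s."""
    return mt_per_s * 1e6 * channels * (bus_width_bits / 8) / 1e9

desktop_2ch = ddr_bandwidth_gbs(5600, 2)    # ~90 GB/s, DDR5-5600 dual channel
strix_halo  = ddr_bandwidth_gbs(8000, 4)    # ~256 GB/s, 256-bit LPDDR5X-8000
epyc_12ch   = ddr_bandwidth_gbs(4800, 12)   # ~461 GB/s, DDR5-4800 x 12 channels
# High-end dGPU GDDR sits around 1000-1800 GB/s, which is where the
# "something like 20x" over desktop system RAM comes from.
```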

The download is a trivial one-time cost, and so is storing it on a direct-attached NVMe SSD. The expensive part is getting a GPU with 64GB of memory.

Not even a 5090 can handle that. You have to use multiple GPUs.

So the only option on a single GPU will be [klein]... maybe? We don't have much information yet.

  • As far as I know, no open-weights image gen tech supports multi-GPU workflows except in the trivial sense that you can generate two images in parallel. The model either fits into the VRAM of a single card or it doesn’t. A 5ish-bit quantization of a 32B-weight model would be usable by owners of 24GB cards, and very likely someone will create one.
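The quantization arithmetic behind that claim, as a sketch (counting weights only; activations, the KV/attention working set, and framework overhead add several GB on top):

```python
# Approximate weight footprint at a given quantization level, weights only.
def quantized_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quantized_gb(32, 5))   # 20.0 GB: a ~5-bit quant of 32B weights
print(quantized_gb(32, 8))   # 32.0 GB: fp8 weights alone, before overhead
```

So a ~5-bit quant leaves a few GB of headroom on a 24GB card, while fp8 weights alone already fill a 5090's 32GB.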

  • > Not even a 5090 can handle that. You have to use multiple GPUs.

    It takes about 40GB with the fp8 version fully loaded, but given enough system RAM, ComfyUI can partially load models into VRAM and swap as needed during inference (at reduced speed), so systems with too little VRAM to hold the whole model can still run it. The NVidia page linked in the BFL announcement specifically highlights NVidia working with ComfyUI to improve this existing capability precisely to enable Flux.2.
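How painful that swapping is depends mostly on transfer bandwidth. A worst-case back-of-the-envelope estimate, assuming the overflowing portion of the weights is re-streamed over PCIe once per denoising step (real schedulers overlap transfer with compute, so this is pessimistic; all figures are assumptions for illustration):

```python
# Worst-case per-step swap overhead when a model doesn't fit in VRAM.
# Figures below are illustrative assumptions, not measurements.
def swap_seconds_per_step(model_gb, vram_gb, pcie_gbs):
    """Seconds spent re-streaming the portion of weights that doesn't fit."""
    overflow = max(model_gb - vram_gb, 0)
    return overflow / pcie_gbs

# fp8 model ~40 GB on a 32 GB 5090, assuming ~50 GB/s effective PCIe 5.0 x16:
t = swap_seconds_per_step(40, 32, 50)
print(f"{t:.2f} s of transfer per step")  # 0.16 s, times however many steps
```

With 8GB of overflow the transfer tax is modest; with a full bf16 model on a 24GB card the overflow is several times larger and dominates generation time.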