Comment by yayr

4 years ago

Then, how far away are we from having it on M1/M2 Macs, at least with regular processing? OpenVINO may be one path, I suppose: https://github.com/openvinotoolkit/openvino/issues/11554

I found this repo early on and have been using it to run inference on my M1 Pro MBP. https://github.com/ModeratePrawn/stable-diffusion-cpu

For me it runs at about 3.5 seconds per iteration per picture at 512x512.

There is also a fork that uses Metal here and is much faster: https://github.com/magnusviri/stable-diffusion/tree/apple-si... but it doesn't support seeding the RNG and will occasionally produce completely black output. It's useful if you want to spit out a whole bunch of images for one prompt, but you lose the ability to re-run a specific seed with a tweaked prompt or increased iterations.

  • > For me it runs at about 3.5 seconds per iteration per picture at 512x512.

    Wow, that's impressively fast. I have a relatively recent Nvidia GPU that still takes 10 seconds, and the GPU alone is already almost as big as the entire MacBook.

I'm using the fork here: https://github.com/magnusviri/stable-diffusion.git (apple-silicon-mps-support branch).

Pretty easy to set up, though I had to take all the Homebrew stuff out of my environment before setting up the Conda environment (alternatively, just exporting GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 worked, at least in my case).
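For reference, the workaround can be sketched like this (a sketch, not the repo's official instructions; the flags make pip build grpcio against the system OpenSSL/zlib rather than Homebrew's copies, and the environment file name is a guess — check the fork's README):

```shell
# Build grpcio against the system OpenSSL/zlib instead of Homebrew's copies,
# so the Conda environment setup doesn't trip over Homebrew headers/libs
export GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1
export GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1
# then create the Conda environment as usual, e.g.:
# conda env create -f environment-mac.yaml
```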

Otherwise, I followed the normal steps to set things up, and I'm now generating 1 image every 30 seconds at default settings. This is on an M1 Max MacBook Pro with 64 GB of RAM.

Looks like there is an easier path using Metal shaders: https://dev.to/craigmorten/setting-up-stable-diffusion-for-m...

and https://github.com/magnusviri/stable-diffusion/tree/apple-si...

  • I've been using this on my M1 Max and it works pretty well: 1.65 iterations per second (full precision, whereas my PC's 3080 can only do half precision due to limited memory), so a 50-iteration image in about 40 seconds.

    • Your 3080 should be able to do full precision. Are you sure you don’t have the batch size set greater than 1, or another issue along those lines?

    • > full precision, whereas my PC's 3080 can only do half-precision due to limited memory

      What model are you using? I've been running full-precision SD1.4 on my 3070, albeit with less than 10% VRAM headroom.

  • This worked fine for me, and running it side by side with an Intel CPU + Nvidia 2070, it actually does not take much longer (and, as a sibling said, it seems to be working at full precision). It is one of the first things I've done that has properly made my M1 Max's fan spin up hard, though!

PyTorch for M1 (https://pytorch.org/blog/introducing-accelerated-pytorch-tra... ) will not work: https://github.com/CompVis/stable-diffusion/issues/25 says "StableDiffusion is CPU-only on M1 Macs because not all the pytorch ops are implemented for Metal. Generating one image with 50 steps takes 4-5 minutes."

  • Yeah, you can. Using the mps backend, just set PYTORCH_ENABLE_MPS_FALLBACK=1 to run unimplemented ops on the CPU. It takes a minute, but it's mostly GPU-accelerated.

  • By comparison, I can generate 512x512 images every 15 seconds on an RTX 3080 (although there's an initial 30-second setup penalty for each run).
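The MPS-fallback approach mentioned above can be sketched as follows (a minimal sketch; the sampler invocation is illustrative, and the variable must be in the environment before the Python process that imports torch starts):

```shell
# Fall back to the CPU for any PyTorch op not yet implemented on the
# Metal (MPS) backend; must be exported before torch is imported
export PYTORCH_ENABLE_MPS_FALLBACK=1
# then run the sampler as usual, e.g.:
# python scripts/txt2img.py --prompt "a photo of an astronaut riding a horse"
```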

I got it working in about an hour on an M1 Ultra, mostly compiling things and tweaking some model code to be compatible with Metal. It works pretty well: about 1/10 to 1/20 of the performance I can get on a 3080.