Comment by scottmf

1 month ago

I independently did the same with an MLX implementation on Sunday (also with Claude Code).

I expected this C implementation to be notably faster, but my M3 Max (36GB) could barely make it past the first denoising step before OOMing (at 512x512).

Am I doing something wrong? The MLX implementation takes ~1 sec per step with the same model and dimensions: https://x.com/scottinallcaps/status/2013187218718753032