
Instructions for setup and running on Mac Silicon chips #25

Open

crsrusl opened this issue Aug 16, 2022 · 365 comments

@crsrusl

crsrusl commented Aug 16, 2022

Hi,

I’ve heard it is possible to run Stable Diffusion on Apple Silicon (albeit slowly); it would be good to include basic setup instructions for doing this.

Thanks,
Chris

@thelamedia

I heard that PyTorch's latest nightly release now includes Apple MPS support as well. Will this improve performance on M1 devices by utilizing Metal?

@mja

mja commented Aug 20, 2022

With Homebrew

brew install python@3.10
pip3 install torch torchvision
pip3 install setuptools_rust
pip3 install -U git+https://github.com/huggingface/diffusers.git
pip3 install transformers scipy ftfy

Then start python3 and follow the instructions for using diffusers.

Stable Diffusion is CPU-only on M1 Macs because not all of the PyTorch ops are implemented for Metal. Generating one image with 50 steps takes 4-5 minutes.
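
For reference, here is a minimal sketch of the diffusers flow those instructions point to, assuming you have accepted the CompVis/stable-diffusion-v1-4 license on Hugging Face and run `huggingface-cli login` first (the exact arguments and output fields have shifted between diffusers versions):

```
from diffusers import StableDiffusionPipeline

# Download the weights; older diffusers releases needed use_auth_token=True here.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cpu")  # CPU-only, per the comment above; expect ~4-5 minutes for 50 steps

result = pipe("a photograph of an astronaut riding a horse")
image = result.images[0]  # older releases returned result["sample"][0] instead
image.save("astronaut.png")
```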

@frenchie1980

Hi @mja,

thanks for these steps. I can get as far as the last one, but then installing transformers fails with this error (the install of setuptools_rust was successful):

      running build_ext
      running build_rust
      error: can't find Rust compiler

      If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.

      To update pip, run:

          pip install --upgrade pip

      and then retry package installation.

      If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

For context: the first step, installing Python 3.10 with brew, failed, so I did it with Conda instead. Not sure if having a full Anaconda env installed is the problem.

@filipux

filipux commented Aug 20, 2022

Just tried the PyTorch nightly build with MPS support and have some good news.

On my CPU (M1 Max) it runs very slowly, almost 9 minutes per image, but with MPS enabled it's ~18x faster: less than 30 seconds per image 🤩

@thelamedia

Incredible! Would you mind sharing your exact setup so I can duplicate it on my end?

@filipux

filipux commented Aug 20, 2022

Unfortunately I got it working by many hours of trial and error, and in the end I don't know what worked. I'm not even a programmer, I'm just really good at googling stuff.

Basically my process was:

  • install pytorch nightly
  • update osx (12.3 required, mine was at 12.1)
  • use a conda environment, I could not get it to work without it
  • install missing packages using either pip or conda (one of them usually works)
  • go through every file and change torch.device("cpu"/"cuda") to torch.device("mps")
  • in register_buffer() in ddim.py, change to attr = attr.to(torch.device("mps"), torch.float32)
  • in layer_norm() in functional.py (part of pytorch I guess), change to return torch.layer_norm(input.contiguous(), ...
  • in terminal write export PYTORCH_ENABLE_MPS_FALLBACK=1
  • + dozens of other things I have forgotten about...

I'm sorry that I can't be more helpful than this.
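
In case it helps anyone retracing these steps, here is a minimal sketch of the device-selection part described above, assuming a PyTorch build new enough to expose torch.backends.mps (macOS 12.3+ and a mid-2022 nightly or later):

```
import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")  # safest to set before importing torch

import torch
import torch.nn as nn

# Prefer the Metal backend when it is available, otherwise fall back to the CPU,
# instead of hard-coding torch.device("cuda") / .cuda() calls.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Stand-in for the diffusion model: move modules and tensors with .to(device).
model = nn.Linear(16, 16).to(device)
x = torch.randn(1, 16, device=device)
print(model(x).device)  # mps:0 when the Metal backend is active
```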

@thelamedia

Thanks. What are you currently using for checkpoints? Are you using research weights or are you using another model for now?

@magnusviri

I don't have access to the model so I haven't tested it, but based off of what @filipux said, I created this pull request to add mps support. If you can't wait for them to merge it you can clone my fork and switch to the apple-silicon-mps-support branch and try it out. Just follow the normal instructions but instead of running conda env create -f environment.yaml, run conda env create -f environment-mac.yaml. I think the only other requirement is that you have to have macOS 12.3 or greater.

Raymonf added a commit to Raymonf/pytorch that referenced this issue Aug 21, 2022
@einanao

einanao commented Aug 21, 2022

I couldn't quite get your fork to work @magnusviri, but based on most of @filipux's suggestions, I was able to install and generate samples on my M2 machine using https://github.com/einanao/stable-diffusion/tree/apple-silicon

@Raymonf

Raymonf commented Aug 21, 2022

Edit: If you're looking at this comment now, you probably shouldn't follow this. Apparently a lot can change in 2 weeks!

Old comment

I got it to work fully natively without the CPU fallback, sort of. The way I did things is ugly since I prioritized making it work. I can't comment on speeds but my assumption is that using only the native MPS backend is faster?

I used the mps_master branch from kulinseth/pytorch as a base, since it contains an implementation for aten::index.Tensor_out that appears to work from what I can tell: https://github.com/Raymonf/pytorch/tree/mps_master

If you want to use my ugly changes, you'll have to compile PyTorch from scratch as I couldn't get the CPU fallback to work:

# clone the modified mps_master branch
git clone --recursive -b mps_master https://github.com/Raymonf/pytorch.git pytorch_mps && cd pytorch_mps

# dependencies to build (including for distributed)
# slightly modified from the docs
conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses pkg-config libuv

# build pytorch with explicit USE_DISTRIBUTED=1
USE_DISTRIBUTED=1 MACOSX_DEPLOYMENT_TARGET=12.4 CC=clang CXX=clang++ python setup.py install

I based my version of the Stable Diffusion code on the code from PR #47's branch, you can find my fork here: https://github.com/Raymonf/stable-diffusion/tree/apple-silicon-mps-support

Just your typical pip install -e . should work for this, there's nothing too special going on here, it's just not what I'd call upstream-quality code by any means. I have only tested txt2img, but I did try to modify knn2img and img2img too.

Edit: It definitely takes more than 20 seconds per image at the default settings with either sampler, not sure if I did something wrong. Might be hitting pytorch/pytorch#77799 :(


@magnusviri: You are free to take anything from my branch for yourself if it's helpful at all, thanks for the PR 😃

@magnusviri

@Raymonf: I merged your changes with mine, so they are in the pull request now. It caught everything that I missed and is almost identical to the changes that @einanao made as well. The only difference I could see was in ldm/models/diffusion/plms.py

einanao:

    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device("cuda"):
                attr = attr.type(torch.float32).to(torch.device("mps")).contiguous()

Raymonf:

    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device(self.device_available):
                attr = attr.to(torch.float32).to(torch.device(self.device_available))

I don't know what difference the code makes in practice, except that I read that adding .contiguous() fixes bugs when falling back to the CPU.

@einanao

einanao commented Aug 21, 2022 via email

@Raymonf

Raymonf commented Aug 21, 2022

@einanao Maybe not! How long does yours take to run the default seed and prompt with full precision? GNU time reports 4:52.5 (almost five minutes) with the fans at 100% on a 16-inch M1 Max, which is way longer than 20 seconds. I'm curious whether your use of the CPU fallback for some parts makes it faster at all.

@einanao

einanao commented Aug 21, 2022

It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022

@thelamedia

I'm getting this error when trying to run with the laion400 data set:

return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Is this an issue with the torch functional.py script?

@einanao

einanao commented Aug 22, 2022

Yes, see @filipux's earlier comment:

in layer_norm() in functional.py (part of pytorch I guess), change to return torch.layer_norm(input.contiguous(), ...

@thelamedia

thelamedia commented Aug 22, 2022

@einanao thank you. One step closer, but now I'm getting this:
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.mps.enabled) AttributeError: module 'torch.backends.mps' has no attribute 'enabled'

Here is my function:

def layer_norm(
    input: Tensor,
    normalized_shape: List[int],
    weight: Optional[Tensor] = None,
    bias: Optional[Tensor] = None,
    eps: float = 1e-5,
) -> Tensor:
    if has_torch_function_variadic(input, weight, bias):
        return handle_torch_function(
            layer_norm, (input.contiguous(), weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
        )
    return torch.layer_norm(input.contiguous(), normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
@byhringo

It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022

For benchmarking purposes: I'm at ~150s (2.5 minutes) for each run past the first, which was over 500s, after setting up with the steps in these comments.

14" 2021 MacBook Pro with base specs (M1 Pro chip).

@Automatt

This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core).

@henrique-galimberti

This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core).

What steps did you follow?
I tried three Apple Silicon forks, but they are all taking ~1 hour to generate an image using the sample command (python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms).
I'm using PyTorch nightly, btw.

@Automatt

Automatt commented Aug 22, 2022

@henrique-galimberti I followed these steps:

  • Install PyTorch nightly
  • Used this branch referenced above from @magnusviri
  • Modify functional.py as noted above here to resolve view size not compatible issue
@recurrence

recurrence commented Aug 22, 2022

MPS support for aten::index.Tensor_out is now in PyTorch nightly, according to Denis.

@recurrence

Looks like there's a ticket for the reshape error at pytorch/pytorch#80800

@magnusviri

mps support for aten::index.Tensor_out is now in pytorch nightly according to Denis

Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master. The last commit was 15 hours ago. But master has commits kind of non-stop for the last 9 hours.

@pnodseth

@henrique-galimberti I followed these steps:

  • Install PyTorch nightly
  • Used this branch referenced above from @magnusviri
  • Modify functional.py as noted above here to resolve view size not compatible issue

Where can I find the functional.py file?

@cgodley

cgodley commented Aug 22, 2022

Where can I find the functional.py file?

import torch
torch.__file__

For me the path is below. Your path will be different.

'/Users/lab/.local/share/virtualenvs/lab-2cY4ojCF/lib/python3.10/site-packages/torch/__init__.py'

Then replace __init__.py with nn/functional.py
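
A slightly more direct way to locate it (a small sketch of the same idea) is to ask Python for the module path itself:

```
import torch.nn.functional as F
print(F.__file__)  # e.g. .../site-packages/torch/nn/functional.py in your active environment
```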

@henrique-galimberti

henrique-galimberti commented Aug 22, 2022

I changed the conda env to use Rosetta and it is faster than before, but still way too slow:
(screenshot of generation times)

@recurrence

Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master.

It was merged 5 days ago so it should be in the regular PyTorch nightly that you can get directly from the PyTorch site.

@cgodley

cgodley commented Aug 22, 2022

@henrique-galimberti I followed these steps:

  • Install PyTorch nightly
  • Used this branch (https://github.com/CompVis/stable-diffusion/pull/47) referenced above from @magnusviri
  • Modify functional.py as noted above (https://github.com/CompVis/stable-diffusion/issues/25#issuecomment-1221667017) to resolve view size not compatible issue

I also followed these steps and confirmed MPS was being used (printed the return value of get_device()) but it's taking about 31.74s/it, which seems very slow.

  • macOS 12.5
  • MacBook Pro M1 14" base model (16GB of memory, 14 GPU cores)
@AntonEssenetial

You may have installed Conda for Apple Silicon. You need Conda for Intel. Download this: https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.pkg And install it normally. Quit Terminal and start it again, and see if that fixed the problem.

I've installed Miniconda from the repo.
Skipped this step: # install packages
PIP_EXISTS_ACTION=w CONDA_SUBDIR=osx-arm64 conda env create -f environment-mac.yaml
conda activate ldm

Then tried: python scripts/dream.py --full_precision

python scripts/dream.py --full_precision
Traceback (most recent call last):
File "/Users/anton/stable-diffusion/scripts/dream.py", line 12, in <module>
import ldm.dream.readline
ModuleNotFoundError: No module named 'ldm'

@hawtdawg

hawtdawg commented Sep 13, 2022

I believe "osx-arm64" is for M1 Macs, if you have an Intel Mac that's the wrong version. I'd just try and install Miniconda from the website and use that, just in case that's the issue.

@corajr

corajr commented Sep 13, 2022

@AntonEssenetial @hawtdawg is correct. Also, I believe your ldm environment exists and is broken due to the failed install.

Try removing it first with: conda env remove -n ldm

Then because you are on Intel, you should try modifying the command to be:

PIP_EXISTS_ACTION=w CONDA_SUBDIR=osx-64 conda env create -f environment-mac.yaml

osx-64 is the version for Intel so it should be compatible.

I'm not sure if someone has made a more complete guide for Intel Macs, but the default instructions on lstein may not work for you right now. I anticipate the nomkl package included in environment-mac.yaml may also cause problems for you, so you may want to try removing that line as well.
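
If you are unsure which architecture you actually ended up with, a quick check from inside the environment (a small, fork-agnostic sketch) is:

```
import platform

# 'x86_64' means an Intel / Rosetta build of Python, 'arm64' means Apple Silicon native.
print(platform.machine())
```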

@hawtdawg

@AntonEssenetial @hawtdawg is correct. Also, I believe your ldm environment exists and is broken due to the failed install. [...]

I got the lstein development branch to work on my Intel Mac without changing anything, with the only exception being installing Conda for Intel instead of Arm. So I think this is the only thing that needs to be done, and everything is the same from there on.

@AntonEssenetial

I'm not sure if someone has made a more complete guide for Intel Macs

That would be great 😌.

@AntonEssenetial

ldm is active, but:

python scripts/dream.py --full_precision

  • Initializing, be patient...

cuda not available, using device mps
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
Traceback (most recent call last):
File "/Users/anton/stable-diffusion/scripts/dream.py", line 685, in <module>
main()
File "/Users/anton/stable-diffusion/scripts/dream.py", line 101, in main
t2i.load_model()
File "/Users/anton/stable-diffusion/ldm/generate.py", line 426, in load_model
model = self._load_model_from_config(config, self.weights)
File "/Users/anton/stable-diffusion/ldm/generate.py", line 537, in _load_model_from_config
pl_sd = torch.load(ckpt, map_location='cpu')
File "/Users/anton/opt/miniconda3/envs/ldm/lib/python3.10/site-packages/torch/serialization.py", line 705, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/Users/anton/opt/miniconda3/envs/ldm/lib/python3.10/site-packages/torch/serialization.py", line 242, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

@corajr

corajr commented Sep 13, 2022

ldm is active, but:
python scripts/dream.py --full_precision
[...]
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Do you have the model file at models/ldm/stable-diffusion-v1/model.ckpt? Maybe the file is damaged or it's a bad symlink?

@HenkPoley

HenkPoley commented Sep 14, 2022

@Birch-san I think there is a bug in your birch-mps-waifu branch that means txt2img_fork.py picks the last sample file's filename number and overwrites it (on each start).

E.g. base_count is somehow off by one.

Edit: ah, I see. It just uses base_count = len(os.listdir(sample_path)) (and I deleted a few files 🤪) PEBCAK
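
For anyone who does hit the overwrite case after deleting files, a hedged sketch of a more robust approach is to derive the next index from the highest existing filename rather than the file count; this assumes sample files start with a numeric prefix like 00042.png, which is the txt2img convention:

```
import os
import re

def next_base_count(sample_path: str) -> int:
    # Take the largest leading number among existing files and add one, so deleting
    # earlier samples can never cause a newer one to be overwritten.
    indices = [
        int(m.group(1))
        for name in os.listdir(sample_path)
        if (m := re.match(r"(\d+)", name))
    ]
    return max(indices, default=-1) + 1
```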

@Birch-san

@HenkPoley phew! I was scared for a moment there, since someone else reported the same thing, but I think they had also deleted some files themselves.

Recent stuff I've been working on is trying to optimize attention (e.g. matmul instead of einsum, the changes from Doggettx / neonsecret, opt_einsum, cosine-similarity attention), but none of those ideas improved the speed.

The next thing I want to add is latent walks. I'm trying to do it without losing support for multi-prompt or multi-sample, so it's a bit harder than copying existing code.

I also want to look at better img2img capabilities.

@Birch-san

A fix is underway for the MPSNDArray "product of dimension sizes > 2**31" error:
pytorch/pytorch#84039 (comment)

@AntonEssenetial

ldm is active, but:
python scripts/dream.py --full_precision
[...]
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Do you have the model file at models/ldm/stable-diffusion-v1/model.ckpt? Maybe the file is damaged or it's a bad symlink?

Thx, solved, file was damaged 🤦🏻‍♂️

@AntonEssenetial

AntonEssenetial commented Sep 14, 2022

Maybe someone knows how to switch to the GPU on a MacBook Pro with an AMD Radeon Pro 5500M 4 GB; for some reason it runs on the CPU.

(ldm) ➜ stable-diffusion git:(main) python scripts/dream.py --full_precision

  • Initializing, be patient...

cuda not available, using device mps
Loading model from models/ldm/stable-diffusion-v1/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using slower but more accurate full-precision math (--full_precision)
Model loaded in 18.70s
Setting Sampler to k_lms

  • Initialization done! Awaiting your command (-h for help, 'q' to quit)
    dream> teapot
    /Users/anton/stable-diffusion/ldm/modules/embedding_manager.py:153: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/miniforge3/conda-bld/pytorch-recipe_1660136156773/work/aten/src/ATen/mps/MPSFallback.mm:11.)
    placeholder_idx = torch.where(
@Vargol

Vargol commented Sep 14, 2022

That message doesn't mean it all runs on the CPU, just some instructions. Check Activity Monitor and you'll see it using a big chunk of GPU.

@AntonEssenetial

That message doesn't mean it all runs on the CPU, just some instructions. Check Activity Monitor and you'll see it using a big chunk of GPU.

thx

@HenkPoley

Looks like PyTorch nightly 1.13.0.dev20220915 (or slightly earlier) fixes the 'leaked semaphore' problem (I might be misattributing it to PyTorch).

Or at least I haven't seen it in a while.

~/miniconda3/envs/ldm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
@Birch-san

btw, if you're running on a nightly build: beware that there's a bug with einsum() which will make cross-attention return the wrong result the first time it's invoked.
pytorch/pytorch#85224

@HenkPoley

HenkPoley commented Sep 25, 2022

@Birch-san 👀 "Speed up stable diffusion by ~50% using flash attention"

https://twitter.com/labmlai/status/1573634095732490240

..might just be a CUDA thing, the way it (doesn't) have access to large caches.

https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/diffusion/stable_diffusion/model/unet_attention.py#L192-L235

@Birch-san

@HenkPoley oh wow! I'll have a look into it, but I think the point of Flash Attention is using the hardware better; as such, I think there's only a CUDA version of it at the moment. I know there's one merged into PyTorch, but by default it isn't even built. Perhaps this requires building PyTorch from source, enabling that build flag, and relies on having CUDA? Will investigate.

@Birch-san

Birch-san commented Sep 25, 2022

Okay yeah, it uses HazyResearch's implementation, which is definitely CUDA-specific.
Maybe the Triton implementation will be more platform-agnostic, but their readme says they don't support M1.

@AkiKagura

Hello everyone! I've got both an x86_64 (Anaconda) and an arm64 (Miniforge3) Conda environment, but the arm64 one runs much slower than the x86 one. I have no idea how to speed it up on arm64, because it's as slow as running on the CPU. I am using the same code in both, on an M1 chip.

@Vargol

Vargol commented Oct 8, 2022

@AkiKagura This repository has never been changed to enable MPS, so it is running on the CPU on Apple Silicon.

Try another fork; I use this one,
https://github.com/invoke-ai/InvokeAI
as it makes an effort to keep performance high on M1/M2.

Also, don't use the PyTorch nightlies; they've totally tanked torch.einsum performance recently. Stick with the current stable release, as it supports MPS well enough for good performance from SD.
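
If you want to double-check what you are actually running before switching forks, a quick sanity check (a small sketch) is:

```
import torch

# Confirm the installed version and that the MPS backend is both built in and usable.
print(torch.__version__)
print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
```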

@AkiKagura

AkiKagura commented Oct 8, 2022

@Vargol In fact, I am using a version of SD that has been modified to add MPS support, and it can generate a picture in about 3-4 minutes in the x86 environment (though on arm64 it's much slower).
The PyTorch I am using is a nightly version, though. Thanks for reminding me to use the stable version of PyTorch; I'll try the fork you suggested.

@Vargol

Vargol commented Oct 8, 2022

@AkiKagura

For comparison, I'm on an 8GB Mac mini, and despite it still causing a little swap usage now and again I get ~4 s/it, so ~3.5 minutes for a 50-step 512x512 image, plus around 40-60 seconds for all the model loading (it really does vary a lot for some reason, probably due to the swap usage).

I understand it's significantly faster with 16GB+. An M1 Max with 64GB should take ~30 seconds per image for a 50-step 512x512 image, plus the model loading time, according to the benchmarks I've seen.

@Any-Winter-4079

Any-Winter-4079 commented Oct 8, 2022

@AkiKagura M1 Max with 64GB RAM. The time is 26-27s for 50 steps. It was 30s, but an extra optimization was added a while ago in https://github.com/invoke-ai/InvokeAI
"banana pie" -s50 -W512 -H512 -C7.5 -Ak_lms -n3

50/50 [00:26<00:00,  1.87it/s]
50/50 [00:26<00:00,  1.91it/s]
50/50 [00:26<00:00,  1.88it/s]

I suggest you use that repo if you are on M1, because a lot of people are there collaborating, and it has many features, such as inpainting, outpainting, and textual inversion, as well as the ability to generate large images (e.g. 1024x1024 and beyond) without out-of-memory problems.

Also I suggest you check out @Birch-san 's repo.

andrewkchan added a commit to andrewkchan/deforum-stable-diffusion-mps that referenced this issue Dec 5, 2022
- Change occurrences (hardcodings, default arguments) of "cuda" to accept other torch devices ("mps", "cpu")
- Auto-detect and set torch device when running on appropriate hardware
- Don't use unsupported autocast when running on MPS, and always use full-precision (float32)
- Port CrossAttention optimizations from InvokeAI stable diffusion frontend, which dramatically speed up inference
- Fix seed instability caused by torch.randn not using the global seed on MPS hardware
- Various other bugfixes from (CompVis/stable-diffusion#25)
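
As an aside on the seed-stability item above: a common workaround at the time (a hedged sketch, not necessarily the exact change in that commit) was to draw the initial noise on the CPU, where torch.manual_seed is respected, and only then move it to the MPS device:

```
import torch

torch.manual_seed(42)
noise = torch.randn(1, 4, 64, 64)  # deterministic via the CPU RNG
if torch.backends.mps.is_available():
    noise = noise.to("mps")  # same values, now on the Metal device
```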
austinbrown34 pushed a commit to cognidesign/InvokeAI that referenced this issue Dec 30, 2022
I'm using stable-diffusion on a 2022 Macbook M2 Air with 24 GB unified memory.
I see this taking about 2.0s/it.

I've moved many deps from pip to conda-forge, to take advantage of the
precompiled binaries. Some notes for Mac users, since I've seen a lot of
confusion about this:

One doesn't need the `apple` channel to run this on a Mac-- that's only
used by `tensorflow-deps`, required for running tensorflow-metal. For
that, I have an example environment.yml here:

https://developer.apple.com/forums/thread/711792?answerId=723276022#723276022

However, the `CONDA_SUBDIR=osx-arm64` environment variable *is* needed to
ensure that you do not run any Intel-specific packages such as `mkl`,
which will fail with [cryptic errors](CompVis/stable-diffusion#25 (comment))
on the ARM architecture and cause the environment to break.

I've also added a comment in the env file about 3.10 not working yet.
When it becomes possible to update, those commands run on an osx-arm64
machine should work to determine the new version set.

Here's what a successful run of dream.py should look like:

```
$ python scripts/dream.py --full_precision                                                                                                           SIGABRT(6) ↵  08:42:59
* Initializing, be patient...

Loading model from models/ldm/stable-diffusion-v1/model.ckpt
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Using slower but more accurate full-precision math (--full_precision)
>> Setting Sampler to k_lms
model loaded in 6.12s

* Initialization done! Awaiting your command (-h for help, 'q' to quit)
dream> "an astronaut riding a horse"
Generating:   0%|                                                                                                                                                                         | 0/1 [00:00<?, ?it/s]/Users/corajr/Documents/lstein/ldm/modules/embedding_manager.py:152: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1662016319283/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  placeholder_idx = torch.where(
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:37<00:00,  1.95s/it]
Generating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:38<00:00, 98.55s/it]
Usage stats:
   1 image(s) generated in 98.60s
   Max VRAM used for this generation: 0.00G
Outputs:
outputs/img-samples/000001.1525943180.png: "an astronaut riding a horse" -s50 -W512 -H512 -C7.5 -Ak_lms -F -S1525943180
```
@BeginAnAdventure

Does anyone know which options will increase the resolution to photographic quality, like you get from the Stable Diffusion website, and whether you can get more than one image at a time? These are the only options I have right now to define the output:
--n_samples 1 --n_iter 1 --plms

@Raymonf

Raymonf commented Feb 4, 2023

@BeginAnAdventure In txt2img, --W and --H:

parser.add_argument(
    "--H",
    type=int,
    default=512,
    help="image height, in pixel space",
)
parser.add_argument(
    "--W",
    type=int,
    default=512,
    help="image width, in pixel space",
)

For more than one image at a time, you might have a bad time, but the option is actually --n_samples.
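
So a higher-resolution, multi-image run would look something like python scripts/txt2img.py --prompt "..." --W 768 --H 768 --n_samples 2 --plms (untested here; larger sizes use substantially more memory, especially on MPS).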

But do consider using InvokeAI or some other UI - the scripts in this repository aren't exactly "feature complete".

@BeginAnAdventure

But do consider using InvokeAI or some other UI - the scripts in this repository aren't exactly "feature complete".

That's what I thought. I did try using this one:
https://replicate.com/cjwbw/stable-diffusion-high-resolution

But I keep getting errors, despite having no issues doing it all in Terminal. Will check out InvokeAI; just wondering if I'll need to create a whole new setup for their UI. Thanks!
