Merge branch 'main' into controlnet-test-fixes

add: torch to the pypi step. (#7328 )
[Tests] Update a deprecated parameter in test files and fix several typos (#7277 )
2024-03-15 12:30:25 +05:30 · 2024-03-15 12:28:12 +05:30 · 2024-03-14 12:17:35 -07:00 · 2024-03-14 20:51:22 +05:30 · 2024-03-14 18:40:14 +05:30 · 2024-03-14 11:52:25 +01:00
85 changed files with 603 additions and 322 deletions
@@ -65,6 +65,7 @@ jobs:
          python -m uv pip install -e [quality,test]
          python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers
          python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate
+          python -m uv pip install pytest-reportlog

      - name: Environment
        run: |
@@ -150,6 +151,7 @@ jobs:
          ${CONDA_RUN} python -m uv pip install -e [quality,test]
          ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
          ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate
+          ${CONDA_RUN} python -m uv pip install pytest-reportlog

      - name: Environment
        shell: arch -arch arm64 bash {0}
@@ -52,7 +52,7 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
-          pip install -U setuptools wheel twine
+          pip install -U setuptools wheel twine torch
      
      - name: Build the dist files
        run: python setup.py bdist_wheel && python setup.py sdist
@@ -77,7 +77,7 @@ Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggi

 ## Quickstart

-Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 19000+ checkpoints):
+Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 22000+ checkpoints):

 ```python
 from diffusers import DiffusionPipeline
@@ -219,7 +219,7 @@ Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz9
 - https://github.com/deep-floyd/IF
 - https://github.com/bentoml/BentoML
 - https://github.com/bmaltais/kohya_ss
- +8000 other amazing GitHub repositories 💪
+- +9000 other amazing GitHub repositories 💪

 Thank you for using us ❤️.

@@ -404,6 +404,10 @@
      title: EulerAncestralDiscreteScheduler
    - local: api/schedulers/euler
      title: EulerDiscreteScheduler
+    - local: api/schedulers/edm_euler
+      title: EDMEulerScheduler
+    - local: api/schedulers/edm_multistep_dpm_solver
+      title: EDMDPMSolverMultistepScheduler
    - local: api/schedulers/heun
      title: HeunDiscreteScheduler
    - local: api/schedulers/ipndm
@@ -0,0 +1,22 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# EDMEulerScheduler
+
+The Karras formulation of the Euler scheduler (Algorithm 2) from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/).
+
+
+## EDMEulerScheduler
+[[autodoc]] EDMEulerScheduler
+
+## EDMEulerSchedulerOutput
+[[autodoc]] schedulers.scheduling_edm_euler.EDMEulerSchedulerOutput
@@ -0,0 +1,24 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# EDMDPMSolverMultistepScheduler
+
+`EDMDPMSolverMultistepScheduler` is a [Karras formulation](https://huggingface.co/papers/2206.00364) of `DPMSolverMultistep`, a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu.
+
+DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality
+samples, and it can generate quite good samples even in 10 steps.
+
+## EDMDPMSolverMultistepScheduler
+[[autodoc]] EDMDPMSolverMultistepScheduler
+
+## SchedulerOutput
+[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
@@ -259,6 +259,50 @@ pip install git+https://github.com/huggingface/peft.git
 **Inference** 
 The inference is the same as if you train a regular LoRA 🤗

+## Conducting EDM-style training
+
+It's now possible to perform EDM-style training as proposed in [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364). 
+
+simply set:
+
+```diff
+  --do_edm_style_training \
+```
+
+Other SDXL-like models that use the EDM formulation, such as [playgroundai/playground-v2.5-1024px-aesthetic](https://huggingface.co/playgroundai/playground-v2.5-1024px-aesthetic), can also be DreamBooth'd with the script. Below is an example command:
+
+```bash
+accelerate launch train_dreambooth_lora_sdxl_advanced.py \
+  --pretrained_model_name_or_path="playgroundai/playground-v2.5-1024px-aesthetic"  \
+  --dataset_name="linoyts/3d_icon" \
+  --instance_prompt="3d icon in the style of TOK" \
+  --validation_prompt="a TOK icon of an astronaut riding a horse, in the style of TOK" \
+  --output_dir="3d-icon-SDXL-LoRA" \
+  --do_edm_style_training \
+  --caption_column="prompt" \
+  --mixed_precision="bf16" \
+  --resolution=1024 \
+  --train_batch_size=3 \
+  --repeats=1 \
+  --report_to="wandb"\
+  --gradient_accumulation_steps=1 \
+  --gradient_checkpointing \
+  --learning_rate=1.0 \
+  --text_encoder_lr=1.0 \
+  --optimizer="prodigy"\
+  --train_text_encoder_ti\
+  --train_text_encoder_ti_frac=0.5\
+  --lr_scheduler="constant" \
+  --lr_warmup_steps=0 \
+  --rank=8 \
+  --max_train_steps=1000 \
+  --checkpointing_steps=2000 \
+  --seed="0" \
+  --push_to_hub
+```
+
+> [!CAUTION]
+> Min-SNR gamma is not supported with the EDM-style training yet. When training with the PlaygroundAI model, it's recommended to not pass any "variant".

 ### Tips and Tricks
 Check out [these recommended practices](https://huggingface.co/blog/sdxl_lora_advanced_script#additional-good-practices)
@@ -14,9 +14,11 @@
 # See the License for the specific language governing permissions and

 import argparse
+import contextlib
 import gc
 import hashlib
 import itertools
+import json
 import logging
 import math
 import os
@@ -37,7 +39,7 @@ import transformers
 from accelerate import Accelerator
 from accelerate.logging import get_logger
 from accelerate.utils import DistributedDataParallelKwargs, ProjectConfiguration, set_seed
-from huggingface_hub import create_repo, upload_folder
+from huggingface_hub import create_repo, hf_hub_download, upload_folder
 from packaging import version
 from peft import LoraConfig, set_peft_model_state_dict
 from peft.utils import get_peft_model_state_dict
@@ -55,6 +57,8 @@ from diffusers import (
    AutoencoderKL,
    DDPMScheduler,
    DPMSolverMultistepScheduler,
+    EDMEulerScheduler,
+    EulerDiscreteScheduler,
    StableDiffusionXLPipeline,
    UNet2DConditionModel,
 )
@@ -79,6 +83,20 @@ check_min_version("0.27.0.dev0")
 logger = get_logger(__name__)


+def determine_scheduler_type(pretrained_model_name_or_path, revision):
+    model_index_filename = "model_index.json"
+    if os.path.isdir(pretrained_model_name_or_path):
+        model_index = os.path.join(pretrained_model_name_or_path, model_index_filename)
+    else:
+        model_index = hf_hub_download(
+            repo_id=pretrained_model_name_or_path, filename=model_index_filename, revision=revision
+        )
+
+    with open(model_index, "r") as f:
+        scheduler_type = json.load(f)["scheduler"][1]
+    return scheduler_type
+
+
 def save_model_card(
    repo_id: str,
    use_dora: bool,
@@ -370,6 +388,11 @@ def parse_args(input_args=None):
            " `args.validation_prompt` multiple times: `args.num_validation_images`."
        ),
    )
+    parser.add_argument(
+        "--do_edm_style_training",
+        action="store_true",
+        help="Flag to conduct training using the EDM formulation as introduced in https://arxiv.org/abs/2206.00364.",
+    )
    parser.add_argument(
        "--with_prior_preservation",
        default=False,
@@ -1117,6 +1140,8 @@ def main(args):
            "You cannot use both --report_to=wandb and --hub_token due to a security risk of exposing your token."
            " Please use `huggingface-cli login` to authenticate with the Hub."
        )
+    if args.do_edm_style_training and args.snr_gamma is not None:
+        raise ValueError("Min-SNR formulation is not supported when conducting EDM-style training.")

    logging_dir = Path(args.output_dir, args.logging_dir)

@@ -1234,7 +1259,19 @@ def main(args):
    )

    # Load scheduler and models
-    noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
+    scheduler_type = determine_scheduler_type(args.pretrained_model_name_or_path, args.revision)
+    if "EDM" in scheduler_type:
+        args.do_edm_style_training = True
+        noise_scheduler = EDMEulerScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
+        logger.info("Performing EDM-style training!")
+    elif args.do_edm_style_training:
+        noise_scheduler = EulerDiscreteScheduler.from_pretrained(
+            args.pretrained_model_name_or_path, subfolder="scheduler"
+        )
+        logger.info("Performing EDM-style training!")
+    else:
+        noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
+
    text_encoder_one = text_encoder_cls_one.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision, variant=args.variant
    )
@@ -1252,7 +1289,12 @@ def main(args):
        revision=args.revision,
        variant=args.variant,
    )
-    vae_scaling_factor = vae.config.scaling_factor
+    latents_mean = latents_std = None
+    if hasattr(vae.config, "latents_mean") and vae.config.latents_mean is not None:
+        latents_mean = torch.tensor(vae.config.latents_mean).view(1, 4, 1, 1)
+    if hasattr(vae.config, "latents_std") and vae.config.latents_std is not None:
+        latents_std = torch.tensor(vae.config.latents_std).view(1, 4, 1, 1)
+
    unet = UNet2DConditionModel.from_pretrained(
        args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision, variant=args.variant
    )
@@ -1790,6 +1832,19 @@ def main(args):
        disable=not accelerator.is_local_main_process,
    )

+    def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
+        # TODO: revisit other sampling algorithms
+        sigmas = noise_scheduler.sigmas.to(device=accelerator.device, dtype=dtype)
+        schedule_timesteps = noise_scheduler.timesteps.to(accelerator.device)
+        timesteps = timesteps.to(accelerator.device)
+
+        step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
+
+        sigma = sigmas[step_indices].flatten()
+        while len(sigma.shape) < n_dim:
+            sigma = sigma.unsqueeze(-1)
+        return sigma
+
    if args.train_text_encoder:
        num_train_epochs_text_encoder = int(args.train_text_encoder_frac * args.num_train_epochs)
    elif args.train_text_encoder_ti:  # args.train_text_encoder_ti
@@ -1841,9 +1896,15 @@ def main(args):
                    pixel_values = batch["pixel_values"].to(dtype=vae.dtype)
                    model_input = vae.encode(pixel_values).latent_dist.sample()

-                model_input = model_input * vae_scaling_factor
-                if args.pretrained_vae_model_name_or_path is None:
-                    model_input = model_input.to(weight_dtype)
+                if latents_mean is None and latents_std is None:
+                    model_input = model_input * vae.config.scaling_factor
+                    if args.pretrained_vae_model_name_or_path is None:
+                        model_input = model_input.to(weight_dtype)
+                else:
+                    latents_mean = latents_mean.to(device=model_input.device, dtype=model_input.dtype)
+                    latents_std = latents_std.to(device=model_input.device, dtype=model_input.dtype)
+                    model_input = (model_input - latents_mean) * vae.config.scaling_factor / latents_std
+                    model_input = model_input.to(dtype=weight_dtype)

                # Sample noise that we'll add to the latents
                noise = torch.randn_like(model_input)
@@ -1854,15 +1915,32 @@ def main(args):
                    )

                bsz = model_input.shape[0]
+
                # Sample a random timestep for each image
-                timesteps = torch.randint(
-                    0, noise_scheduler.config.num_train_timesteps, (bsz,), device=model_input.device
-                )
-                timesteps = timesteps.long()
+                if not args.do_edm_style_training:
+                    timesteps = torch.randint(
+                        0, noise_scheduler.config.num_train_timesteps, (bsz,), device=model_input.device
+                    )
+                    timesteps = timesteps.long()
+                else:
+                    # in EDM formulation, the model is conditioned on the pre-conditioned noise levels
+                    # instead of discrete timesteps, so here we sample indices to get the noise levels
+                    # from `scheduler.timesteps`
+                    indices = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,))
+                    timesteps = noise_scheduler.timesteps[indices].to(device=model_input.device)

                # Add noise to the model input according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_model_input = noise_scheduler.add_noise(model_input, noise, timesteps)
+                # For EDM-style training, we first obtain the sigmas based on the continuous timesteps.
+                # We then precondition the final model inputs based on these sigmas instead of the timesteps.
+                # Follow: Section 5 of https://arxiv.org/abs/2206.00364.
+                if args.do_edm_style_training:
+                    sigmas = get_sigmas(timesteps, len(noisy_model_input.shape), noisy_model_input.dtype)
+                    if "EDM" in scheduler_type:
+                        inp_noisy_latents = noise_scheduler.precondition_inputs(noisy_model_input, sigmas)
+                    else:
+                        inp_noisy_latents = noisy_model_input / ((sigmas**2 + 1) ** 0.5)

                # time ids
                add_time_ids = torch.cat(
@@ -1888,7 +1966,7 @@ def main(args):
                    }
                    prompt_embeds_input = prompt_embeds.repeat(elems_to_repeat_text_embeds, 1, 1)
                    model_pred = unet(
-                        noisy_model_input,
+                        inp_noisy_latents if args.do_edm_style_training else noisy_model_input,
                        timesteps,
                        prompt_embeds_input,
                        added_cond_kwargs=unet_added_conditions,
@@ -1906,14 +1984,42 @@ def main(args):
                    )
                    prompt_embeds_input = prompt_embeds.repeat(elems_to_repeat_text_embeds, 1, 1)
                    model_pred = unet(
-                        noisy_model_input, timesteps, prompt_embeds_input, added_cond_kwargs=unet_added_conditions
+                        inp_noisy_latents if args.do_edm_style_training else noisy_model_input,
+                        timesteps,
+                        prompt_embeds_input,
+                        added_cond_kwargs=unet_added_conditions,
                    ).sample

+                weighting = None
+                if args.do_edm_style_training:
+                    # Similar to the input preconditioning, the model predictions are also preconditioned
+                    # on noised model inputs (before preconditioning) and the sigmas.
+                    # Follow: Section 5 of https://arxiv.org/abs/2206.00364.
+                    if "EDM" in scheduler_type:
+                        model_pred = noise_scheduler.precondition_outputs(noisy_model_input, model_pred, sigmas)
+                    else:
+                        if noise_scheduler.config.prediction_type == "epsilon":
+                            model_pred = model_pred * (-sigmas) + noisy_model_input
+                        elif noise_scheduler.config.prediction_type == "v_prediction":
+                            model_pred = model_pred * (-sigmas / (sigmas**2 + 1) ** 0.5) + (
+                                noisy_model_input / (sigmas**2 + 1)
+                            )
+                    # We are not doing weighting here because it tends result in numerical problems.
+                    # See: https://github.com/huggingface/diffusers/pull/7126#issuecomment-1968523051
+                    # There might be other alternatives for weighting as well:
+                    # https://github.com/huggingface/diffusers/pull/7126#discussion_r1505404686
+                    if "EDM" not in scheduler_type:
+                        weighting = (sigmas**-2.0).float()
+
                # Get the target for loss depending on the prediction type
                if noise_scheduler.config.prediction_type == "epsilon":
-                    target = noise
+                    target = model_input if args.do_edm_style_training else noise
                elif noise_scheduler.config.prediction_type == "v_prediction":
-                    target = noise_scheduler.get_velocity(model_input, noise, timesteps)
+                    target = (
+                        model_input
+                        if args.do_edm_style_training
+                        else noise_scheduler.get_velocity(model_input, noise, timesteps)
+                    )
                else:
                    raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")

@@ -1923,10 +2029,28 @@ def main(args):
                    target, target_prior = torch.chunk(target, 2, dim=0)

                    # Compute prior loss
-                    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")
+                    if weighting is not None:
+                        prior_loss = torch.mean(
+                            (weighting.float() * (model_pred_prior.float() - target_prior.float()) ** 2).reshape(
+                                target_prior.shape[0], -1
+                            ),
+                            1,
+                        )
+                        prior_loss = prior_loss.mean()
+                    else:
+                        prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

                if args.snr_gamma is None:
-                    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
+                    if weighting is not None:
+                        loss = torch.mean(
+                            (weighting.float() * (model_pred.float() - target.float()) ** 2).reshape(
+                                target.shape[0], -1
+                            ),
+                            1,
+                        )
+                        loss = loss.mean()
+                    else:
+                        loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
                else:
                    # Compute loss-weights as per Section 3.4 of https://arxiv.org/abs/2303.09556.
                    # Since we predict the noise instead of x_0, the original formulation is slightly changed.
@@ -2049,17 +2173,18 @@ def main(args):
                # We train on the simplified learning objective. If we were previously predicting a variance, we need the scheduler to ignore it
                scheduler_args = {}

-                if "variance_type" in pipeline.scheduler.config:
-                    variance_type = pipeline.scheduler.config.variance_type
+                if not args.do_edm_style_training:
+                    if "variance_type" in pipeline.scheduler.config:
+                        variance_type = pipeline.scheduler.config.variance_type

-                    if variance_type in ["learned", "learned_range"]:
-                        variance_type = "fixed_small"
+                        if variance_type in ["learned", "learned_range"]:
+                            variance_type = "fixed_small"

-                    scheduler_args["variance_type"] = variance_type
+                        scheduler_args["variance_type"] = variance_type

-                pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
-                    pipeline.scheduler.config, **scheduler_args
-                )
+                    pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
+                        pipeline.scheduler.config, **scheduler_args
+                    )

                pipeline = pipeline.to(accelerator.device)
                pipeline.set_progress_bar_config(disable=True)
@@ -2067,8 +2192,13 @@ def main(args):
                # run inference
                generator = torch.Generator(device=accelerator.device).manual_seed(args.seed) if args.seed else None
                pipeline_args = {"prompt": args.validation_prompt}
+                inference_ctx = (
+                    contextlib.nullcontext()
+                    if "playground" in args.pretrained_model_name_or_path
+                    else torch.cuda.amp.autocast()
+                )

-                with torch.cuda.amp.autocast():
+                with inference_ctx:
                    images = [
                        pipeline(**pipeline_args, generator=generator).images[0]
                        for _ in range(args.num_validation_images)
@@ -2144,15 +2274,18 @@ def main(args):
            # We train on the simplified learning objective. If we were previously predicting a variance, we need the scheduler to ignore it
            scheduler_args = {}

-            if "variance_type" in pipeline.scheduler.config:
-                variance_type = pipeline.scheduler.config.variance_type
+            if not args.do_edm_style_training:
+                if "variance_type" in pipeline.scheduler.config:
+                    variance_type = pipeline.scheduler.config.variance_type

-                if variance_type in ["learned", "learned_range"]:
-                    variance_type = "fixed_small"
+                    if variance_type in ["learned", "learned_range"]:
+                        variance_type = "fixed_small"

-                scheduler_args["variance_type"] = variance_type
+                    scheduler_args["variance_type"] = variance_type

-            pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, **scheduler_args)
+                pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
+                    pipeline.scheduler.config, **scheduler_args
+                )

            # load attention processors
            pipeline.load_lora_weights(args.output_dir)
@@ -637,7 +637,7 @@ def main(args):
                    generator=generator,
                    batch_size=args.eval_batch_size,
                    num_inference_steps=args.ddpm_num_inference_steps,
-                    output_type="numpy",
+                    output_type="np",
                ).images

                if args.use_ema:
@@ -425,6 +425,11 @@ def parse_args(input_args=None):
        default=4,
        help=("The dimension of the LoRA update matrices."),
    )
+    parser.add_argument(
+        "--debug_loss",
+        action="store_true",
+        help="debug loss for each image, if filenames are awailable in the dataset",
+    )

    if input_args is not None:
        args = parser.parse_args(input_args)
@@ -603,6 +608,7 @@ def main(args):
    # Move unet, vae and text_encoder to device and cast to weight_dtype
    # The VAE is in float32 to avoid NaN losses.
    unet.to(accelerator.device, dtype=weight_dtype)
+
    if args.pretrained_vae_model_name_or_path is None:
        vae.to(accelerator.device, dtype=torch.float32)
    else:
@@ -890,13 +896,17 @@ def main(args):
        tokens_one, tokens_two = tokenize_captions(examples)
        examples["input_ids_one"] = tokens_one
        examples["input_ids_two"] = tokens_two
+        if args.debug_loss:
+            fnames = [os.path.basename(image.filename) for image in examples[image_column] if image.filename]
+            if fnames:
+                examples["filenames"] = fnames
        return examples

    with accelerator.main_process_first():
        if args.max_train_samples is not None:
            dataset["train"] = dataset["train"].shuffle(seed=args.seed).select(range(args.max_train_samples))
        # Set the training transforms
-        train_dataset = dataset["train"].with_transform(preprocess_train)
+        train_dataset = dataset["train"].with_transform(preprocess_train, output_all_columns=True)

    def collate_fn(examples):
        pixel_values = torch.stack([example["pixel_values"] for example in examples])
@@ -905,7 +915,7 @@ def main(args):
        crop_top_lefts = [example["crop_top_lefts"] for example in examples]
        input_ids_one = torch.stack([example["input_ids_one"] for example in examples])
        input_ids_two = torch.stack([example["input_ids_two"] for example in examples])
-        return {
+        result = {
            "pixel_values": pixel_values,
            "input_ids_one": input_ids_one,
            "input_ids_two": input_ids_two,
@@ -913,6 +923,11 @@ def main(args):
            "crop_top_lefts": crop_top_lefts,
        }

+        filenames = [example["filenames"] for example in examples if "filenames" in example]
+        if filenames:
+            result["filenames"] = filenames
+        return result
+
    # DataLoaders creation:
    train_dataloader = torch.utils.data.DataLoader(
        train_dataset,
@@ -1105,7 +1120,9 @@ def main(args):
                    loss = F.mse_loss(model_pred.float(), target.float(), reduction="none")
                    loss = loss.mean(dim=list(range(1, len(loss.shape)))) * mse_loss_weights
                    loss = loss.mean()
-
+                if args.debug_loss and "filenames" in batch:
+                    for fname in batch["filenames"]:
+                        accelerator.log({"loss_for_" + fname: loss}, step=global_step)
                # Gather the losses across all processes for logging (if we use distributed training).
                avg_loss = accelerator.gather(loss.repeat(args.train_batch_size)).mean()
                train_loss += avg_loss.item() / args.gradient_accumulation_steps
@@ -648,7 +648,7 @@ def main(args):
                    generator=generator,
                    batch_size=args.eval_batch_size,
                    num_inference_steps=args.ddpm_num_inference_steps,
-                    output_type="numpy",
+                    output_type="np",
                ).images

                if args.use_ema:
@@ -454,7 +454,8 @@ def set_image_size(pipeline_class_name, original_config, checkpoint, image_size=
    model_type = infer_model_type(original_config, checkpoint, model_type)

    if pipeline_class_name == "StableDiffusionUpscalePipeline":
-        return 512
+        image_size = original_config["model"]["params"]["unet_config"]["params"]["image_size"]
+        return image_size

    elif model_type in ["SDXL", "SDXL-Refiner", "Playground"]:
        image_size = 1024
@@ -293,7 +293,7 @@ class BasicTransformerBlock(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        # Notice that normalization is always applied before the real computation in the following blocks.
        # 0. Self-Attention
@@ -308,7 +308,7 @@ class Transformer2DModel(ModelMixin, ConfigMixin):
        """
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")
        # ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
        #   we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
        #   we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
@@ -846,7 +846,7 @@ class UNetMidBlock2DCrossAttn(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        hidden_states = self.resnets[0](hidden_states, temb)
        for attn, resnet in zip(self.attentions, self.resnets[1:]):
@@ -986,7 +986,7 @@ class UNetMidBlock2DSimpleCrossAttn(nn.Module):
    ) -> torch.FloatTensor:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        if attention_mask is None:
            # if encoder_hidden_states is defined: we are doing cross-attn, so we should use cross-attn mask.
@@ -1116,7 +1116,7 @@ class AttnDownBlock2D(nn.Module):
    ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        output_states = ()

@@ -1241,7 +1241,7 @@ class CrossAttnDownBlock2D(nn.Module):
    ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        output_states = ()

@@ -1986,7 +1986,7 @@ class SimpleCrossAttnDownBlock2D(nn.Module):
    ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        output_states = ()

@@ -2201,7 +2201,7 @@ class KCrossAttnDownBlock2D(nn.Module):
    ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        output_states = ()

@@ -2483,7 +2483,7 @@ class CrossAttnUpBlock2D(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        is_freeu_enabled = (
            getattr(self, "s1", None)
@@ -3312,7 +3312,7 @@ class SimpleCrossAttnUpBlock2D(nn.Module):
    ) -> torch.FloatTensor:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        if attention_mask is None:
            # if encoder_hidden_states is defined: we are doing cross-attn, so we should use cross-attn mask.
@@ -3694,7 +3694,7 @@ class KAttentionBlock(nn.Module):
    ) -> torch.FloatTensor:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        # 1. Self-Attention
        if self.add_self_attention:
@@ -1183,7 +1183,7 @@ class CrossAttnDownBlockMotion(nn.Module):
    ):
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        output_states = ()

@@ -1367,7 +1367,7 @@ class CrossAttnUpBlockMotion(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        is_freeu_enabled = (
            getattr(self, "s1", None)
@@ -1707,7 +1707,7 @@ class UNetMidBlockCrossAttnMotion(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        hidden_states = self.resnets[0](hidden_states, temb)

@@ -127,7 +127,7 @@ class AmusedImg2ImgPipeline(DiffusionPipeline):
                on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
                process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
                essentially ignores `image`.
-            num_inference_steps (`int`, *optional*, defaults to 16):
+            num_inference_steps (`int`, *optional*, defaults to 12):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            guidance_scale (`float`, *optional*, defaults to 10.0):
@@ -191,7 +191,7 @@ class AmusedImg2ImgPipeline(DiffusionPipeline):
            negative_prompt_embeds is None and negative_encoder_hidden_states is not None
        ):
            raise ValueError(
-                "pass either both `negatve_prompt_embeds` and `negative_encoder_hidden_states` or neither"
+                "pass either both `negative_prompt_embeds` and `negative_encoder_hidden_states` or neither"
            )

        if (prompt is None and prompt_embeds is None) or (prompt is not None and prompt_embeds is not None):
@@ -824,20 +824,22 @@ class StableDiffusionControlNetPipeline(
        return latents

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -869,20 +869,22 @@ class StableDiffusionXLControlNetPipeline(
            self.vae.decoder.mid_block.to(dtype)

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -133,7 +133,7 @@ class SpectrogramDiffusionPipeline(DiffusionPipeline):
        generator: Optional[torch.Generator] = None,
        num_inference_steps: int = 100,
        return_dict: bool = True,
-        output_type: str = "numpy",
+        output_type: str = "np",
        callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
        callback_steps: int = 1,
    ) -> Union[AudioPipelineOutput, Tuple]:
@@ -157,7 +157,7 @@ class SpectrogramDiffusionPipeline(DiffusionPipeline):
                expense of slower inference.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.AudioPipelineOutput`] instead of a plain tuple.
-            output_type (`str`, *optional*, defaults to `"numpy"`):
+            output_type (`str`, *optional*, defaults to `"np"`):
                The output format of the generated audio.
            callback (`Callable`, *optional*):
                A function that calls every `callback_steps` steps during inference. The function is called with the
@@ -249,16 +249,16 @@ class SpectrogramDiffusionPipeline(DiffusionPipeline):

            logger.info("Generated segment", i)

-        if output_type == "numpy" and not is_onnx_available():
+        if output_type == "np" and not is_onnx_available():
            raise ValueError(
                "Cannot return output in 'np' format if ONNX is not available. Make sure to have ONNX installed or set 'output_type' to 'mel'."
            )
-        elif output_type == "numpy" and self.melgan is None:
+        elif output_type == "np" and self.melgan is None:
            raise ValueError(
                "Cannot return output in 'np' format if melgan component is not defined. Make sure to define `self.melgan` or set 'output_type' to 'mel'."
            )

-        if output_type == "numpy":
+        if output_type == "np":
            output = self.melgan(input_features=full_pred_mel.astype(np.float32))
        else:
            output = full_pred_mel
@@ -2004,7 +2004,7 @@ class CrossAttnUpBlockFlat(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        is_freeu_enabled = (
            getattr(self, "s1", None)
@@ -2338,7 +2338,7 @@ class UNetMidBlockFlatCrossAttn(nn.Module):
    ) -> torch.FloatTensor:
        if cross_attention_kwargs is not None:
            if cross_attention_kwargs.get("scale", None) is not None:
-                logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+                logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        hidden_states = self.resnets[0](hidden_states, temb)
        for attn, resnet in zip(self.attentions, self.resnets[1:]):
@@ -2479,7 +2479,7 @@ class UNetMidBlockFlatSimpleCrossAttn(nn.Module):
    ) -> torch.FloatTensor:
        cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {}
        if cross_attention_kwargs.get("scale", None) is not None:
-            logger.warning("Passing `scale` to `cross_attention_kwargs` is depcrecated. `scale` will be ignored.")
+            logger.warning("Passing `scale` to `cross_attention_kwargs` is deprecated. `scale` will be ignored.")

        if attention_mask is None:
            # if encoder_hidden_states is defined: we are doing cross-attn, so we should use cross-attn mask.
@@ -548,20 +548,22 @@ class LatentConsistencyModelImg2ImgPipeline(
        return latents

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -490,20 +490,22 @@ class LatentConsistencyModelPipeline(
        latents = latents * self.scheduler.init_noise_sigma
        return latents

-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -713,20 +713,22 @@ class LEditsPPPipelineStableDiffusionXL(
            self.vae.decoder.mid_block.to(dtype)

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -669,20 +669,22 @@ class StableDiffusionPipeline(
        return latents

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -758,7 +758,6 @@ class StableDiffusionImg2ImgPipeline(
            init_latents = torch.cat([init_latents], dim=0)

        shape = init_latents.shape
-        print(shape)
        noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)

        # get latents
@@ -768,20 +767,22 @@ class StableDiffusionImg2ImgPipeline(
        return latents

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -909,20 +909,22 @@ class StableDiffusionInpaintPipeline(
        return timesteps, num_inference_steps - t_start

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -1304,7 +1304,7 @@ class StableDiffusionDiffEditPipeline(
        callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
        callback_steps: int = 1,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
-        clip_ckip: int = None,
+        clip_skip: int = None,
    ):
        r"""
        The call function to the pipeline for generation.
@@ -1426,7 +1426,7 @@ class StableDiffusionDiffEditPipeline(
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            lora_scale=text_encoder_lora_scale,
-            clip_skip=clip_ckip,
+            clip_skip=clip_skip,
        )
        # For classifier free guidance, we need to do two forward passes.
        # Here we concatenate the unconditional and text embeddings into a single batch
@@ -644,20 +644,22 @@ class StableDiffusionLDM3DPipeline(
        return latents

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -632,7 +632,7 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, StableDiffusionMixin, Textua
        # corresponds to doing no classifier free guidance.
        do_classifier_free_guidance = guidance_scale > 1.0
        # and `sag_scale` is` `s` of equation (16)
-        # of the self-attentnion guidance paper: https://arxiv.org/pdf/2210.00939.pdf
+        # of the self-attention guidance paper: https://arxiv.org/pdf/2210.00939.pdf
        # `sag_scale = 0` means no self-attention guidance
        do_self_attention_guidance = sag_scale > 0.0

@@ -667,7 +667,7 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, StableDiffusionMixin, Textua

        if timesteps.dtype not in [torch.int16, torch.int32, torch.int64]:
            raise ValueError(
-                f"{self.__class__.__name__} does not support using a scheduler of type {self.scheduler.__class__.__name__}. Please make sure to use one of 'DDIMScheduler, PNDMScheduler, DDPMScheduler, DEISMultistepScheduler, UniPCMultistepScheduler, DPMSolverMultistepScheduler, DPMSolverSinlgestepScheduler'."
+                f"{self.__class__.__name__} does not support using a scheduler of type {self.scheduler.__class__.__name__}. Please make sure to use one of 'DDIMScheduler, PNDMScheduler, DDPMScheduler, DEISMultistepScheduler, UniPCMultistepScheduler, DPMSolverMultistepScheduler, DPMSolverSinglestepScheduler'."
            )

        # 5. Prepare latent variables
@@ -723,7 +723,7 @@ class StableDiffusionSAGPipeline(DiffusionPipeline, StableDiffusionMixin, Textua
                        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

-                    # perform self-attention guidance with the stored self-attentnion map
+                    # perform self-attention guidance with the stored self-attention map
                    if do_self_attention_guidance:
                        # classifier-free guidance produces two chunks of attention map
                        # and we only use unconditional one according to equation (25)
@@ -740,20 +740,22 @@ class StableDiffusionXLPipeline(
            self.vae.decoder.mid_block.to(dtype)

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -874,20 +874,22 @@ class StableDiffusionXLImg2ImgPipeline(
            self.vae.decoder.mid_block.to(dtype)

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -1110,20 +1110,22 @@ class StableDiffusionXLInpaintPipeline(
            self.vae.decoder.mid_block.to(dtype)

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -613,20 +613,22 @@ class StableDiffusionAdapterPipeline(DiffusionPipeline, StableDiffusionMixin):
        return height, width

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -784,20 +784,22 @@ class StableDiffusionXLAdapterPipeline(
        return height, width

    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32):
+    def get_guidance_scale_embedding(
+        self, w: torch.Tensor, embedding_dim: int = 512, dtype: torch.dtype = torch.float32
+    ) -> torch.FloatTensor:
        """
        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298

        Args:
-            timesteps (`torch.Tensor`):
-                generate embedding vectors at these timesteps
+            w (`torch.Tensor`):
+                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
            embedding_dim (`int`, *optional*, defaults to 512):
-                dimension of the embeddings to generate
-            dtype:
-                data type of the generated embeddings
+                Dimension of the embeddings to generate.
+            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
+                Data type of the generated embeddings.

        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
+            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
        """
        assert len(w.shape) == 1
        w = w * 1000.0
@@ -575,8 +575,8 @@ class TextToVideoZeroPipeline(DiffusionPipeline, StableDiffusionMixin, TextualIn
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor is generated by sampling using the supplied random `generator`.
-            output_type (`str`, *optional*, defaults to `"numpy"`):
-                The output format of the generated video. Choose between `"latent"` and `"numpy"`.
+            output_type (`str`, *optional*, defaults to `"np"`):
+                The output format of the generated video. Choose between `"latent"` and `"np"`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a
                [`~pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput`] instead of
@@ -223,6 +223,8 @@ class DPMSolverSinglestepScheduler(SchedulerMixin, ConfigMixin):
        """
        steps = num_inference_steps
        order = self.config.solver_order
+        if order > 3:
+            raise ValueError("Order > 3 is not supported by this scheduler")
        if self.config.lower_order_final:
            if order == 3:
                if steps % 3 == 0:
@@ -829,7 +829,7 @@ class AutoencoderKLIntegrationTests(unittest.TestCase):
            "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors",
        )

-        assert vae_default.config.scaling_factor == 0.18125
+        assert vae_default.config.scaling_factor == 0.18215
        assert vae_default.config.sample_size == 512
        assert vae_default.dtype == torch.float32

@@ -50,9 +50,7 @@ class StableCascadeUNetModelSlowTests(unittest.TestCase):
        gc.collect()
        torch.cuda.empty_cache()

-        unet = StableCascadeUNet.from_pretrained(
-            "stabilityai/stable-cascade-prior", subfolder="prior", revision="refs/pr/2", variant="bf16"
-        )
+        unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior", variant="bf16")
        unet_config = unet.config
        del unet
        gc.collect()
@@ -74,9 +72,7 @@ class StableCascadeUNetModelSlowTests(unittest.TestCase):
        gc.collect()
        torch.cuda.empty_cache()

-        unet = StableCascadeUNet.from_pretrained(
-            "stabilityai/stable-cascade", subfolder="decoder", revision="refs/pr/44", variant="bf16"
-        )
+        unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder", variant="bf16")
        unet_config = unet.config
        del unet
        gc.collect()
@@ -211,7 +211,7 @@ class ControlNetPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": image,
        }

@@ -402,7 +402,7 @@ class StableDiffusionMultiControlNetPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": images,
        }

@@ -602,7 +602,7 @@ class StableDiffusionMultiControlNetOneModelPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": images,
        }

@@ -1092,6 +1092,13 @@ class ControlNetPipelineSlowTests(unittest.TestCase):
        for param_name, param_value in single_file_pipe.controlnet.config.items():
            if param_name in PARAMS_TO_IGNORE:
                continue
+
+            # This parameter doesn't appear to be loaded from the config.
+            # So when it is registered to config, it remains a tuple as this is the default in the class definition
+            # from_pretrained, does load from config and converts to a list when registering to config
+            if param_name == "conditioning_embedding_out_channels" and isinstance(param_value, tuple):
+                param_value = list(param_value)
+
            assert (
                pipe.controlnet.config[param_name] == param_value
            ), f"{param_name} differs between single file loading and pretrained loading"
@@ -164,7 +164,7 @@ class ControlNetImg2ImgPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": image,
            "control_image": control_image,
        }
@@ -313,7 +313,7 @@ class StableDiffusionMultiControlNetPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": image,
            "control_image": control_image,
        }
@@ -155,7 +155,7 @@ class ControlNetInpaintPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": image,
            "mask_image": mask_image,
            "control_image": control_image,
@@ -375,7 +375,7 @@ class MultiControlNetInpaintPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": image,
            "mask_image": mask_image,
            "control_image": control_image,
@@ -172,7 +172,7 @@ class ControlNetPipelineSDXLFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": init_image,
            "mask_image": mask_image,
            "control_image": control_image,
@@ -1002,6 +1002,11 @@ class ControlNetSDXLPipelineSlowTests(unittest.TestCase):
        for param_name, param_value in single_file_pipe.unet.config.items():
            if param_name in PARAMS_TO_IGNORE:
                continue
+
+            # Upcast attention might be set to None in a config file, which is incorrect. It should default to False in the model
+            if param_name == "upcast_attention" and pipe.unet.config[param_name] is None:
+                pipe.unet.config[param_name] = False
+
            assert (
                pipe.unet.config[param_name] == param_value
            ), f"{param_name} differs between single file loading and pretrained loading"
@@ -163,7 +163,7 @@ class ControlNetPipelineSDXLImg2ImgFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "image": image,
            "control_image": image,
        }
@@ -63,7 +63,7 @@ class DDIMPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            "batch_size": 1,
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -113,7 +113,7 @@ class DDIMPipelineIntegrationTests(unittest.TestCase):
        ddim.set_progress_bar_config(disable=None)

        generator = torch.manual_seed(0)
-        image = ddim(generator=generator, eta=0.0, output_type="numpy").images
+        image = ddim(generator=generator, eta=0.0, output_type="np").images

        image_slice = image[0, -3:, -3:, -1]

@@ -133,7 +133,7 @@ class DDIMPipelineIntegrationTests(unittest.TestCase):
        ddpm.set_progress_bar_config(disable=None)

        generator = torch.manual_seed(0)
-        image = ddpm(generator=generator, output_type="numpy").images
+        image = ddpm(generator=generator, output_type="np").images

        image_slice = image[0, -3:, -3:, -1]

@@ -50,10 +50,10 @@ class DDPMPipelineFastTests(unittest.TestCase):
        ddpm.set_progress_bar_config(disable=None)

        generator = torch.Generator(device=device).manual_seed(0)
-        image = ddpm(generator=generator, num_inference_steps=2, output_type="numpy").images
+        image = ddpm(generator=generator, num_inference_steps=2, output_type="np").images

        generator = torch.Generator(device=device).manual_seed(0)
-        image_from_tuple = ddpm(generator=generator, num_inference_steps=2, output_type="numpy", return_dict=False)[0]
+        image_from_tuple = ddpm(generator=generator, num_inference_steps=2, output_type="np", return_dict=False)[0]

        image_slice = image[0, -3:, -3:, -1]
        image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
@@ -75,10 +75,10 @@ class DDPMPipelineFastTests(unittest.TestCase):
        ddpm.set_progress_bar_config(disable=None)

        generator = torch.manual_seed(0)
-        image = ddpm(generator=generator, num_inference_steps=2, output_type="numpy").images
+        image = ddpm(generator=generator, num_inference_steps=2, output_type="np").images

        generator = torch.manual_seed(0)
-        image_eps = ddpm(generator=generator, num_inference_steps=2, output_type="numpy")[0]
+        image_eps = ddpm(generator=generator, num_inference_steps=2, output_type="np")[0]

        image_slice = image[0, -3:, -3:, -1]
        image_eps_slice = image_eps[0, -3:, -3:, -1]
@@ -102,7 +102,7 @@ class DDPMPipelineIntegrationTests(unittest.TestCase):
        ddpm.set_progress_bar_config(disable=None)

        generator = torch.manual_seed(0)
-        image = ddpm(generator=generator, output_type="numpy").images
+        image = ddpm(generator=generator, output_type="np").images

        image_slice = image[0, -3:, -3:, -1]

@@ -50,7 +50,7 @@ class IFPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, unittest.T
            "prompt": "A painting of a squirrel eating a burger",
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -55,7 +55,7 @@ class IFImg2ImgPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin, uni
            "image": image,
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -57,7 +57,7 @@ class IFImg2ImgSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipelineT
            "original_image": original_image,
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -57,7 +57,7 @@ class IFInpaintingPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMixin,
            "mask_image": mask_image,
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -59,7 +59,7 @@ class IFInpaintingSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipeli
            "mask_image": mask_image,
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -52,7 +52,7 @@ class IFSuperResolutionPipelineFastTests(PipelineTesterMixin, IFPipelineTesterMi
            "image": image,
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -74,7 +74,7 @@ class DiTPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            "class_labels": [1],
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -113,7 +113,7 @@ class LDMTextToImagePipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -153,7 +153,7 @@ class LDMTextToImagePipelineSlowTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -189,7 +189,7 @@ class LDMTextToImagePipelineNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 50,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -84,7 +84,7 @@ class LDMSuperResolutionPipelineFastTests(unittest.TestCase):
        init_image = self.dummy_image.to(device)

        generator = torch.Generator(device=device).manual_seed(0)
-        image = ldm(image=init_image, generator=generator, num_inference_steps=2, output_type="numpy").images
+        image = ldm(image=init_image, generator=generator, num_inference_steps=2, output_type="np").images

        image_slice = image[0, -3:, -3:, -1]

@@ -109,7 +109,7 @@ class LDMSuperResolutionPipelineFastTests(unittest.TestCase):

        init_image = self.dummy_image.to(torch_device)

-        image = ldm(init_image, num_inference_steps=2, output_type="numpy").images
+        image = ldm(init_image, num_inference_steps=2, output_type="np").images

        assert image.shape == (1, 64, 64, 3)

@@ -128,7 +128,7 @@ class LDMSuperResolutionPipelineIntegrationTests(unittest.TestCase):
        ldm.set_progress_bar_config(disable=None)

        generator = torch.manual_seed(0)
-        image = ldm(image=init_image, generator=generator, num_inference_steps=20, output_type="numpy").images
+        image = ldm(image=init_image, generator=generator, num_inference_steps=20, output_type="np").images

        image_slice = image[0, -3:, -3:, -1]

@@ -117,7 +117,7 @@ class PaintByExamplePipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -49,10 +49,10 @@ class PNDMPipelineFastTests(unittest.TestCase):
        pndm.set_progress_bar_config(disable=None)

        generator = torch.manual_seed(0)
-        image = pndm(generator=generator, num_inference_steps=20, output_type="numpy").images
+        image = pndm(generator=generator, num_inference_steps=20, output_type="np").images

        generator = torch.manual_seed(0)
-        image_from_tuple = pndm(generator=generator, num_inference_steps=20, output_type="numpy", return_dict=False)[0]
+        image_from_tuple = pndm(generator=generator, num_inference_steps=20, output_type="np", return_dict=False)[0]

        image_slice = image[0, -3:, -3:, -1]
        image_from_tuple_slice = image_from_tuple[0, -3:, -3:, -1]
@@ -77,7 +77,7 @@ class PNDMPipelineIntegrationTests(unittest.TestCase):
        pndm.to(torch_device)
        pndm.set_progress_bar_config(disable=None)
        generator = torch.manual_seed(0)
-        image = pndm(generator=generator, output_type="numpy").images
+        image = pndm(generator=generator, output_type="np").images

        image_slice = image[0, -3:, -3:, -1]

@@ -21,13 +21,13 @@ import torch
 from transformers import CLIPTextConfig, CLIPTextModelWithProjection, CLIPTokenizer

 from diffusers import DDPMWuerstchenScheduler, StableCascadeDecoderPipeline
-from diffusers.image_processor import VaeImageProcessor
 from diffusers.models import StableCascadeUNet
 from diffusers.pipelines.wuerstchen import PaellaVQModel
 from diffusers.utils.testing_utils import (
    enable_full_determinism,
-    load_image,
+    load_numpy,
    load_pt,
+    numpy_cosine_similarity_distance,
    require_torch_gpu,
    skip_mps,
    slow,
@@ -258,7 +258,7 @@ class StableCascadeDecoderPipelineIntegrationTests(unittest.TestCase):

    def test_stable_cascade_decoder(self):
        pipe = StableCascadeDecoderPipeline.from_pretrained(
-            "diffusers/StableCascade-decoder", torch_dtype=torch.bfloat16
+            "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16
        )
        pipe.enable_model_cpu_offload()
        pipe.set_progress_bar_config(disable=None)
@@ -271,18 +271,16 @@ class StableCascadeDecoderPipelineIntegrationTests(unittest.TestCase):
        )

        image = pipe(
-            prompt=prompt, image_embeddings=image_embedding, num_inference_steps=10, generator=generator
+            prompt=prompt,
+            image_embeddings=image_embedding,
+            output_type="np",
+            num_inference_steps=2,
+            generator=generator,
        ).images[0]

-        assert image.size == (1024, 1024)
-
-        expected_image = load_image(
-            "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/t2i.png"
+        assert image.shape == (1024, 1024, 3)
+        expected_image = load_numpy(
+            "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/stable_cascade_decoder_image.npy"
        )
-
-        image_processor = VaeImageProcessor()
-
-        image_np = image_processor.pil_to_numpy(image)
-        expected_image_np = image_processor.pil_to_numpy(expected_image)
-
-        self.assertTrue(np.allclose(image_np, expected_image_np, atol=53e-2))
+        max_diff = numpy_cosine_similarity_distance(image.flatten(), expected_image.flatten())
+        assert max_diff < 1e-4
@@ -29,7 +29,8 @@ from diffusers.models.attention_processor import LoRAAttnProcessor, LoRAAttnProc
 from diffusers.utils.import_utils import is_peft_available
 from diffusers.utils.testing_utils import (
    enable_full_determinism,
-    load_pt,
+    load_numpy,
+    numpy_cosine_similarity_distance,
    require_peft_backend,
    require_torch_gpu,
    skip_mps,
@@ -319,7 +320,9 @@ class StableCascadePriorPipelineIntegrationTests(unittest.TestCase):
        torch.cuda.empty_cache()

    def test_stable_cascade_prior(self):
-        pipe = StableCascadePriorPipeline.from_pretrained("diffusers/StableCascade-prior", torch_dtype=torch.bfloat16)
+        pipe = StableCascadePriorPipeline.from_pretrained(
+            "stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16
+        )
        pipe.enable_model_cpu_offload()
        pipe.set_progress_bar_config(disable=None)

@@ -327,17 +330,12 @@ class StableCascadePriorPipelineIntegrationTests(unittest.TestCase):

        generator = torch.Generator(device="cpu").manual_seed(0)

-        output = pipe(prompt, num_inference_steps=10, generator=generator)
+        output = pipe(prompt, num_inference_steps=2, output_type="np", generator=generator)
        image_embedding = output.image_embeddings
-
-        expected_image_embedding = load_pt(
-            "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/image_embedding.pt"
+        expected_image_embedding = load_numpy(
+            "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_cascade/stable_cascade_prior_image_embeddings.npy"
        )
-
        assert image_embedding.shape == (1, 16, 24, 24)

-        self.assertTrue(
-            np.allclose(
-                image_embedding.cpu().float().numpy(), expected_image_embedding.cpu().float().numpy(), atol=5e-2
-            )
-        )
+        max_diff = numpy_cosine_similarity_distance(image_embedding.flatten(), expected_image_embedding.flatten())
+        assert max_diff < 1e-4
@@ -46,7 +46,7 @@ class OnnxStableDiffusionPipelineFastTests(OnnxPipelineTesterMixin, unittest.Tes
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -55,7 +55,7 @@ class OnnxStableDiffusionImg2ImgPipelineFastTests(OnnxPipelineTesterMixin, unitt
            "num_inference_steps": 3,
            "strength": 0.75,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -55,7 +55,7 @@ class OnnxStableDiffusionUpscalePipelineFastTests(OnnxPipelineTesterMixin, unitt
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -775,7 +775,7 @@ class StableDiffusionPipelineSlowTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -950,7 +950,7 @@ class StableDiffusionPipelineSlowTests(unittest.TestCase):
            generator=generator,
            guidance_scale=7.5,
            num_inference_steps=2,
-            output_type="numpy",
+            output_type="np",
        )
        image_chunked = output_chunked.images

@@ -966,7 +966,7 @@ class StableDiffusionPipelineSlowTests(unittest.TestCase):
            generator=generator,
            guidance_scale=7.5,
            num_inference_steps=2,
-            output_type="numpy",
+            output_type="np",
        )
        image = output.images

@@ -179,7 +179,7 @@ class StableDiffusionImg2ImgPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -199,7 +199,7 @@ class StableDiffusionInpaintPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -470,7 +470,7 @@ class StableDiffusionSimpleInpaintPipelineFastTests(StableDiffusionInpaintPipeli
            "generator": [generator1, generator2],
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -586,7 +586,7 @@ class StableDiffusionInpaintPipelineSlowTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -847,7 +847,7 @@ class StableDiffusionInpaintPipelineAsymmetricAutoencoderKLSlowTests(unittest.Te
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -1072,7 +1072,7 @@ class StableDiffusionInpaintPipelineNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 50,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -131,7 +131,7 @@ class StableDiffusionInstructPix2PixPipelineFastTests(
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
            "image_guidance_scale": 1,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -288,7 +288,7 @@ class StableDiffusionInstructPix2PixPipelineSlowTests(unittest.TestCase):
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
            "image_guidance_scale": 1.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -151,7 +151,7 @@ class StableDiffusion2PipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -336,7 +336,7 @@ class StableDiffusion2PipelineSlowTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -557,7 +557,7 @@ class StableDiffusion2PipelineNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 50,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -138,7 +138,7 @@ class StableDiffusionAttendAndExcitePipelineFastTests(
            "generator": generator,
            "num_inference_steps": 1,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "max_iter_to_alter": 2,
            "thresholds": {0: 0.7},
        }
@@ -225,7 +225,7 @@ class StableDiffusionAttendAndExcitePipelineIntegrationTests(unittest.TestCase):
            generator=generator,
            num_inference_steps=5,
            max_iter_to_alter=5,
-            output_type="numpy",
+            output_type="np",
        ).images[0]

        expected_image = load_numpy(
@@ -174,7 +174,7 @@ class StableDiffusionDepth2ImgPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -395,7 +395,7 @@ class StableDiffusionDepth2ImgPipelineSlowTests(unittest.TestCase):
            "num_inference_steps": 3,
            "strength": 0.75,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -534,7 +534,7 @@ class StableDiffusionImg2ImgPipelineNightlyTests(unittest.TestCase):
            "num_inference_steps": 3,
            "strength": 0.75,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -143,7 +143,7 @@ class StableDiffusionDiffEditPipelineFastTests(PipelineLatentTesterMixin, Pipeli
            "num_inference_steps": 2,
            "inpaint_strength": 1.0,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -165,7 +165,7 @@ class StableDiffusionDiffEditPipelineFastTests(PipelineLatentTesterMixin, Pipeli
            "num_maps_per_mask": 2,
            "mask_encode_strength": 1.0,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }

        return inputs
@@ -186,7 +186,7 @@ class StableDiffusionDiffEditPipelineFastTests(PipelineLatentTesterMixin, Pipeli
            "inpaint_strength": 1.0,
            "guidance_scale": 6.0,
            "decode_latents": True,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -417,7 +417,7 @@ class StableDiffusionDiffEditPipelineNightlyTests(unittest.TestCase):
            negative_prompt=source_prompt,
            inpaint_strength=0.7,
            num_inference_steps=25,
-            output_type="numpy",
+            output_type="np",
        ).images[0]

        expected_image = (
@@ -129,7 +129,7 @@ class StableDiffusion2InpaintPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -155,7 +155,7 @@ class StableDiffusionLatentUpscalePipelineFastTests(
            "image": self.dummy_image.cpu(),
            "generator": generator,
            "num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -522,7 +522,7 @@ class StableDiffusionUpscalePipelineIntegrationTests(unittest.TestCase):
        ckpt_path = (
            "https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler/blob/main/x4-upscaler-ema.safetensors"
        )
-        single_file_pipe = StableDiffusionUpscalePipeline.from_single_file(ckpt_path)
+        single_file_pipe = StableDiffusionUpscalePipeline.from_single_file(ckpt_path, load_safety_checker=True)

        for param_name, param_value in single_file_pipe.text_encoder.config.to_dict().items():
            if param_name in ["torch_dtype", "architectures", "_name_or_path"]:
@@ -540,12 +540,13 @@ class StableDiffusionUpscalePipelineIntegrationTests(unittest.TestCase):
        for param_name, param_value in single_file_pipe.vae.config.items():
            if param_name in PARAMS_TO_IGNORE:
                continue
-
-            # The sample_size parameter for the VAE is incorrectly configured on the hub
-            # It must be 512, but it is 256 on the hub
-            if param_name == "sample_size":
-                pipe.vae.config[param_name] = param_value
-
            assert (
                pipe.vae.config[param_name] == param_value
            ), f"{param_name} differs between single file loading and pretrained loading"
+
+        for param_name, param_value in single_file_pipe.safety_checker.config.to_dict().items():
+            if param_name in PARAMS_TO_IGNORE:
+                continue
+            assert (
+                pipe.safety_checker.config.to_dict()[param_name] == param_value
+            ), f"{param_name} differs between single file loading and pretrained loading"
@@ -308,7 +308,7 @@ class StableDiffusion2VPredictionPipelineIntegrationTests(unittest.TestCase):
        prompt = "A painting of a squirrel eating a burger"
        generator = torch.manual_seed(0)

-        output = sd_pipe([prompt], generator=generator, num_inference_steps=5, output_type="numpy")
+        output = sd_pipe([prompt], generator=generator, num_inference_steps=5, output_type="np")
        image = output.images

        image_slice = image[0, 253:256, 253:256, -1]
@@ -335,7 +335,7 @@ class StableDiffusion2VPredictionPipelineIntegrationTests(unittest.TestCase):
        prompt = "a photograph of an astronaut riding a horse"
        generator = torch.manual_seed(0)
        image = sd_pipe(
-            [prompt], generator=generator, guidance_scale=7.5, num_inference_steps=5, output_type="numpy"
+            [prompt], generator=generator, guidance_scale=7.5, num_inference_steps=5, output_type="np"
        ).images

        image_slice = image[0, 253:256, 253:256, -1]
@@ -357,7 +357,7 @@ class StableDiffusion2VPredictionPipelineIntegrationTests(unittest.TestCase):
        pipe.enable_attention_slicing()
        generator = torch.manual_seed(0)
        output_chunked = pipe(
-            [prompt], generator=generator, guidance_scale=7.5, num_inference_steps=10, output_type="numpy"
+            [prompt], generator=generator, guidance_scale=7.5, num_inference_steps=10, output_type="np"
        )
        image_chunked = output_chunked.images

@@ -369,7 +369,7 @@ class StableDiffusion2VPredictionPipelineIntegrationTests(unittest.TestCase):
        # disable slicing
        pipe.disable_attention_slicing()
        generator = torch.manual_seed(0)
-        output = pipe([prompt], generator=generator, guidance_scale=7.5, num_inference_steps=10, output_type="numpy")
+        output = pipe([prompt], generator=generator, guidance_scale=7.5, num_inference_steps=10, output_type="np")
        image = output.images

        # make sure that more than 3.0 GB is allocated
@@ -246,7 +246,7 @@ class AdapterTests:
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -117,7 +117,7 @@ class StableDiffusionImageVariationPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -293,7 +293,7 @@ class StableDiffusionImageVariationPipelineNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 50,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -107,7 +107,7 @@ class StableDiffusionLDM3DPipelineFastTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -222,7 +222,7 @@ class StableDiffusionLDM3DPipelineSlowTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -268,7 +268,7 @@ class StableDiffusionPipelineNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 50,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -105,7 +105,7 @@ class StableDiffusionPanoramaPipelineFastTests(PipelineLatentTesterMixin, Pipeli
            "width": None,
            "num_inference_steps": 1,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -263,7 +263,7 @@ class StableDiffusionPanoramaNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 7.5,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -290,7 +290,7 @@ class StableDiffusionXLAdapterPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 5.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -143,7 +143,7 @@ class StableDiffusionXLInstructPix2PixPipelineFastTests(
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
            "image_guidance_scale": 1,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -168,7 +168,7 @@ class StableUnCLIPPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "prior_num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -117,10 +117,10 @@ def _test_from_save_pretrained_dynamo(in_queue, out_queue, timeout):
            new_ddpm.to(torch_device)

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        image = ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+        image = ddpm(generator=generator, num_inference_steps=5, output_type="np").images

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        new_image = new_ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+        new_image = new_ddpm(generator=generator, num_inference_steps=5, output_type="np").images

        assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"
    except Exception:
@@ -363,12 +363,12 @@ class DownloadTests(unittest.TestCase):
        )
        pipe = pipe.to(torch_device)
        generator = torch.manual_seed(0)
-        out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="np").images

        pipe_2 = StableDiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch")
        pipe_2 = pipe_2.to(torch_device)
        generator = torch.manual_seed(0)
-        out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+        out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="np").images

        assert np.max(np.abs(out - out_2)) < 1e-3

@@ -379,7 +379,7 @@ class DownloadTests(unittest.TestCase):
        )
        pipe = pipe.to(torch_device)
        generator = torch.manual_seed(0)
-        out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="np").images

        with tempfile.TemporaryDirectory() as tmpdirname:
            pipe.save_pretrained(tmpdirname)
@@ -388,7 +388,7 @@ class DownloadTests(unittest.TestCase):

            generator = torch.manual_seed(0)

-            out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+            out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="np").images

        assert np.max(np.abs(out - out_2)) < 1e-3

@@ -398,7 +398,7 @@ class DownloadTests(unittest.TestCase):
        pipe = pipe.to(torch_device)

        generator = torch.manual_seed(0)
-        out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="np").images

        with tempfile.TemporaryDirectory() as tmpdirname:
            pipe.save_pretrained(tmpdirname)
@@ -407,7 +407,7 @@ class DownloadTests(unittest.TestCase):

            generator = torch.manual_seed(0)

-            out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+            out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="np").images

        assert np.max(np.abs(out - out_2)) < 1e-3

@@ -590,7 +590,7 @@ class DownloadTests(unittest.TestCase):
                )
                pipe = pipe.to(torch_device)
                generator = torch.manual_seed(0)
-                out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+                out = pipe(prompt, num_inference_steps=2, generator=generator, output_type="np").images

                with tempfile.TemporaryDirectory() as tmpdirname:
                    pipe.save_pretrained(tmpdirname)
@@ -601,7 +601,7 @@ class DownloadTests(unittest.TestCase):

                generator = torch.manual_seed(0)

-                out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="numpy").images
+                out_2 = pipe_2(prompt, num_inference_steps=2, generator=generator, output_type="np").images

                assert np.max(np.abs(out - out_2)) < 1e-3

@@ -626,7 +626,7 @@ class DownloadTests(unittest.TestCase):
            assert pipe._maybe_convert_prompt("<*>", pipe.tokenizer) == "<*>"

            prompt = "hey <*>"
-            out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+            out = pipe(prompt, num_inference_steps=1, output_type="np").images
            assert out.shape == (1, 128, 128, 3)

        # single token load local with weight name
@@ -642,7 +642,7 @@ class DownloadTests(unittest.TestCase):
            assert pipe._maybe_convert_prompt("<**>", pipe.tokenizer) == "<**>"

            prompt = "hey <**>"
-            out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+            out = pipe(prompt, num_inference_steps=1, output_type="np").images
            assert out.shape == (1, 128, 128, 3)

        # multi token load
@@ -665,7 +665,7 @@ class DownloadTests(unittest.TestCase):
            assert pipe._maybe_convert_prompt("<***>", pipe.tokenizer) == "<***> <***>_1 <***>_2"

            prompt = "hey <***>"
-            out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+            out = pipe(prompt, num_inference_steps=1, output_type="np").images
            assert out.shape == (1, 128, 128, 3)

        # multi token load a1111
@@ -693,7 +693,7 @@ class DownloadTests(unittest.TestCase):
            assert pipe._maybe_convert_prompt("<****>", pipe.tokenizer) == "<****> <****>_1 <****>_2"

            prompt = "hey <****>"
-            out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+            out = pipe(prompt, num_inference_steps=1, output_type="np").images
            assert out.shape == (1, 128, 128, 3)

        # multi embedding load
@@ -718,7 +718,7 @@ class DownloadTests(unittest.TestCase):
                assert pipe._maybe_convert_prompt("<******>", pipe.tokenizer) == "<******>"

                prompt = "hey <*****> <******>"
-                out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+                out = pipe(prompt, num_inference_steps=1, output_type="np").images
                assert out.shape == (1, 128, 128, 3)

        # single token state dict load
@@ -731,7 +731,7 @@ class DownloadTests(unittest.TestCase):
        assert pipe._maybe_convert_prompt("<x>", pipe.tokenizer) == "<x>"

        prompt = "hey <x>"
-        out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=1, output_type="np").images
        assert out.shape == (1, 128, 128, 3)

        # multi embedding state dict load
@@ -751,7 +751,7 @@ class DownloadTests(unittest.TestCase):
        assert pipe._maybe_convert_prompt("<xxxxxx>", pipe.tokenizer) == "<xxxxxx>"

        prompt = "hey <xxxxx> <xxxxxx>"
-        out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=1, output_type="np").images
        assert out.shape == (1, 128, 128, 3)

        # auto1111 multi-token state dict load
@@ -777,7 +777,7 @@ class DownloadTests(unittest.TestCase):
        assert pipe._maybe_convert_prompt("<xxxx>", pipe.tokenizer) == "<xxxx> <xxxx>_1 <xxxx>_2"

        prompt = "hey <xxxx>"
-        out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=1, output_type="np").images
        assert out.shape == (1, 128, 128, 3)

        # multiple references to multi embedding
@@ -789,7 +789,7 @@ class DownloadTests(unittest.TestCase):
        )

        prompt = "hey <cat> <cat>"
-        out = pipe(prompt, num_inference_steps=1, output_type="numpy").images
+        out = pipe(prompt, num_inference_steps=1, output_type="np").images
        assert out.shape == (1, 128, 128, 3)

    def test_text_inversion_multi_tokens(self):
@@ -1739,10 +1739,10 @@ class PipelineSlowTests(unittest.TestCase):
            new_ddpm.to(torch_device)

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        image = ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+        image = ddpm(generator=generator, num_inference_steps=5, output_type="np").images

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        new_image = new_ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+        new_image = new_ddpm(generator=generator, num_inference_steps=5, output_type="np").images

        assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"

@@ -1765,10 +1765,10 @@ class PipelineSlowTests(unittest.TestCase):
        ddpm_from_hub.set_progress_bar_config(disable=None)

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        image = ddpm(generator=generator, num_inference_steps=5, output_type="numpy").images
+        image = ddpm(generator=generator, num_inference_steps=5, output_type="np").images

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        new_image = ddpm_from_hub(generator=generator, num_inference_steps=5, output_type="numpy").images
+        new_image = ddpm_from_hub(generator=generator, num_inference_steps=5, output_type="np").images

        assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"

@@ -1788,10 +1788,10 @@ class PipelineSlowTests(unittest.TestCase):
        ddpm_from_hub_custom_model.set_progress_bar_config(disable=None)

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        image = ddpm_from_hub_custom_model(generator=generator, num_inference_steps=5, output_type="numpy").images
+        image = ddpm_from_hub_custom_model(generator=generator, num_inference_steps=5, output_type="np").images

        generator = torch.Generator(device=torch_device).manual_seed(0)
-        new_image = ddpm_from_hub(generator=generator, num_inference_steps=5, output_type="numpy").images
+        new_image = ddpm_from_hub(generator=generator, num_inference_steps=5, output_type="np").images

        assert np.abs(image - new_image).max() < 1e-5, "Models don't give the same forward pass"

@@ -1803,7 +1803,7 @@ class PipelineSlowTests(unittest.TestCase):
        pipe.to(torch_device)
        pipe.set_progress_bar_config(disable=None)

-        images = pipe(output_type="numpy").images
+        images = pipe(output_type="np").images
        assert images.shape == (1, 32, 32, 3)
        assert isinstance(images, np.ndarray)

@@ -1878,7 +1878,7 @@ class PipelineSlowTests(unittest.TestCase):
        generator = [torch.Generator(device="cpu").manual_seed(33) for _ in range(prompt_embeds.shape[0])]

        images = pipe(
-            prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20, output_type="numpy"
+            prompt_embeds=prompt_embeds, generator=generator, num_inference_steps=20, output_type="np"
        ).images

        for i, image in enumerate(images):
@@ -1916,7 +1916,7 @@ class PipelineNightlyTests(unittest.TestCase):
        ddim.set_progress_bar_config(disable=None)

        generator = torch.Generator(device=torch_device).manual_seed(seed)
-        ddpm_images = ddpm(batch_size=2, generator=generator, output_type="numpy").images
+        ddpm_images = ddpm(batch_size=2, generator=generator, output_type="np").images

        generator = torch.Generator(device=torch_device).manual_seed(seed)
        ddim_images = ddim(
@@ -1924,7 +1924,7 @@ class PipelineNightlyTests(unittest.TestCase):
            generator=generator,
            num_inference_steps=1000,
            eta=1.0,
-            output_type="numpy",
+            output_type="np",
            use_clipped_model_output=True,  # Need this to make DDIM match DDPM
        ).images

@@ -233,7 +233,7 @@ class UnCLIPPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            "prior_num_inference_steps": 2,
            "decoder_num_inference_steps": 2,
            "super_res_num_inference_steps": 2,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -158,7 +158,7 @@ class UniDiffuserPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        return inputs

@@ -199,7 +199,7 @@ class UniDiffuserPipelineFastTests(
            "generator": generator,
            "num_inference_steps": 2,
            "guidance_scale": 6.0,
-            "output_type": "numpy",
+            "output_type": "np",
            "prompt_latents": latents.get("prompt_latents"),
            "vae_latents": latents.get("vae_latents"),
            "clip_latents": latents.get("clip_latents"),
@@ -590,7 +590,7 @@ class UniDiffuserPipelineSlowTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 8.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        if generate_latents:
            latents = self.get_fixed_latents(device, seed=seed)
@@ -706,7 +706,7 @@ class UniDiffuserPipelineNightlyTests(unittest.TestCase):
            "generator": generator,
            "num_inference_steps": 3,
            "guidance_scale": 8.0,
-            "output_type": "numpy",
+            "output_type": "np",
        }
        if generate_latents:
            latents = self.get_fixed_latents(device, seed=seed)
Author	SHA1	Message	Date
Sayak Paul	47ee2a737a	Merge branch 'main' into controlnet-test-fixes	2024-03-15 12:30:25 +05:30
Sayak Paul	e0e9f81971	add: torch to the pypi step. (#7328 )	2024-03-15 12:28:12 +05:30
M. Tolga Cangöz	5d848ec07c	[`Tests`] Update a deprecated parameter in test files and fix several typos (#7277 ) * Add properties and `IPAdapterTesterMixin` tests for `StableDiffusionPanoramaPipeline` * Fix variable name typo and update comments * Update deprecated `output_type="numpy"` to "np" in test files * Discard changes to src/diffusers/pipelines/stable_diffusion_panorama/pipeline_stable_diffusion_panorama.py * Update test_stable_diffusion_panorama.py * Update numbers in README.md * Update get_guidance_scale_embedding method to use timesteps instead of w * Update number of checkpoints in README.md * Add type hints and fix var name * Fix PyTorch's convention for inplace functions * Fix a typo * Revert "Fix PyTorch's convention for inplace functions" This reverts commit `74350cf65b`. * Fix typos * Indent * Refactor get_guidance_scale_embedding method in LEditsPPPipelineStableDiffusionXL class	2024-03-14 12:17:35 -07:00
Dhruv Nair	4974b84564	Update Cascade Tests (#7324 ) * update * update * update	2024-03-14 20:51:22 +05:30
Linoy Tsaban	83062fb872	[Advanced DreamBooth LoRA SDXL] Support EDM-style training (follow up of #7126 ) (#7182 ) * add edm style training * style * finish adding edm training feature * import fix * fix latents mean * minor adjustments * add edm to readme * style * fix autocast and scheduler config issues when using edm * style --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2024-03-14 18:40:14 +05:30
Suraj Patil	b6d7e31d10	add edm schedulers in doc (#7319 ) * add edm schedulers in doc * add in toctree * address reviewe comments	2024-03-14 11:52:25 +01:00
Dhruv Nair	94fc2d3fe6	update	2024-03-14 06:53:52 +00:00
Dhruv Nair	503e359204	update	2024-03-14 06:47:29 +00:00
Anatoly Belikov	53e9aacc10	log loss per image (#7278 ) * log loss per image * add commandline param for per image loss logging * style * debug-loss -> debug_loss --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2024-03-14 11:41:43 +05:30
Dhruv Nair	41424466e3	[Tests] Fix incorrect constant in VAE scaling test. (#7301 ) update	2024-03-14 10:24:01 +05:30
Sayak Paul	95de1981c9	add: pytest log installation (#7313 )	2024-03-14 10:01:16 +05:30
Kenneth Gerald Hamilton	0b45b58867	update get_order_list if statement (#7309 ) * update get_order_list if statement * revery	2024-03-13 18:29:42 -10:00