update

2024-08-26 13:10:55 +00:00 · 2024-08-26 06:22:50 +00:00
34 changed files with 230 additions and 961 deletions
@@ -79,7 +79,7 @@ jobs:
          python utils/print_env.py
      - name: Pipeline CUDA Test
        env:
-          HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
          CUBLAS_WORKSPACE_CONFIG: :16:8
        run: |
@@ -139,7 +139,7 @@ jobs:
    - name: Run nightly PyTorch CUDA tests for non-pipeline modules
      if: ${{ matrix.module != 'examples'}}
      env:
-        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
        CUBLAS_WORKSPACE_CONFIG: :16:8
      run: |
@@ -152,7 +152,7 @@ jobs:
    - name: Run nightly example tests with Torch
      if: ${{ matrix.module == 'examples' }}
      env:
-        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
        CUBLAS_WORKSPACE_CONFIG: :16:8
      run: |
@@ -209,7 +209,7 @@ jobs:

    - name: Run nightly Flax TPU tests
      env:
-        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        python -m pytest -n 0 \
          -s -v -k "Flax" \
@@ -264,7 +264,7 @@ jobs:

    - name: Run Nightly ONNXRuntime CUDA tests
      env:
-        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
          -s -v -k "Onnx" \
@@ -77,21 +77,10 @@ CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds o
 - `pipe.enable_model_cpu_offload()`:
  - Without CPU offloading, memory usage is `33 GB`
  - With CPU offloading enabled, memory usage is `19 GB`
- `pipe.enable_sequential_cpu_offload()`:
-  - Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slow inference
-  - When enabled, memory usage is under `4 GB`
 - `pipe.vae.enable_tiling()`:
  - With CPU offloading and tiling enabled, memory usage is `11 GB`
 - `pipe.vae.enable_slicing()`

-### Quantized inference
-
-[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or lower VRAM GPUs!
-
-It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference speed. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below.
- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
-
 ## CogVideoXPipeline

 [[autodoc]] CogVideoXPipeline
@@ -30,64 +30,63 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an

 | Pipeline | Tasks |
 |---|---|
-| [aMUSEd](amused) | text2image |
+| [AltDiffusion](alt_diffusion) | image2image |
 | [AnimateDiff](animatediff) | text2video |
 | [Attend-and-Excite](attend_and_excite) | text2image |
+| [Audio Diffusion](audio_diffusion) | image2audio |
 | [AudioLDM](audioldm) | text2audio |
 | [AudioLDM2](audioldm2) | text2audio |
-| [AuraFlow](auraflow) | text2image |
 | [BLIP Diffusion](blip_diffusion) | text2image |
-| [CogVideoX](cogvideox) | text2video |
 | [Consistency Models](consistency_models) | unconditional image generation |
 | [ControlNet](controlnet) | text2image, image2image, inpainting |
-| [ControlNet with Flux.1](controlnet_flux) | text2image |
-| [ControlNet with Hunyuan-DiT](controlnet_hunyuandit) | text2image |
-| [ControlNet with Stable Diffusion 3](controlnet_sd3) | text2image |
 | [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image |
 | [ControlNet-XS](controlnetxs) | text2image |
 | [ControlNet-XS with Stable Diffusion XL](controlnetxs_sdxl) | text2image |
+| [Cycle Diffusion](cycle_diffusion) | image2image |
 | [Dance Diffusion](dance_diffusion) | unconditional audio generation |
 | [DDIM](ddim) | unconditional image generation |
 | [DDPM](ddpm) | unconditional image generation |
 | [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution |
 | [DiffEdit](diffedit) | inpainting |
 | [DiT](dit) | text2image |
-| [Flux](flux) | text2image |
-| [Hunyuan-DiT](hunyuandit) | text2image |
-| [I2VGen-XL](i2vgenxl) | text2video |
+| [GLIGEN](stable_diffusion/gligen) | text2image |
 | [InstructPix2Pix](pix2pix) | image editing |
 | [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation |
 | [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting |
 | [Kandinsky 3](kandinsky3) | text2image, image2image |
-| [Kolors](kolors) | text2image |
 | [Latent Consistency Models](latent_consistency_models) | text2image |
 | [Latent Diffusion](latent_diffusion) | text2image, super-resolution |
-| [Latte](latte) | text2image |
+| [LDM3D](stable_diffusion/ldm3d_diffusion) | text2image, text-to-3D, text-to-pano, upscaling |
 | [LEDITS++](ledits_pp) | image editing |
-| [Lumina-T2X](lumina) | text2image |
-| [Marigold](marigold) | depth |
 | [MultiDiffusion](panorama) | text2image |
 | [MusicLDM](musicldm) | text2audio |
-| [PAG](pag) | text2image |
 | [Paint by Example](paint_by_example) | inpainting |
-| [PIA](pia) | image2video |
+| [ParaDiGMS](paradigms) | text2image |
+| [Pix2Pix Zero](pix2pix_zero) | image editing |
 | [PixArt-α](pixart) | text2image |
-| [PixArt-Σ](pixart_sigma) | text2image |
+| [PNDM](pndm) | unconditional image generation |
+| [RePaint](repaint) | inpainting |
+| [Score SDE VE](score_sde_ve) | unconditional image generation |
 | [Self-Attention Guidance](self_attention_guidance) | text2image |
 | [Semantic Guidance](semantic_stable_diffusion) | text2image |
 | [Shap-E](shap_e) | text-to-3D, image-to-3D |
+| [Spectrogram Diffusion](spectrogram_diffusion) |  |
 | [Stable Audio](stable_audio) | text2audio |
-| [Stable Cascade](stable_cascade) | text2image |
 | [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
+| [Stable Diffusion Model Editing](model_editing) | model editing |
 | [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting |
 | [Stable Diffusion XL Turbo](stable_diffusion/sdxl_turbo) | text2image, image2image, inpainting |
 | [Stable unCLIP](stable_unclip) | text2image, image variation |
+| [Stochastic Karras VE](stochastic_karras_ve) | unconditional image generation |
 | [T2I-Adapter](stable_diffusion/adapter) | text2image |
 | [Text2Video](text_to_video) | text2video, video2video |
 | [Text2Video-Zero](text_to_video_zero) | text2video |
 | [unCLIP](unclip) | text2image, image variation |
+| [Unconditional Latent Diffusion](latent_diffusion_uncond) | unconditional image generation |
 | [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation |
 | [Value-guided planning](value_guided_sampling) | value guided sampling |
+| [Versatile Diffusion](versatile_diffusion) | text2image, image variation |
+| [VQ Diffusion](vq_diffusion) | text2image |
 | [Wuerstchen](wuerstchen) | text2image |

 ## DiffusionPipeline
@@ -314,12 +314,11 @@ def save_new_embed(text_encoder, modifier_token_id, accelerator, args, output_di
    for x, y in zip(modifier_token_id, args.modifier_token):
        learned_embeds_dict = {}
        learned_embeds_dict[y] = learned_embeds[x]
+        filename = f"{output_dir}/{y}.bin"

        if safe_serialization:
-            filename = f"{output_dir}/{y}.safetensors"
            safetensors.torch.save_file(learned_embeds_dict, filename, metadata={"format": "pt"})
        else:
-            filename = f"{output_dir}/{y}.bin"
            torch.save(learned_embeds_dict, filename)


@@ -1041,22 +1040,17 @@ def main(args):
    )

    # Scheduler and math around the number of training steps.
-    # Check the PR https://github.com/huggingface/diffusers/pull/8312 for detailed explanation.
-    num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
+    overrode_max_train_steps = False
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    if args.max_train_steps is None:
-        len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)
-        num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)
-        num_training_steps_for_scheduler = (
-            args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
-        )
-    else:
-        num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes
+        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
+        overrode_max_train_steps = True

    lr_scheduler = get_scheduler(
        args.lr_scheduler,
        optimizer=optimizer,
-        num_warmup_steps=num_warmup_steps_for_scheduler,
-        num_training_steps=num_training_steps_for_scheduler,
+        num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
+        num_training_steps=args.max_train_steps * accelerator.num_processes,
    )

    # Prepare everything with our `accelerator`.
@@ -1071,14 +1065,8 @@ def main(args):

    # We need to recalculate our total training steps as the size of the training dataloader may have changed.
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
-    if args.max_train_steps is None:
+    if overrode_max_train_steps:
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
-        if num_training_steps_for_scheduler != args.max_train_steps * accelerator.num_processes:
-            logger.warning(
-                f"The length of the 'train_dataloader' after 'accelerator.prepare' ({len(train_dataloader)}) does not match "
-                f"the expected length ({len_train_dataloader_after_sharding}) when the learning rate scheduler was created. "
-                f"This inconsistency may result in the learning rate scheduler not functioning properly."
-            )
    # Afterwards we recalculate our number of training epochs
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
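
The step arithmetic that this hunk restores can be sketched as a standalone function. This is a minimal illustration, not the training script itself: `scheduler_steps` and its plain-integer arguments are hypothetical names standing in for `len(train_dataloader)`, the `args.*` fields, and `accelerator.num_processes`.

```python
import math


def scheduler_steps(
    dataloader_len,
    gradient_accumulation_steps,
    num_train_epochs,
    lr_warmup_steps,
    num_processes,
    max_train_steps=None,
):
    """Hypothetical helper mirroring the arithmetic on the added (+) lines."""
    # One optimizer update happens every `gradient_accumulation_steps` batches.
    num_update_steps_per_epoch = math.ceil(dataloader_len / gradient_accumulation_steps)

    # If the user gave no step budget, derive it from the epoch count and
    # remember that it was derived so it can be recomputed later.
    overrode_max_train_steps = False
    if max_train_steps is None:
        max_train_steps = num_train_epochs * num_update_steps_per_epoch
        overrode_max_train_steps = True

    return {
        "max_train_steps": max_train_steps,
        "overrode_max_train_steps": overrode_max_train_steps,
        # Both scheduler arguments are multiplied by the process count,
        # as in the `get_scheduler(...)` call in the hunk.
        "num_warmup_steps": lr_warmup_steps * num_processes,
        "num_training_steps": max_train_steps * num_processes,
    }


steps = scheduler_steps(
    dataloader_len=100,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    lr_warmup_steps=10,
    num_processes=2,
)
```

With these inputs there are 25 updates per epoch, so `max_train_steps` is 75 and the scheduler receives 20 warmup steps and 150 training steps. Note that the dataloader length here is measured before `accelerator.prepare`, which is why the script recomputes `max_train_steps` afterwards when `overrode_max_train_steps` is set.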

@@ -842,7 +842,7 @@ class PromptDataset(Dataset):
        return example


-def tokenize_prompt(tokenizer, prompt, max_sequence_length):
+def tokenize_prompt(tokenizer, prompt, max_sequence_length=512):
    text_inputs = tokenizer(
        prompt,
        padding="max_length",
@@ -863,26 +863,20 @@ def _encode_prompt_with_t5(
    prompt=None,
    num_images_per_prompt=1,
    device=None,
-    text_input_ids=None,
 ):
    prompt = [prompt] if isinstance(prompt, str) else prompt
    batch_size = len(prompt)

-    if tokenizer is not None:
-        text_inputs = tokenizer(
-            prompt,
-            padding="max_length",
-            max_length=max_sequence_length,
-            truncation=True,
-            return_length=False,
-            return_overflowing_tokens=False,
-            return_tensors="pt",
-        )
-        text_input_ids = text_inputs.input_ids
-    else:
-        if text_input_ids is None:
-            raise ValueError("text_input_ids must be provided when the tokenizer is not specified")
-
+    text_inputs = tokenizer(
+        prompt,
+        padding="max_length",
+        max_length=max_sequence_length,
+        truncation=True,
+        return_length=False,
+        return_overflowing_tokens=False,
+        return_tensors="pt",
+    )
+    text_input_ids = text_inputs.input_ids
    prompt_embeds = text_encoder(text_input_ids.to(device))[0]

    dtype = text_encoder.dtype
@@ -902,28 +896,22 @@ def _encode_prompt_with_clip(
    tokenizer,
    prompt: str,
    device=None,
-    text_input_ids=None,
    num_images_per_prompt: int = 1,
 ):
    prompt = [prompt] if isinstance(prompt, str) else prompt
    batch_size = len(prompt)

-    if tokenizer is not None:
-        text_inputs = tokenizer(
-            prompt,
-            padding="max_length",
-            max_length=77,
-            truncation=True,
-            return_overflowing_tokens=False,
-            return_length=False,
-            return_tensors="pt",
-        )
-
-        text_input_ids = text_inputs.input_ids
-    else:
-        if text_input_ids is None:
-            raise ValueError("text_input_ids must be provided when the tokenizer is not specified")
+    text_inputs = tokenizer(
+        prompt,
+        padding="max_length",
+        max_length=77,
+        truncation=True,
+        return_overflowing_tokens=False,
+        return_length=False,
+        return_tensors="pt",
+    )

+    text_input_ids = text_inputs.input_ids
    prompt_embeds = text_encoder(text_input_ids.to(device), output_hidden_states=False)

    # Use pooled output of CLIPTextModel
@@ -944,19 +932,17 @@ def encode_prompt(
    max_sequence_length,
    device=None,
    num_images_per_prompt: int = 1,
-    text_input_ids_list=None,
 ):
    prompt = [prompt] if isinstance(prompt, str) else prompt
    batch_size = len(prompt)
    dtype = text_encoders[0].dtype
-    device = device if device is not None else text_encoders[1].device
+
    pooled_prompt_embeds = _encode_prompt_with_clip(
        text_encoder=text_encoders[0],
        tokenizer=tokenizers[0],
        prompt=prompt,
-        device=device,
+        device=device if device is not None else text_encoders[0].device,
        num_images_per_prompt=num_images_per_prompt,
-        text_input_ids=text_input_ids_list[0] if text_input_ids_list else None,
    )

    prompt_embeds = _encode_prompt_with_t5(
@@ -965,8 +951,7 @@ def encode_prompt(
        max_sequence_length=max_sequence_length,
        prompt=prompt,
        num_images_per_prompt=num_images_per_prompt,
-        device=device,
-        text_input_ids=text_input_ids_list[1] if text_input_ids_list else None,
+        device=device if device is not None else text_encoders[1].device,
    )

    text_ids = torch.zeros(batch_size, prompt_embeds.shape[1], 3).to(device=device, dtype=dtype)
@@ -1514,25 +1499,7 @@ def main(args):
                        )
                    else:
                        tokens_one = tokenize_prompt(tokenizer_one, prompts, max_sequence_length=77)
-                        tokens_two = tokenize_prompt(
-                            tokenizer_two, prompts, max_sequence_length=args.max_sequence_length
-                        )
-                        prompt_embeds, pooled_prompt_embeds, text_ids = encode_prompt(
-                            text_encoders=[text_encoder_one, text_encoder_two],
-                            tokenizers=[None, None],
-                            text_input_ids_list=[tokens_one, tokens_two],
-                            max_sequence_length=args.max_sequence_length,
-                            prompt=prompts,
-                        )
-                else:
-                    if args.train_text_encoder:
-                        prompt_embeds, pooled_prompt_embeds, text_ids = encode_prompt(
-                            text_encoders=[text_encoder_one, text_encoder_two],
-                            tokenizers=[None, None],
-                            text_input_ids_list=[tokens_one, tokens_two],
-                            max_sequence_length=args.max_sequence_length,
-                            prompt=args.instance_prompt,
-                        )
+                        tokens_two = tokenize_prompt(tokenizer_two, prompts, max_sequence_length=512)

                # Convert images to latent space
                model_input = vae.encode(pixel_values).latent_dist.sample()
@@ -1586,22 +1553,41 @@ def main(args):
                    guidance = None

                # Predict the noise residual
-                model_pred = transformer(
-                    hidden_states=packed_noisy_model_input,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
-                    timestep=timesteps / 1000,
-                    guidance=guidance,
-                    pooled_projections=pooled_prompt_embeds,
-                    encoder_hidden_states=prompt_embeds,
-                    txt_ids=text_ids,
-                    img_ids=latent_image_ids,
-                    return_dict=False,
-                )[0]
-                # upscaling height & width as discussed in https://github.com/huggingface/diffusers/pull/9257#discussion_r1731108042
+                if not args.train_text_encoder:
+                    model_pred = transformer(
+                        hidden_states=packed_noisy_model_input,
+                        # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
+                        timestep=timesteps / 1000,
+                        guidance=guidance,
+                        pooled_projections=pooled_prompt_embeds,
+                        encoder_hidden_states=prompt_embeds,
+                        txt_ids=text_ids,
+                        img_ids=latent_image_ids,
+                        return_dict=False,
+                    )[0]
+                else:
+                    prompt_embeds, pooled_prompt_embeds, text_ids = encode_prompt(
+                        text_encoders=[text_encoder_one, text_encoder_two],
+                        tokenizers=None,
+                        prompt=None,
+                        text_input_ids_list=[tokens_one, tokens_two],
+                    )
+                    model_pred = transformer(
+                        hidden_states=packed_noisy_model_input,
+                        # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
+                        timestep=timesteps / 1000,
+                        guidance=guidance,
+                        pooled_projections=pooled_prompt_embeds,
+                        encoder_hidden_states=prompt_embeds,
+                        txt_ids=text_ids,
+                        img_ids=latent_image_ids,
+                        return_dict=False,
+                    )[0]
+
                model_pred = FluxPipeline._unpack_latents(
                    model_pred,
-                    height=int(model_input.shape[2] * vae_scale_factor / 2),
-                    width=int(model_input.shape[3] * vae_scale_factor / 2),
+                    height=int(model_input.shape[2]),
+                    width=int(model_input.shape[3]),
                    vae_scale_factor=vae_scale_factor,
                )

@@ -89,7 +89,6 @@ else:
            "ControlNetXSAdapter",
            "DiTTransformer2DModel",
            "FluxControlNetModel",
-            "FluxMultiControlNetModel",
            "FluxTransformer2DModel",
            "HunyuanDiT2DControlNetModel",
            "HunyuanDiT2DModel",
@@ -208,8 +208,6 @@ class IPAdapterMixin:
                            pretrained_model_name_or_path_or_dict,
                            subfolder=image_encoder_subfolder,
                            low_cpu_mem_usage=low_cpu_mem_usage,
-                            cache_dir=cache_dir,
-                            local_files_only=local_files_only,
                        ).to(self.device, dtype=self.dtype)
                        self.register_modules(image_encoder=image_encoder)
                    else:
@@ -14,8 +14,6 @@

 import re

-import torch
-
 from ..utils import is_peft_version, logging


@@ -328,294 +326,3 @@ def _get_alpha_name(lora_name_alpha, diffusers_name, alpha):
        prefix = "text_encoder_2."
    new_name = prefix + diffusers_name.split(".lora.")[0] + ".alpha"
    return {new_name: alpha}
-
-
-# The utilities under `_convert_kohya_flux_lora_to_diffusers()`
-# are taken from https://github.com/kohya-ss/sd-scripts/blob/a61cf73a5cb5209c3f4d1a3688dd276a4dfd1ecb/networks/convert_flux_lora.py
-# All credits go to `kohya-ss`.
-def _convert_kohya_flux_lora_to_diffusers(state_dict):
-    def _convert_to_ai_toolkit(sds_sd, ait_sd, sds_key, ait_key):
-        if sds_key + ".lora_down.weight" not in sds_sd:
-            return
-        down_weight = sds_sd.pop(sds_key + ".lora_down.weight")
-
-        # scale weight by alpha and dim
-        rank = down_weight.shape[0]
-        alpha = sds_sd.pop(sds_key + ".alpha").item()  # alpha is scalar
-        scale = alpha / rank  # LoRA is scaled by 'alpha / rank' in forward pass, so we need to scale it back here
-
-        # calculate scale_down and scale_up so that scale_down * scale_up == scale; the loop below only rebalances the two when scale < 0.5
-        scale_down = scale
-        scale_up = 1.0
-        while scale_down * 2 < scale_up:
-            scale_down *= 2
-            scale_up /= 2
-
-        ait_sd[ait_key + ".lora_A.weight"] = down_weight * scale_down
-        ait_sd[ait_key + ".lora_B.weight"] = sds_sd.pop(sds_key + ".lora_up.weight") * scale_up
-
-    def _convert_to_ai_toolkit_cat(sds_sd, ait_sd, sds_key, ait_keys, dims=None):
-        if sds_key + ".lora_down.weight" not in sds_sd:
-            return
-        down_weight = sds_sd.pop(sds_key + ".lora_down.weight")
-        up_weight = sds_sd.pop(sds_key + ".lora_up.weight")
-        sd_lora_rank = down_weight.shape[0]
-
-        # scale weight by alpha and dim
-        alpha = sds_sd.pop(sds_key + ".alpha")
-        scale = alpha / sd_lora_rank
-
-        # calculate scale_down and scale_up
-        scale_down = scale
-        scale_up = 1.0
-        while scale_down * 2 < scale_up:
-            scale_down *= 2
-            scale_up /= 2
-
-        down_weight = down_weight * scale_down
-        up_weight = up_weight * scale_up
-
-        # calculate dims if not provided
-        num_splits = len(ait_keys)
-        if dims is None:
-            dims = [up_weight.shape[0] // num_splits] * num_splits
-        else:
-            assert sum(dims) == up_weight.shape[0]
-
-        # check upweight is sparse or not
-        is_sparse = False
-        if sd_lora_rank % num_splits == 0:
-            ait_rank = sd_lora_rank // num_splits
-            is_sparse = True
-            i = 0
-            for j in range(len(dims)):
-                for k in range(len(dims)):
-                    if j == k:
-                        continue
-                    is_sparse = is_sparse and torch.all(
-                        up_weight[i : i + dims[j], k * ait_rank : (k + 1) * ait_rank] == 0
-                    )
-                i += dims[j]
-            if is_sparse:
-                logger.info(f"weight is sparse: {sds_key}")
-
-        # make ai-toolkit weight
-        ait_down_keys = [k + ".lora_A.weight" for k in ait_keys]
-        ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]
-        if not is_sparse:
-            # down_weight is copied to each split
-            ait_sd.update({k: down_weight for k in ait_down_keys})
-
-            # up_weight is split to each split
-            ait_sd.update({k: v for k, v in zip(ait_up_keys, torch.split(up_weight, dims, dim=0))})  # noqa: C416
-        else:
-            # down_weight is chunked to each split
-            ait_sd.update({k: v for k, v in zip(ait_down_keys, torch.chunk(down_weight, num_splits, dim=0))})  # noqa: C416
-
-            # up_weight is sparse: only non-zero values are copied to each split
-            i = 0
-            for j in range(len(dims)):
-                ait_sd[ait_up_keys[j]] = up_weight[i : i + dims[j], j * ait_rank : (j + 1) * ait_rank].contiguous()
-                i += dims[j]
-
-    def _convert_sd_scripts_to_ai_toolkit(sds_sd):
-        ait_sd = {}
-        for i in range(19):
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_img_attn_proj",
-                f"transformer.transformer_blocks.{i}.attn.to_out.0",
-            )
-            _convert_to_ai_toolkit_cat(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_img_attn_qkv",
-                [
-                    f"transformer.transformer_blocks.{i}.attn.to_q",
-                    f"transformer.transformer_blocks.{i}.attn.to_k",
-                    f"transformer.transformer_blocks.{i}.attn.to_v",
-                ],
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_img_mlp_0",
-                f"transformer.transformer_blocks.{i}.ff.net.0.proj",
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_img_mlp_2",
-                f"transformer.transformer_blocks.{i}.ff.net.2",
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_img_mod_lin",
-                f"transformer.transformer_blocks.{i}.norm1.linear",
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_txt_attn_proj",
-                f"transformer.transformer_blocks.{i}.attn.to_add_out",
-            )
-            _convert_to_ai_toolkit_cat(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_txt_attn_qkv",
-                [
-                    f"transformer.transformer_blocks.{i}.attn.add_q_proj",
-                    f"transformer.transformer_blocks.{i}.attn.add_k_proj",
-                    f"transformer.transformer_blocks.{i}.attn.add_v_proj",
-                ],
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_txt_mlp_0",
-                f"transformer.transformer_blocks.{i}.ff_context.net.0.proj",
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_txt_mlp_2",
-                f"transformer.transformer_blocks.{i}.ff_context.net.2",
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_double_blocks_{i}_txt_mod_lin",
-                f"transformer.transformer_blocks.{i}.norm1_context.linear",
-            )
-
-        for i in range(38):
-            _convert_to_ai_toolkit_cat(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_single_blocks_{i}_linear1",
-                [
-                    f"transformer.single_transformer_blocks.{i}.attn.to_q",
-                    f"transformer.single_transformer_blocks.{i}.attn.to_k",
-                    f"transformer.single_transformer_blocks.{i}.attn.to_v",
-                    f"transformer.single_transformer_blocks.{i}.proj_mlp",
-                ],
-                dims=[3072, 3072, 3072, 12288],
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_single_blocks_{i}_linear2",
-                f"transformer.single_transformer_blocks.{i}.proj_out",
-            )
-            _convert_to_ai_toolkit(
-                sds_sd,
-                ait_sd,
-                f"lora_unet_single_blocks_{i}_modulation_lin",
-                f"transformer.single_transformer_blocks.{i}.norm.linear",
-            )
-
-        if len(sds_sd) > 0:
-            logger.warning(f"Unsupported keys for ai-toolkit: {sds_sd.keys()}")
-
-        return ait_sd
-
-    return _convert_sd_scripts_to_ai_toolkit(state_dict)
-
-
-# Adapted from https://gist.github.com/Leommm-byte/6b331a1e9bd53271210b26543a7065d6
-# Some utilities were reused from
-# https://github.com/kohya-ss/sd-scripts/blob/a61cf73a5cb5209c3f4d1a3688dd276a4dfd1ecb/networks/convert_flux_lora.py
-def _convert_xlabs_flux_lora_to_diffusers(old_state_dict):
-    new_state_dict = {}
-    orig_keys = list(old_state_dict.keys())
-
-    def handle_qkv(sds_sd, ait_sd, sds_key, ait_keys, dims=None):
-        down_weight = sds_sd.pop(sds_key)
-        up_weight = sds_sd.pop(sds_key.replace(".down.weight", ".up.weight"))
-
-        # calculate dims if not provided
-        num_splits = len(ait_keys)
-        if dims is None:
-            dims = [up_weight.shape[0] // num_splits] * num_splits
-        else:
-            assert sum(dims) == up_weight.shape[0]
-
-        # make ai-toolkit weight
-        ait_down_keys = [k + ".lora_A.weight" for k in ait_keys]
-        ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]
-
-        # down_weight is copied to each split
-        ait_sd.update({k: down_weight for k in ait_down_keys})
-
-        # up_weight is split to each split
-        ait_sd.update({k: v for k, v in zip(ait_up_keys, torch.split(up_weight, dims, dim=0))})  # noqa: C416
-
-    for old_key in orig_keys:
-        # Handle double_blocks
-        if old_key.startswith(("diffusion_model.double_blocks", "double_blocks")):
-            block_num = re.search(r"double_blocks\.(\d+)", old_key).group(1)
-            new_key = f"transformer.transformer_blocks.{block_num}"
-
-            if "processor.proj_lora1" in old_key:
-                new_key += ".attn.to_out.0"
-            elif "processor.proj_lora2" in old_key:
-                new_key += ".attn.to_add_out"
-            elif "processor.qkv_lora1" in old_key and "up" not in old_key:
-                handle_qkv(
-                    old_state_dict,
-                    new_state_dict,
-                    old_key,
-                    [
-                        f"transformer.transformer_blocks.{block_num}.attn.add_q_proj",
-                        f"transformer.transformer_blocks.{block_num}.attn.add_k_proj",
-                        f"transformer.transformer_blocks.{block_num}.attn.add_v_proj",
-                    ],
-                )
-                # continue
-            elif "processor.qkv_lora2" in old_key and "up" not in old_key:
-                handle_qkv(
-                    old_state_dict,
-                    new_state_dict,
-                    old_key,
-                    [
-                        f"transformer.transformer_blocks.{block_num}.attn.to_q",
-                        f"transformer.transformer_blocks.{block_num}.attn.to_k",
-                        f"transformer.transformer_blocks.{block_num}.attn.to_v",
-                    ],
-                )
-                # continue
-
-            if "down" in old_key:
-                new_key += ".lora_A.weight"
-            elif "up" in old_key:
-                new_key += ".lora_B.weight"
-
-        # Handle single_blocks
-        elif old_key.startswith(("diffusion_model.single_blocks", "single_blocks")):
-            block_num = re.search(r"single_blocks\.(\d+)", old_key).group(1)
-            new_key = f"transformer.single_transformer_blocks.{block_num}"
-
-            if "proj_lora1" in old_key or "proj_lora2" in old_key:
-                new_key += ".proj_out"
-            elif "qkv_lora1" in old_key or "qkv_lora2" in old_key:
-                new_key += ".norm.linear"
-
-            if "down" in old_key:
-                new_key += ".lora_A.weight"
-            elif "up" in old_key:
-                new_key += ".lora_B.weight"
-
-        else:
-            # Handle other potential key patterns here
-            new_key = old_key
-
-        # Since we already handle qkv above.
-        if "qkv" not in old_key:
-            new_state_dict[new_key] = old_state_dict.pop(old_key)
-
-    if len(old_state_dict) > 0:
-        raise ValueError(f"`old_state_dict` should be at this point but has: {list(old_state_dict.keys())}.")
-
-    return new_state_dict
@@ -31,12 +31,7 @@ from ..utils import (
    scale_lora_layers,
 )
 from .lora_base import LoraBaseMixin
-from .lora_conversion_utils import (
-    _convert_kohya_flux_lora_to_diffusers,
-    _convert_non_diffusers_lora_to_diffusers,
-    _convert_xlabs_flux_lora_to_diffusers,
-    _maybe_map_sgm_blocks_to_diffusers,
-)
+from .lora_conversion_utils import _convert_non_diffusers_lora_to_diffusers, _maybe_map_sgm_blocks_to_diffusers


 if is_transformers_available():
@@ -1588,20 +1583,6 @@ class FluxLoraLoaderMixin(LoraBaseMixin):
            allow_pickle=allow_pickle,
        )

-        # TODO (sayakpaul): to a follow-up to clean and try to unify the conditions.
-
-        is_kohya = any(".lora_down.weight" in k for k in state_dict)
-        if is_kohya:
-            state_dict = _convert_kohya_flux_lora_to_diffusers(state_dict)
-            # Kohya already takes care of scaling the LoRA parameters with alpha.
-            return (state_dict, None) if return_alphas else state_dict
-
-        is_xlabs = any("processor" in k for k in state_dict)
-        if is_xlabs:
-            state_dict = _convert_xlabs_flux_lora_to_diffusers(state_dict)
-            # xlabs doesn't use `alpha`.
-            return (state_dict, None) if return_alphas else state_dict
-
        # For state dicts like
        # https://huggingface.co/TheLastBen/Jon_Snow_Flux_LoRA
        keys = list(state_dict.keys())
@@ -91,11 +91,11 @@ DIFFUSERS_DEFAULT_PIPELINE_PATHS = {
    "xl_inpaint": {"pretrained_model_name_or_path": "diffusers/stable-diffusion-xl-1.0-inpainting-0.1"},
    "playground-v2-5": {"pretrained_model_name_or_path": "playgroundai/playground-v2.5-1024px-aesthetic"},
    "upscale": {"pretrained_model_name_or_path": "stabilityai/stable-diffusion-x4-upscaler"},
-    "inpainting": {"pretrained_model_name_or_path": "Lykon/dreamshaper-8-inpainting"},
+    "inpainting": {"pretrained_model_name_or_path": "runwayml/stable-diffusion-inpainting"},
    "inpainting_v2": {"pretrained_model_name_or_path": "stabilityai/stable-diffusion-2-inpainting"},
    "controlnet": {"pretrained_model_name_or_path": "lllyasviel/control_v11p_sd15_canny"},
    "v2": {"pretrained_model_name_or_path": "stabilityai/stable-diffusion-2-1"},
-    "v1": {"pretrained_model_name_or_path": "Lykon/dreamshaper-8"},
+    "v1": {"pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5"},
    "stable_cascade_stage_b": {"pretrained_model_name_or_path": "stabilityai/stable-cascade", "subfolder": "decoder"},
    "stable_cascade_stage_b_lite": {
        "pretrained_model_name_or_path": "stabilityai/stable-cascade",
@@ -972,32 +972,15 @@ class FreeNoiseTransformerBlock(nn.Module):
        return frame_indices

    def _get_frame_weights(self, num_frames: int, weighting_scheme: str = "pyramid") -> List[float]:
-        if weighting_scheme == "flat":
-            weights = [1.0] * num_frames
-
-        elif weighting_scheme == "pyramid":
+        if weighting_scheme == "pyramid":
            if num_frames % 2 == 0:
                # num_frames = 4 => [1, 2, 2, 1]
-                mid = num_frames // 2
-                weights = list(range(1, mid + 1))
+                weights = list(range(1, num_frames // 2 + 1))
                weights = weights + weights[::-1]
            else:
                # num_frames = 5 => [1, 2, 3, 2, 1]
-                mid = (num_frames + 1) // 2
-                weights = list(range(1, mid))
-                weights = weights + [mid] + weights[::-1]
-
-        elif weighting_scheme == "delayed_reverse_sawtooth":
-            if num_frames % 2 == 0:
-                # num_frames = 4 => [0.01, 2, 2, 1]
-                mid = num_frames // 2
-                weights = [0.01] * (mid - 1) + [mid]
-                weights = weights + list(range(mid, 0, -1))
-            else:
-                # num_frames = 5 => [0.01, 0.01, 3, 2, 1]
-                mid = (num_frames + 1) // 2
-                weights = [0.01] * mid
-                weights = weights + list(range(mid, 0, -1))
+                weights = list(range(1, num_frames // 2 + 1))
+                weights = weights + [num_frames // 2 + 1] + weights[::-1]
        else:
            raise ValueError(f"Unsupported value for weighting_scheme={weighting_scheme}")
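Note on the hunk above: after this change only the "pyramid" scheme remains. The following is a minimal standalone sketch of that weighting, not the library's method itself; the function name `pyramid_frame_weights` is hypothetical and the logic is transcribed from the added lines.

```python
from typing import List


def pyramid_frame_weights(num_frames: int) -> List[int]:
    """Symmetric weights that rise toward the middle of the frame window."""
    # Ascending half: 1, 2, ..., num_frames // 2
    half = list(range(1, num_frames // 2 + 1))
    if num_frames % 2 == 0:
        # num_frames = 4 => [1, 2, 2, 1]
        return half + half[::-1]
    # num_frames = 5 => [1, 2, 3, 2, 1]
    return half + [num_frames // 2 + 1] + half[::-1]


print(pyramid_frame_weights(4))
print(pyramid_frame_weights(5))
```

Frames near the center of a window get the largest weight, so overlapping windows blend mostly from their central frames.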

@@ -691,6 +691,7 @@ class SparseControlNetModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):

        emb = self.time_embedding(t_emb, timestep_cond)
        emb = emb.repeat_interleave(sample_num_frames, dim=0)
+        encoder_hidden_states = encoder_hidden_states.repeat_interleave(sample_num_frames, dim=0)

        # 2. pre-process
        batch_size, channels, num_frames, height, width = sample.shape
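Note on the hunk above: the added line repeats the text conditioning once per frame, in the same way as `emb`. The sketch below uses plain Python lists to show the ordering this produces, assuming (the hunk does not show it) that the frames of one video end up adjacent in the flattened batch dimension. The helper names are hypothetical stand-ins for `torch.repeat_interleave(..., dim=0)` and a whole-batch tiling.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def repeat_interleave(rows: Sequence[T], repeats: int) -> List[T]:
    # Each row is repeated `repeats` times before moving on to the next row,
    # which is the ordering of torch.repeat_interleave along dim 0.
    return [row for row in rows for _ in range(repeats)]


def tile(rows: Sequence[T], repeats: int) -> List[T]:
    # The whole sequence is repeated back to back, shown here for contrast.
    return [row for _ in range(repeats) for row in rows]


prompts = ["cat", "dog"]  # stand-ins for two per-prompt embeddings
num_frames = 3

print(repeat_interleave(prompts, num_frames))
print(tile(prompts, num_frames))
```

With interleaving, rows 0..2 carry the first prompt and rows 3..5 the second, so each frame of a video is paired with its own prompt embedding; tiling would mix prompts across the frames of one video.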
@@ -514,7 +514,7 @@ def get_1d_rotary_pos_embed(
    linear_factor=1.0,
    ntk_factor=1.0,
    repeat_interleave_real=True,
-    freqs_dtype=torch.float32,  #  torch.float32, torch.float64 (flux)
+    freqs_dtype=torch.float32,  # torch.float32 (hunyuan, stable audio), torch.float64 (flux)
 ):
    """
    Precompute the frequency tensor for complex exponentials (cis) with given dimensions.
@@ -545,27 +545,21 @@ def get_1d_rotary_pos_embed(
    assert dim % 2 == 0

    if isinstance(pos, int):
-        pos = torch.arange(pos)
-    if isinstance(pos, np.ndarray):
-        pos = torch.from_numpy(pos)  # type: ignore  # [S]
-
+        pos = np.arange(pos)
    theta = theta * ntk_factor
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=freqs_dtype)[: (dim // 2)] / dim)) / linear_factor  # [D/2]
-    freqs = freqs.to(pos.device)
-    freqs = torch.outer(pos, freqs)  # type: ignore   # [S, D/2]
+    t = torch.from_numpy(pos).to(freqs.device)  # type: ignore  # [S]
+    freqs = torch.outer(t, freqs)  # type: ignore   # [S, D/2]
    if use_real and repeat_interleave_real:
-        # flux, hunyuan-dit, cogvideox
        freqs_cos = freqs.cos().repeat_interleave(2, dim=1).float()  # [S, D]
        freqs_sin = freqs.sin().repeat_interleave(2, dim=1).float()  # [S, D]
        return freqs_cos, freqs_sin
    elif use_real:
-        # stable audio
        freqs_cos = torch.cat([freqs.cos(), freqs.cos()], dim=-1).float()  # [S, D]
        freqs_sin = torch.cat([freqs.sin(), freqs.sin()], dim=-1).float()  # [S, D]
        return freqs_cos, freqs_sin
    else:
-        # lumina
-        freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64     # [S, D/2]
+        freqs_cis = torch.polar(torch.ones_like(freqs), freqs).float()  # complex64     # [S, D/2]
        return freqs_cis


@@ -596,11 +590,11 @@ def apply_rotary_emb(
        cos, sin = cos.to(x.device), sin.to(x.device)

        if use_real_unbind_dim == -1:
-            # Used for flux, cogvideox, hunyuan-dit
+            # Use for example in Lumina
            x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)  # [B, S, H, D//2]
            x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
        elif use_real_unbind_dim == -2:
-            # Used for Stable Audio
+            # Use for example in Stable Audio
            x_real, x_imag = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)  # [B, S, H, D//2]
            x_rotated = torch.cat([-x_imag, x_real], dim=-1)
        else:
@@ -610,7 +604,6 @@ def apply_rotary_emb(

        return out
    else:
-        # used for lumina
        x_rotated = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        freqs_cis = freqs_cis.unsqueeze(2)
        x_out = torch.view_as_real(x_rotated * freqs_cis).flatten(3)
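Note on the rotary-embedding hunks above: the sketch below reproduces the `use_real_unbind_dim == -1` path for a single vector in pure Python. It assumes default `linear_factor` and `ntk_factor` of 1.0 and assumes the final combination is `x * cos + x_rotated * sin`, which these hunks do not show. The function names are hypothetical.

```python
import math
from typing import List


def rotary_angles(pos: int, dim: int, theta: float = 10000.0) -> List[float]:
    """Angles pos / theta**(i / dim) for i = 0, 2, ..., dim - 2."""
    assert dim % 2 == 0
    return [pos / (theta ** (i / dim)) for i in range(0, dim, 2)]


def apply_rotary_interleaved(x: List[float], pos: int, theta: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the i-th angle."""
    out: List[float] = []
    for i, angle in enumerate(rotary_angles(pos, len(x), theta)):
        real, imag = x[2 * i], x[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        # Equivalent to x * cos + [-imag, real] * sin on the pair.
        out += [real * c - imag * s, imag * c + real * s]
    return out


vec = [1.0, 0.0, 0.5, -2.0]
print(apply_rotary_interleaved(vec, pos=0))  # position 0 leaves the vector unchanged
print(apply_rotary_interleaved(vec, pos=7))
```

Because each pair is only rotated, the vector norm is preserved at every position; only the relative phase between positions changes.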
@@ -629,7 +622,7 @@ class FluxPosEmbed(nn.Module):
        n_axes = ids.shape[-1]
        cos_out = []
        sin_out = []
-        pos = ids.squeeze().float()
+        pos = ids.squeeze().float().cpu().numpy()
        is_mps = ids.device.type == "mps"
        freqs_dtype = torch.float32 if is_mps else torch.float64
        for i in range(n_axes):
@@ -116,7 +116,7 @@ class AnimateDiffTransformer3D(nn.Module):

        self.in_channels = in_channels

-        self.norm = nn.GroupNorm(num_groups=norm_num_groups, num_channels=in_channels, eps=1e-6, affine=True)
+        self.norm = torch.nn.GroupNorm(num_groups=norm_num_groups, num_channels=in_channels, eps=1e-6, affine=True)
        self.proj_in = nn.Linear(in_channels, inner_dim)

        # 3. Define transformers blocks
@@ -2178,6 +2178,7 @@ class UNetMotionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin, Peft

        emb = emb if aug_emb is None else emb + aug_emb
        emb = emb.repeat_interleave(repeats=num_frames, dim=0)
+        encoder_hidden_states = encoder_hidden_states.repeat_interleave(repeats=num_frames, dim=0)

        if self.encoder_hid_proj is not None and self.config.encoder_hid_dim_type == "ip_image_proj":
            if "image_embeds" not in added_cond_kwargs:
@@ -432,6 +432,7 @@ class AnimateDiffPipeline(
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

+    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.check_inputs
    def check_inputs(
        self,
        prompt,
@@ -469,8 +470,8 @@ class AnimateDiffPipeline(
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
-        elif prompt is not None and not isinstance(prompt, (str, list, dict)):
-            raise ValueError(f"`prompt` has to be of type `str`, `list` or `dict` but is {type(prompt)=}")
+        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
+            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
@@ -556,15 +557,11 @@ class AnimateDiffPipeline(
    def num_timesteps(self):
        return self._num_timesteps

-    @property
-    def interrupt(self):
-        return self._interrupt
-
    @torch.no_grad()
    @replace_example_docstring(EXAMPLE_DOC_STRING)
    def __call__(
        self,
-        prompt: Optional[Union[str, List[str]]] = None,
+        prompt: Union[str, List[str]] = None,
        num_frames: Optional[int] = 16,
        height: Optional[int] = None,
        width: Optional[int] = None,
@@ -704,10 +701,9 @@ class AnimateDiffPipeline(
        self._guidance_scale = guidance_scale
        self._clip_skip = clip_skip
        self._cross_attention_kwargs = cross_attention_kwargs
-        self._interrupt = False

        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, (str, dict)):
+        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
@@ -720,39 +716,22 @@ class AnimateDiffPipeline(
        text_encoder_lora_scale = (
            self.cross_attention_kwargs.get("scale", None) if self.cross_attention_kwargs is not None else None
        )
-        if self.free_noise_enabled:
-            prompt_embeds, negative_prompt_embeds = self._encode_prompt_free_noise(
-                prompt=prompt,
-                num_frames=num_frames,
-                device=device,
-                num_videos_per_prompt=num_videos_per_prompt,
-                do_classifier_free_guidance=self.do_classifier_free_guidance,
-                negative_prompt=negative_prompt,
-                prompt_embeds=prompt_embeds,
-                negative_prompt_embeds=negative_prompt_embeds,
-                lora_scale=text_encoder_lora_scale,
-                clip_skip=self.clip_skip,
-            )
-        else:
-            prompt_embeds, negative_prompt_embeds = self.encode_prompt(
-                prompt,
-                device,
-                num_videos_per_prompt,
-                self.do_classifier_free_guidance,
-                negative_prompt,
-                prompt_embeds=prompt_embeds,
-                negative_prompt_embeds=negative_prompt_embeds,
-                lora_scale=text_encoder_lora_scale,
-                clip_skip=self.clip_skip,
-            )
-
-            # For classifier free guidance, we need to do two forward passes.
-            # Here we concatenate the unconditional and text embeddings into a single batch
-            # to avoid doing two forward passes
-            if self.do_classifier_free_guidance:
-                prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
-
-            prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
+        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
+            prompt,
+            device,
+            num_videos_per_prompt,
+            self.do_classifier_free_guidance,
+            negative_prompt,
+            prompt_embeds=prompt_embeds,
+            negative_prompt_embeds=negative_prompt_embeds,
+            lora_scale=text_encoder_lora_scale,
+            clip_skip=self.clip_skip,
+        )
+        # For classifier free guidance, we need to do two forward passes.
+        # Here we concatenate the unconditional and text embeddings into a single batch
+        # to avoid doing two forward passes
+        if self.do_classifier_free_guidance:
+            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
@@ -804,9 +783,6 @@ class AnimateDiffPipeline(
            # 8. Denoising loop
            with self.progress_bar(total=self._num_timesteps) as progress_bar:
                for i, t in enumerate(timesteps):
-                    if self.interrupt:
-                        continue
-
                    # expand the latents if we are doing classifier free guidance
                    latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
@@ -505,8 +505,8 @@ class AnimateDiffControlNetPipeline(
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
-        elif prompt is not None and not isinstance(prompt, (str, list, dict)):
-            raise ValueError(f"`prompt` has to be of type `str`, `list` or `dict` but is {type(prompt)}")
+        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
+            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
@@ -699,10 +699,6 @@ class AnimateDiffControlNetPipeline(
    def num_timesteps(self):
        return self._num_timesteps

-    @property
-    def interrupt(self):
-        return self._interrupt
-
    @torch.no_grad()
    def __call__(
        self,
@@ -862,10 +858,9 @@ class AnimateDiffControlNetPipeline(
        self._guidance_scale = guidance_scale
        self._clip_skip = clip_skip
        self._cross_attention_kwargs = cross_attention_kwargs
-        self._interrupt = False

        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, (str, dict)):
+        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
@@ -888,39 +883,22 @@ class AnimateDiffControlNetPipeline(
        text_encoder_lora_scale = (
            cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
        )
-        if self.free_noise_enabled:
-            prompt_embeds, negative_prompt_embeds = self._encode_prompt_free_noise(
-                prompt=prompt,
-                num_frames=num_frames,
-                device=device,
-                num_videos_per_prompt=num_videos_per_prompt,
-                do_classifier_free_guidance=self.do_classifier_free_guidance,
-                negative_prompt=negative_prompt,
-                prompt_embeds=prompt_embeds,
-                negative_prompt_embeds=negative_prompt_embeds,
-                lora_scale=text_encoder_lora_scale,
-                clip_skip=self.clip_skip,
-            )
-        else:
-            prompt_embeds, negative_prompt_embeds = self.encode_prompt(
-                prompt,
-                device,
-                num_videos_per_prompt,
-                self.do_classifier_free_guidance,
-                negative_prompt,
-                prompt_embeds=prompt_embeds,
-                negative_prompt_embeds=negative_prompt_embeds,
-                lora_scale=text_encoder_lora_scale,
-                clip_skip=self.clip_skip,
-            )
-
-            # For classifier free guidance, we need to do two forward passes.
-            # Here we concatenate the unconditional and text embeddings into a single batch
-            # to avoid doing two forward passes
-            if self.do_classifier_free_guidance:
-                prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
-
-            prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
+        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
+            prompt,
+            device,
+            num_videos_per_prompt,
+            self.do_classifier_free_guidance,
+            negative_prompt,
+            prompt_embeds=prompt_embeds,
+            negative_prompt_embeds=negative_prompt_embeds,
+            lora_scale=text_encoder_lora_scale,
+            clip_skip=self.clip_skip,
+        )
+        # For classifier free guidance, we need to do two forward passes.
+        # Here we concatenate the unconditional and text embeddings into a single batch
+        # to avoid doing two forward passes
+        if self.do_classifier_free_guidance:
+            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
@@ -1012,9 +990,6 @@ class AnimateDiffControlNetPipeline(
            # 8. Denoising loop
            with self.progress_bar(total=self._num_timesteps) as progress_bar:
                for i, t in enumerate(timesteps):
-                    if self.interrupt:
-                        continue
-
                    # expand the latents if we are doing classifier free guidance
                    latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
@@ -1027,6 +1002,7 @@ class AnimateDiffControlNetPipeline(
                    else:
                        control_model_input = latent_model_input
                        controlnet_prompt_embeds = prompt_embeds
+                    controlnet_prompt_embeds = controlnet_prompt_embeds.repeat_interleave(num_frames, dim=0)

                    if isinstance(controlnet_keep[i], list):
                        cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
@@ -1143,8 +1143,6 @@ class AnimateDiffSDXLPipeline(
            add_text_embeds = torch.cat([negative_pooled_prompt_embeds, add_text_embeds], dim=0)
            add_time_ids = torch.cat([negative_add_time_ids, add_time_ids], dim=0)

-        prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
-
        prompt_embeds = prompt_embeds.to(device)
        add_text_embeds = add_text_embeds.to(device)
        add_time_ids = add_time_ids.to(device).repeat(batch_size * num_videos_per_prompt, 1)
@@ -878,8 +878,6 @@ class AnimateDiffSparseControlNetPipeline(
        if self.do_classifier_free_guidance:
            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

-        prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
-
        # 4. Prepare IP-Adapter embeddings
        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
@@ -246,6 +246,7 @@ class AnimateDiffVideoToVideoPipeline(
        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
        self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor)

+    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.encode_prompt with num_images_per_prompt -> num_videos_per_prompt
    def encode_prompt(
        self,
        prompt,
@@ -298,7 +299,7 @@ class AnimateDiffVideoToVideoPipeline(
            else:
                scale_lora_layers(self.text_encoder, lora_scale)

-        if prompt is not None and isinstance(prompt, (str, dict)):
+        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
@@ -581,8 +582,8 @@ class AnimateDiffVideoToVideoPipeline(
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
-        elif prompt is not None and not isinstance(prompt, (str, list, dict)):
-            raise ValueError(f"`prompt` has to be of type `str`, `list` or `dict` but is {type(prompt)}")
+        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
+            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
@@ -627,20 +628,23 @@ class AnimateDiffVideoToVideoPipeline(

    def prepare_latents(
        self,
-        video: Optional[torch.Tensor] = None,
-        height: int = 64,
-        width: int = 64,
-        num_channels_latents: int = 4,
-        batch_size: int = 1,
-        timestep: Optional[int] = None,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
+        video,
+        height,
+        width,
+        num_channels_latents,
+        batch_size,
+        timestep,
+        dtype,
+        device,
+        generator,
+        latents=None,
        decode_chunk_size: int = 16,
-        add_noise: bool = False,
-    ) -> torch.Tensor:
-        num_frames = video.shape[1] if latents is None else latents.shape[2]
+    ):
+        if latents is None:
+            num_frames = video.shape[1]
+        else:
+            num_frames = latents.shape[2]
+
        shape = (
            batch_size,
            num_channels_latents,
@@ -704,13 +708,8 @@ class AnimateDiffVideoToVideoPipeline(
            if shape != latents.shape:
                # [B, C, F, H, W]
                raise ValueError(f"`latents` expected to have {shape=}, but found {latents.shape=}")
-
            latents = latents.to(device, dtype=dtype)

-            if add_noise:
-                noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-                latents = self.scheduler.add_noise(latents, noise, timestep)
-
        return latents

    @property
@@ -736,10 +735,6 @@ class AnimateDiffVideoToVideoPipeline(
    def num_timesteps(self):
        return self._num_timesteps

-    @property
-    def interrupt(self):
-        return self._interrupt
-
    @torch.no_grad()
    def __call__(
        self,
@@ -748,7 +743,6 @@ class AnimateDiffVideoToVideoPipeline(
        height: Optional[int] = None,
        width: Optional[int] = None,
        num_inference_steps: int = 50,
-        enforce_inference_steps: bool = False,
        timesteps: Optional[List[int]] = None,
        sigmas: Optional[List[float]] = None,
        guidance_scale: float = 7.5,
@@ -880,10 +874,9 @@ class AnimateDiffVideoToVideoPipeline(
        self._guidance_scale = guidance_scale
        self._clip_skip = clip_skip
        self._cross_attention_kwargs = cross_attention_kwargs
-        self._interrupt = False

        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, (str, dict)):
+        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
@@ -891,85 +884,29 @@ class AnimateDiffVideoToVideoPipeline(
            batch_size = prompt_embeds.shape[0]

        device = self._execution_device
-        dtype = self.dtype

-        # 3. Prepare timesteps
-        if not enforce_inference_steps:
-            timesteps, num_inference_steps = retrieve_timesteps(
-                self.scheduler, num_inference_steps, device, timesteps, sigmas
-            )
-            timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, timesteps, strength, device)
-            latent_timestep = timesteps[:1].repeat(batch_size * num_videos_per_prompt)
-        else:
-            denoising_inference_steps = int(num_inference_steps / strength)
-            timesteps, denoising_inference_steps = retrieve_timesteps(
-                self.scheduler, denoising_inference_steps, device, timesteps, sigmas
-            )
-            timesteps = timesteps[-num_inference_steps:]
-            latent_timestep = timesteps[:1].repeat(batch_size * num_videos_per_prompt)
-
-        # 4. Prepare latent variables
-        if latents is None:
-            video = self.video_processor.preprocess_video(video, height=height, width=width)
-            # Move the number of frames before the number of channels.
-            video = video.permute(0, 2, 1, 3, 4)
-            video = video.to(device=device, dtype=dtype)
-        num_channels_latents = self.unet.config.in_channels
-        latents = self.prepare_latents(
-            video=video,
-            height=height,
-            width=width,
-            num_channels_latents=num_channels_latents,
-            batch_size=batch_size * num_videos_per_prompt,
-            timestep=latent_timestep,
-            dtype=dtype,
-            device=device,
-            generator=generator,
-            latents=latents,
-            decode_chunk_size=decode_chunk_size,
-            add_noise=enforce_inference_steps,
-        )
-
-        # 5. Encode input prompt
+        # 3. Encode input prompt
        text_encoder_lora_scale = (
            self.cross_attention_kwargs.get("scale", None) if self.cross_attention_kwargs is not None else None
        )
-        num_frames = latents.shape[2]
-        if self.free_noise_enabled:
-            prompt_embeds, negative_prompt_embeds = self._encode_prompt_free_noise(
-                prompt=prompt,
-                num_frames=num_frames,
-                device=device,
-                num_videos_per_prompt=num_videos_per_prompt,
-                do_classifier_free_guidance=self.do_classifier_free_guidance,
-                negative_prompt=negative_prompt,
-                prompt_embeds=prompt_embeds,
-                negative_prompt_embeds=negative_prompt_embeds,
-                lora_scale=text_encoder_lora_scale,
-                clip_skip=self.clip_skip,
-            )
-        else:
-            prompt_embeds, negative_prompt_embeds = self.encode_prompt(
-                prompt,
-                device,
-                num_videos_per_prompt,
-                self.do_classifier_free_guidance,
-                negative_prompt,
-                prompt_embeds=prompt_embeds,
-                negative_prompt_embeds=negative_prompt_embeds,
-                lora_scale=text_encoder_lora_scale,
-                clip_skip=self.clip_skip,
-            )
+        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
+            prompt,
+            device,
+            num_videos_per_prompt,
+            self.do_classifier_free_guidance,
+            negative_prompt,
+            prompt_embeds=prompt_embeds,
+            negative_prompt_embeds=negative_prompt_embeds,
+            lora_scale=text_encoder_lora_scale,
+            clip_skip=self.clip_skip,
+        )

-            # For classifier free guidance, we need to do two forward passes.
-            # Here we concatenate the unconditional and text embeddings into a single batch
-            # to avoid doing two forward passes
-            if self.do_classifier_free_guidance:
-                prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
+        # For classifier free guidance, we need to do two forward passes.
+        # Here we concatenate the unconditional and text embeddings into a single batch
+        # to avoid doing two forward passes
+        if self.do_classifier_free_guidance:
+            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

-            prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
-
-        # 6. Prepare IP-Adapter embeddings
        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
                ip_adapter_image,
@@ -979,10 +916,38 @@ class AnimateDiffVideoToVideoPipeline(
                self.do_classifier_free_guidance,
            )

-        # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
+        # 4. Prepare timesteps
+        timesteps, num_inference_steps = retrieve_timesteps(
+            self.scheduler, num_inference_steps, device, timesteps, sigmas
+        )
+        timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, timesteps, strength, device)
+        latent_timestep = timesteps[:1].repeat(batch_size * num_videos_per_prompt)
+
+        # 5. Prepare latent variables
+        if latents is None:
+            video = self.video_processor.preprocess_video(video, height=height, width=width)
+            # Move the number of frames before the number of channels.
+            video = video.permute(0, 2, 1, 3, 4)
+            video = video.to(device=device, dtype=prompt_embeds.dtype)
+        num_channels_latents = self.unet.config.in_channels
+        latents = self.prepare_latents(
+            video=video,
+            height=height,
+            width=width,
+            num_channels_latents=num_channels_latents,
+            batch_size=batch_size * num_videos_per_prompt,
+            timestep=latent_timestep,
+            dtype=prompt_embeds.dtype,
+            device=device,
+            generator=generator,
+            latents=latents,
+            decode_chunk_size=decode_chunk_size,
+        )
+
+        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

-        # 8. Add image embeds for IP-Adapter
+        # 7. Add image embeds for IP-Adapter
        added_cond_kwargs = (
            {"image_embeds": image_embeds}
            if ip_adapter_image is not None or ip_adapter_image_embeds is not None
@@ -1002,12 +967,9 @@ class AnimateDiffVideoToVideoPipeline(
            self._num_timesteps = len(timesteps)
            num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order

-            # 9. Denoising loop
+            # 8. Denoising loop
            with self.progress_bar(total=self._num_timesteps) as progress_bar:
                for i, t in enumerate(timesteps):
-                    if self.interrupt:
-                        continue
-
                    # expand the latents if we are doing classifier free guidance
                    latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
@@ -1043,14 +1005,14 @@ class AnimateDiffVideoToVideoPipeline(
                    if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                        progress_bar.update()

-        # 10. Post-processing
+        # 9. Post-processing
        if output_type == "latent":
            video = latents
        else:
            video_tensor = self.decode_latents(latents, decode_chunk_size)
            video = self.video_processor.postprocess_video(video=video_tensor, output_type=output_type)

-        # 11. Offload all models
+        # 10. Offload all models
        self.maybe_free_model_hooks()

        if not return_dict:
@@ -280,7 +280,7 @@ class FluxPipeline(DiffusionPipeline, FluxLoraLoaderMixin, FromSingleFileMixin):
        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)

        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt)
+        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
        prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, -1)

        return prompt_embeds
@@ -536,7 +536,7 @@ class FluxPipeline(DiffusionPipeline, FluxLoraLoaderMixin, FromSingleFileMixin):
        width: Optional[int] = None,
        num_inference_steps: int = 28,
        timesteps: List[int] = None,
-        guidance_scale: float = 3.5,
+        guidance_scale: float = 7.0,
        num_images_per_prompt: Optional[int] = 1,
        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
        latents: Optional[torch.FloatTensor] = None,
@@ -302,7 +302,7 @@ class FluxControlNetPipeline(DiffusionPipeline, FluxLoraLoaderMixin, FromSingleF
        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)

        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt)
+        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
        prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, -1)

        return prompt_embeds
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from typing import Callable, Dict, Optional, Union
+from typing import Optional, Union

 import torch

@@ -22,7 +22,6 @@ from ..models.unets.unet_motion_model import (
    DownBlockMotion,
    UpBlockMotion,
 )
-from ..pipelines.pipeline_utils import DiffusionPipeline
 from ..utils import logging
 from ..utils.torch_utils import randn_tensor

@@ -99,142 +98,6 @@ class AnimateDiffFreeNoiseMixin:
                        free_noise_transfomer_block.state_dict(), strict=True
                    )

-    def _check_inputs_free_noise(
-        self,
-        prompt,
-        negative_prompt,
-        prompt_embeds,
-        negative_prompt_embeds,
-        num_frames,
-    ) -> None:
-        if not isinstance(prompt, (str, dict)):
-            raise ValueError(f"Expected `prompt` to have type `str` or `dict` but found {type(prompt)=}")
-
-        if negative_prompt is not None:
-            if not isinstance(negative_prompt, (str, dict)):
-                raise ValueError(
-                    f"Expected `negative_prompt` to have type `str` or `dict` but found {type(negative_prompt)=}"
-                )
-
-        if prompt_embeds is not None or negative_prompt_embeds is not None:
-            raise ValueError("`prompt_embeds` and `negative_prompt_embeds` is not supported in FreeNoise yet.")
-
-        frame_indices = [isinstance(x, int) for x in prompt.keys()]
-        frame_prompts = [isinstance(x, str) for x in prompt.values()]
-        min_frame = min(list(prompt.keys()))
-        max_frame = max(list(prompt.keys()))
-
-        if not all(frame_indices):
-            raise ValueError("Expected integer keys in `prompt` dict for FreeNoise.")
-        if not all(frame_prompts):
-            raise ValueError("Expected str values in `prompt` dict for FreeNoise.")
-        if min_frame != 0:
-            raise ValueError("The minimum frame index in `prompt` dict must be 0 as a starting prompt is necessary.")
-        if max_frame >= num_frames:
-            raise ValueError(
-                f"The maximum frame index in `prompt` dict must be lesser than {num_frames=} and follow 0-based indexing."
-            )
-
-    def _encode_prompt_free_noise(
-        self,
-        prompt: Union[str, Dict[int, str]],
-        num_frames: int,
-        device: torch.device,
-        num_videos_per_prompt: int,
-        do_classifier_free_guidance: bool,
-        negative_prompt: Optional[Union[str, Dict[int, str]]] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        lora_scale: Optional[float] = None,
-        clip_skip: Optional[int] = None,
-    ) -> torch.Tensor:
-        if negative_prompt is None:
-            negative_prompt = ""
-
-        # Ensure that we have a dictionary of prompts
-        if isinstance(prompt, str):
-            prompt = {0: prompt}
-        if isinstance(negative_prompt, str):
-            negative_prompt = {0: negative_prompt}
-
-        self._check_inputs_free_noise(prompt, negative_prompt, prompt_embeds, negative_prompt_embeds, num_frames)
-
-        # Sort the prompts based on frame indices
-        prompt = dict(sorted(prompt.items()))
-        negative_prompt = dict(sorted(negative_prompt.items()))
-
-        # Ensure that we have a prompt for the last frame index
-        prompt[num_frames - 1] = prompt[list(prompt.keys())[-1]]
-        negative_prompt[num_frames - 1] = negative_prompt[list(negative_prompt.keys())[-1]]
-
-        frame_indices = list(prompt.keys())
-        frame_prompts = list(prompt.values())
-        frame_negative_indices = list(negative_prompt.keys())
-        frame_negative_prompts = list(negative_prompt.values())
-
-        # Generate and interpolate positive prompts
-        prompt_embeds, _ = self.encode_prompt(
-            prompt=frame_prompts,
-            device=device,
-            num_images_per_prompt=num_videos_per_prompt,
-            do_classifier_free_guidance=False,
-            negative_prompt=None,
-            prompt_embeds=None,
-            negative_prompt_embeds=None,
-            lora_scale=lora_scale,
-            clip_skip=clip_skip,
-        )
-
-        shape = (num_frames, *prompt_embeds.shape[1:])
-        prompt_interpolation_embeds = prompt_embeds.new_zeros(shape)
-
-        for i in range(len(frame_indices) - 1):
-            start_frame = frame_indices[i]
-            end_frame = frame_indices[i + 1]
-            start_tensor = prompt_embeds[i].unsqueeze(0)
-            end_tensor = prompt_embeds[i + 1].unsqueeze(0)
-
-            prompt_interpolation_embeds[start_frame : end_frame + 1] = self._free_noise_prompt_interpolation_callback(
-                start_frame, end_frame, start_tensor, end_tensor
-            )
-
-        # Generate and interpolate negative prompts
-        negative_prompt_embeds = None
-        negative_prompt_interpolation_embeds = None
-
-        if do_classifier_free_guidance:
-            _, negative_prompt_embeds = self.encode_prompt(
-                prompt=[""] * len(frame_negative_prompts),
-                device=device,
-                num_images_per_prompt=num_videos_per_prompt,
-                do_classifier_free_guidance=True,
-                negative_prompt=frame_negative_prompts,
-                prompt_embeds=None,
-                negative_prompt_embeds=None,
-                lora_scale=lora_scale,
-                clip_skip=clip_skip,
-            )
-
-            negative_prompt_interpolation_embeds = negative_prompt_embeds.new_zeros(shape)
-
-            for i in range(len(frame_negative_indices) - 1):
-                start_frame = frame_negative_indices[i]
-                end_frame = frame_negative_indices[i + 1]
-                start_tensor = negative_prompt_embeds[i].unsqueeze(0)
-                end_tensor = negative_prompt_embeds[i + 1].unsqueeze(0)
-
-                negative_prompt_interpolation_embeds[
-                    start_frame : end_frame + 1
-                ] = self._free_noise_prompt_interpolation_callback(start_frame, end_frame, start_tensor, end_tensor)
-
-        prompt_embeds = prompt_interpolation_embeds
-        negative_prompt_embeds = negative_prompt_interpolation_embeds
-
-        if do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
-
-        return prompt_embeds, negative_prompt_embeds
-
    def _prepare_latents_free_noise(
        self,
        batch_size: int,
@@ -309,29 +172,12 @@ class AnimateDiffFreeNoiseMixin:
        latents = latents[:, :, :num_frames]
        return latents

-    def _lerp(
-        self, start_index: int, end_index: int, start_tensor: torch.Tensor, end_tensor: torch.Tensor
-    ) -> torch.Tensor:
-        num_indices = end_index - start_index + 1
-        interpolated_tensors = []
-
-        for i in range(num_indices):
-            alpha = i / (num_indices - 1)
-            interpolated_tensor = (1 - alpha) * start_tensor + alpha * end_tensor
-            interpolated_tensors.append(interpolated_tensor)
-
-        interpolated_tensors = torch.cat(interpolated_tensors)
-        return interpolated_tensors
-
    def enable_free_noise(
        self,
        context_length: Optional[int] = 16,
        context_stride: int = 4,
        weighting_scheme: str = "pyramid",
        noise_type: str = "shuffle_context",
-        prompt_interpolation_callback: Optional[
-            Callable[[DiffusionPipeline, int, int, torch.Tensor, torch.Tensor], torch.Tensor]
-        ] = None,
    ) -> None:
        r"""
        Enable long video generation using FreeNoise.
@@ -349,27 +195,13 @@ class AnimateDiffFreeNoiseMixin:
            weighting_scheme (`str`, defaults to `pyramid`):
                Weighting scheme for averaging latents after accumulation in FreeNoise blocks. The following weighting
                schemes are supported currently:
-                    - "flat"
-                       Performs weighting averaging with a flat weight pattern: [1, 1, 1, 1, 1].
                    - "pyramid"
-                        Performs weighted averaging with a pyramid like weight pattern: [1, 2, 3, 2, 1].
-                    - "delayed_reverse_sawtooth"
-                        Performs weighted averaging with low weights for earlier frames and high-to-low weights for
-                        later frames: [0.01, 0.01, 3, 2, 1].
+                        Peforms weighted averaging with a pyramid like weight pattern: [1, 2, 3, 2, 1].
            noise_type (`str`, defaults to "shuffle_context"):
-                Must be one of ["shuffle_context", "repeat_context", "random"].
-                    - "shuffle_context"
-                        Shuffles a fixed batch of `context_length` latents to create a final latent of size
-                        `num_frames`. This is usually the best setting for most generation scenarious. However, there
-                        might be visible repetition noticeable in the kinds of motion/animation generated.
-                    - "repeated_context"
-                        Repeats a fixed batch of `context_length` latents to create a final latent of size
-                        `num_frames`.
-                    - "random"
-                        The final latents are random without any repetition.
+                TODO
        """

-        allowed_weighting_scheme = ["flat", "pyramid", "delayed_reverse_sawtooth"]
+        allowed_weighting_scheme = ["pyramid"]
        allowed_noise_type = ["shuffle_context", "repeat_context", "random"]

        if context_length > self.motion_adapter.config.motion_max_seq_length:
@@ -387,25 +219,14 @@ class AnimateDiffFreeNoiseMixin:
        self._free_noise_context_stride = context_stride
        self._free_noise_weighting_scheme = weighting_scheme
        self._free_noise_noise_type = noise_type
-        self._free_noise_prompt_interpolation_callback = prompt_interpolation_callback or self._lerp
-
-        if hasattr(self.unet.mid_block, "motion_modules"):
-            blocks = [*self.unet.down_blocks, self.unet.mid_block, *self.unet.up_blocks]
-        else:
-            blocks = [*self.unet.down_blocks, *self.unet.up_blocks]

+        blocks = [*self.unet.down_blocks, self.unet.mid_block, *self.unet.up_blocks]
        for block in blocks:
            self._enable_free_noise_in_block(block)

    def disable_free_noise(self) -> None:
-        r"""Disable the FreeNoise sampling mechanism."""
        self._free_noise_context_length = None

-        if hasattr(self.unet.mid_block, "motion_modules"):
-            blocks = [*self.unet.down_blocks, self.unet.mid_block, *self.unet.up_blocks]
-        else:
-            blocks = [*self.unet.down_blocks, *self.unet.up_blocks]
-
        blocks = [*self.unet.down_blocks, self.unet.mid_block, *self.unet.up_blocks]
        for block in blocks:
            self._disable_free_noise_in_block(block)
@@ -734,8 +734,6 @@ class AnimateDiffPAGPipeline(
        elif self.do_classifier_free_guidance:
            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

-        prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
-
        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            ip_adapter_image_embeds = self.prepare_ip_adapter_image_embeds(
                ip_adapter_image,
@@ -807,9 +805,7 @@ class AnimateDiffPAGPipeline(
            with self.progress_bar(total=self._num_timesteps) as progress_bar:
                for i, t in enumerate(timesteps):
                    # expand the latents if we are doing classifier free guidance
-                    latent_model_input = torch.cat(
-                        [latents] * (prompt_embeds.shape[0] // num_frames // latents.shape[0])
-                    )
+                    latent_model_input = torch.cat([latents] * (prompt_embeds.shape[0] // latents.shape[0]))
                    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                    # predict the noise residual
@@ -824,8 +824,6 @@ class PIAPipeline(
        if self.do_classifier_free_guidance:
            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

-        prompt_embeds = prompt_embeds.repeat_interleave(repeats=num_frames, dim=0)
-
        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
            image_embeds = self.prepare_ip_adapter_image_embeds(
                ip_adapter_image,
@@ -418,11 +418,11 @@ class EMAModel:
        one_minus_decay = 1 - decay

        context_manager = contextlib.nullcontext
-        if is_transformers_available() and transformers.integrations.deepspeed.is_deepspeed_zero3_enabled():
+        if is_transformers_available() and transformers.deepspeed.is_deepspeed_zero3_enabled():
            import deepspeed

        if self.foreach:
-            if is_transformers_available() and transformers.integrations.deepspeed.is_deepspeed_zero3_enabled():
+            if is_transformers_available() and transformers.deepspeed.is_deepspeed_zero3_enabled():
                context_manager = deepspeed.zero.GatheredParameters(parameters, modifier_rank=None)

            with context_manager():
@@ -444,7 +444,7 @@ class EMAModel:

        else:
            for s_param, param in zip(self.shadow_params, parameters):
-                if is_transformers_available() and transformers.integrations.deepspeed.is_deepspeed_zero3_enabled():
+                if is_transformers_available() and transformers.deepspeed.is_deepspeed_zero3_enabled():
                    context_manager = deepspeed.zero.GatheredParameters(param, modifier_rank=None)

                with context_manager():
@@ -417,9 +417,6 @@ class ModelTesterMixin:

    @require_torch_gpu
    def test_set_attn_processor_for_determinism(self):
-        if self.uses_custom_attn_processor:
-            return
-
        torch.use_deterministic_algorithms(False)
        if self.forward_requires_fresh_args:
            model = self.model_class(**self.init_dict)
@@ -32,9 +32,6 @@ class FluxTransformerTests(ModelTesterMixin, unittest.TestCase):
    # We override the items here because the transformer under consideration is small.
    model_split_percents = [0.7, 0.6, 0.6]

-    # Skip setting testing with default: AttnProcessor
-    uses_custom_attn_processor = True
-
    @property
    def dummy_input(self):
        batch_size = 1
@@ -51,7 +51,7 @@ class UNetMotionModelTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase)

        noise = floats_tensor((batch_size, num_channels, num_frames) + sizes).to(torch_device)
        time_step = torch.tensor([10]).to(torch_device)
-        encoder_hidden_states = floats_tensor((batch_size * num_frames, 4, 16)).to(torch_device)
+        encoder_hidden_states = floats_tensor((batch_size, 4, 16)).to(torch_device)

        return {"sample": noise, "timestep": time_step, "encoder_hidden_states": encoder_hidden_states}

@@ -460,29 +460,6 @@ class AnimateDiffPipelineFastTests(
                    "Disabling of FreeNoise should lead to results similar to the default pipeline results",
                )

-    def test_free_noise_multi_prompt(self):
-        components = self.get_dummy_components()
-        pipe: AnimateDiffPipeline = self.pipeline_class(**components)
-        pipe.set_progress_bar_config(disable=None)
-        pipe.to(torch_device)
-
-        context_length = 8
-        context_stride = 4
-        pipe.enable_free_noise(context_length, context_stride)
-
-        # Make sure that pipeline works when prompt indices are within num_frames bounds
-        inputs = self.get_dummy_inputs(torch_device)
-        inputs["prompt"] = {0: "Caterpillar on a leaf", 10: "Butterfly on a leaf"}
-        inputs["num_frames"] = 16
-        pipe(**inputs).frames[0]
-
-        with self.assertRaises(ValueError):
-            # Ensure that prompt indices are within bounds
-            inputs = self.get_dummy_inputs(torch_device)
-            inputs["num_frames"] = 16
-            inputs["prompt"] = {0: "Caterpillar on a leaf", 10: "Butterfly on a leaf", 42: "Error on a leaf"}
-            pipe(**inputs).frames[0]
-
    @unittest.skipIf(
        torch_device != "cuda" or not is_xformers_available(),
        reason="XFormers attention is only available with CUDA and `xformers` installed",
@@ -476,27 +476,6 @@ class AnimateDiffControlNetPipelineFastTests(
                    "Disabling of FreeNoise should lead to results similar to the default pipeline results",
                )

-    def test_free_noise_multi_prompt(self):
-        components = self.get_dummy_components()
-        pipe: AnimateDiffControlNetPipeline = self.pipeline_class(**components)
-        pipe.set_progress_bar_config(disable=None)
-        pipe.to(torch_device)
-
-        context_length = 8
-        context_stride = 4
-        pipe.enable_free_noise(context_length, context_stride)
-
-        # Make sure that pipeline works when prompt indices are within num_frames bounds
-        inputs = self.get_dummy_inputs(torch_device, num_frames=16)
-        inputs["prompt"] = {0: "Caterpillar on a leaf", 10: "Butterfly on a leaf"}
-        pipe(**inputs).frames[0]
-
-        with self.assertRaises(ValueError):
-            # Ensure that prompt indices are within bounds
-            inputs = self.get_dummy_inputs(torch_device, num_frames=16)
-            inputs["prompt"] = {0: "Caterpillar on a leaf", 10: "Butterfly on a leaf", 42: "Error on a leaf"}
-            pipe(**inputs).frames[0]
-
    def test_vae_slicing(self, video_count=2):
        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
        components = self.get_dummy_components()
@@ -491,28 +491,3 @@ class AnimateDiffVideoToVideoPipelineFastTests(
                    1e-4,
                    "Disabling of FreeNoise should lead to results similar to the default pipeline results",
                )
-
-    def test_free_noise_multi_prompt(self):
-        components = self.get_dummy_components()
-        pipe: AnimateDiffVideoToVideoPipeline = self.pipeline_class(**components)
-        pipe.set_progress_bar_config(disable=None)
-        pipe.to(torch_device)
-
-        context_length = 8
-        context_stride = 4
-        pipe.enable_free_noise(context_length, context_stride)
-
-        # Make sure that pipeline works when prompt indices are within num_frames bounds
-        inputs = self.get_dummy_inputs(torch_device, num_frames=16)
-        inputs["prompt"] = {0: "Caterpillar on a leaf", 10: "Butterfly on a leaf"}
-        inputs["num_inference_steps"] = 2
-        inputs["strength"] = 0.5
-        pipe(**inputs).frames[0]
-
-        with self.assertRaises(ValueError):
-            # Ensure that prompt indices are within bounds
-            inputs = self.get_dummy_inputs(torch_device, num_frames=16)
-            inputs["num_inference_steps"] = 2
-            inputs["strength"] = 0.5
-            inputs["prompt"] = {0: "Caterpillar on a leaf", 10: "Butterfly on a leaf", 42: "Error on a leaf"}
-            pipe(**inputs).frames[0]
@@ -25,9 +25,6 @@ class FluxPipelineFastTests(unittest.TestCase, PipelineTesterMixin):
    params = frozenset(["prompt", "height", "width", "guidance_scale", "prompt_embeds", "pooled_prompt_embeds"])
    batch_params = frozenset(["prompt"])

-    # there is no xformers processor for Flux
-    test_xformers_attention = False
-
    def get_dummy_components(self):
        torch.manual_seed(0)
        transformer = FluxTransformer2DModel(
@@ -37,7 +37,6 @@ class StableDiffusion3PAGPipelineFastTests(unittest.TestCase, PipelineTesterMixi
        ]
    )
    batch_params = frozenset(["prompt", "negative_prompt"])
-    test_xformers_attention = False

    def get_dummy_components(self):
        torch.manual_seed(0)
@@ -68,8 +68,6 @@ class StableAudioPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
            "callback_steps",
        ]
    )
-    # There is not xformers version of the StableAudioPipeline custom attention processor
-    test_xformers_attention = False

    def get_dummy_components(self):
        torch.manual_seed(0)
Author	SHA1	Message	Date
Dhruv Nair	4f3ca88cb3	update	2024-08-26 13:10:55 +00:00
Dhruv Nair	e9c4feaed1	update	2024-08-26 06:22:50 +00:00
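
Note on the removed `_lerp` helper in the `AnimateDiffFreeNoiseMixin` hunk above: it interpolates linearly between two prompt embeddings, producing one blended embedding per frame between `start_index` and `end_index` inclusive. The following is a minimal plain-Python sketch of that arithmetic, assuming lists of floats in place of tensors. The name `lerp_frames` and the single-index guard are hypothetical additions for illustration and are not part of diffusers.

```python
from typing import List


def lerp_frames(
    start_index: int, end_index: int, start: List[float], end: List[float]
) -> List[List[float]]:
    """Return one interpolated vector per frame in [start_index, end_index]."""
    num_indices = end_index - start_index + 1
    if num_indices < 2:
        # Assumption of this sketch: a single-index range yields the start vector.
        # The removed helper computes i / (num_indices - 1), which divides by zero here.
        return [list(start)]

    frames = []
    for i in range(num_indices):
        # alpha runs from 0.0 at the first frame to 1.0 at the last frame.
        alpha = i / (num_indices - 1)
        frames.append([(1 - alpha) * s + alpha * e for s, e in zip(start, end)])
    return frames


# Blend a 1-element "embedding" from 0.0 to 1.0 across frames 0..4.
frames = lerp_frames(0, 4, [0.0], [1.0])
```

In the removed multi-prompt path, a call of this kind ran once per pair of consecutive frame indices in the `prompt` dict, so `{0: ..., 10: ...}` with `num_frames=16` would interpolate across frames 0 to 10 and then across frames 10 to 15.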