Merge branch 'main' into folderize-loaders

consistency
[Hi-Dream LoRA] fix bug in validation (#11439)
2025-04-29 00:28:06 +08:00 · 2025-04-29 00:26:28 +08:00 · 2025-04-28 06:22:32 -10:00 · 2025-04-29 00:11:52 +08:00 · 2025-04-29 00:04:19 +08:00 · 2025-04-29 00:02:38 +08:00
144 changed files with 22351 additions and 11902 deletions
@@ -180,6 +180,55 @@ jobs:
        pip install slack_sdk tabulate
        python utils/log_reports.py >> $GITHUB_STEP_SUMMARY

+  run_torch_compile_tests:
+    name: PyTorch Compile CUDA tests
+
+    runs-on:
+      group: aws-g4dn-2xlarge
+
+    container:
+      image: diffusers/diffusers-pytorch-compile-cuda
+      options: --gpus 0 --shm-size "16gb" --ipc host
+
+    steps:
+    - name: Checkout diffusers
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 2
+
+    - name: NVIDIA-SMI
+      run: |
+        nvidia-smi
+    - name: Install dependencies
+      run: |
+        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
+        python -m uv pip install -e [quality,test,training]
+    - name: Environment
+      run: |
+        python utils/print_env.py
+    - name: Run torch compile tests on GPU
+      env:
+        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        RUN_COMPILE: yes
+      run: |
+        python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile -s -v -k "compile" --make-reports=tests_torch_compile_cuda tests/
+    - name: Failure short reports
+      if: ${{ failure() }}
+      run: cat reports/tests_torch_compile_cuda_failures_short.txt
+
+    - name: Test suite reports artifacts
+      if: ${{ always() }}
+      uses: actions/upload-artifact@v4
+      with:
+        name: torch_compile_test_reports
+        path: reports
+
+    - name: Generate Report and Notify Channel
+      if: always()
+      run: |
+        pip install slack_sdk tabulate
+        python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+  
  run_big_gpu_torch_tests:
    name: Torch tests on big GPU
    strategy:
@@ -335,7 +335,7 @@ jobs:
    - name: Environment
      run: |
        python utils/print_env.py
-    - name: Run example tests on GPU
+    - name: Run torch compile tests on GPU
      env:
        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
        RUN_COMPILE: yes
@@ -22,11 +22,11 @@ Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading]

 ## IPAdapterMixin

-[[autodoc]] loaders.ip_adapter.IPAdapterMixin
+[[autodoc]] loaders.ip_adapter.ip_adapter.IPAdapterMixin

 ## SD3IPAdapterMixin

-[[autodoc]] loaders.ip_adapter.SD3IPAdapterMixin
+[[autodoc]] loaders.ip_adapter.ip_adapter.SD3IPAdapterMixin
    - all
    - is_ip_adapter_active

@@ -28,6 +28,7 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
 - [`WanLoraLoaderMixin`] provides similar functions for [Wan](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan).
 - [`CogView4LoraLoaderMixin`] provides similar functions for [CogView4](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogview4).
 - [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`].
+- [`HiDreamImageLoraLoaderMixin`] provides similar functions for [HiDream Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hidream).
 - [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, unload, LoRAs and more.

 <Tip>
@@ -38,59 +39,71 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse

 ## StableDiffusionLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.StableDiffusionLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.StableDiffusionLoraLoaderMixin

 ## StableDiffusionXLLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.StableDiffusionXLLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.StableDiffusionXLLoraLoaderMixin

 ## SD3LoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.SD3LoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.SD3LoraLoaderMixin

 ## FluxLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.FluxLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.FluxLoraLoaderMixin

 ## CogVideoXLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.CogVideoXLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.CogVideoXLoraLoaderMixin

 ## Mochi1LoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.Mochi1LoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.Mochi1LoraLoaderMixin
 ## AuraFlowLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.AuraFlowLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.AuraFlowLoraLoaderMixin

 ## LTXVideoLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.LTXVideoLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.LTXVideoLoraLoaderMixin

 ## SanaLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.SanaLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.SanaLoraLoaderMixin

 ## HunyuanVideoLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.HunyuanVideoLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.HunyuanVideoLoraLoaderMixin

 ## Lumina2LoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.Lumina2LoraLoaderMixin
-
-## CogView4LoraLoaderMixin
-
-[[autodoc]] loaders.lora_pipeline.CogView4LoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.Lumina2LoraLoaderMixin

 ## WanLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.WanLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.WanLoraLoaderMixin
+
+## CogView4LoraLoaderMixin
+
+[[autodoc]] loaders.lora.lora_pipeline.CogView4LoraLoaderMixin

 ## AmusedLoraLoaderMixin

-[[autodoc]] loaders.lora_pipeline.AmusedLoraLoaderMixin
+[[autodoc]] loaders.lora.lora_pipeline.AmusedLoraLoaderMixin
+
+## HiDreamImageLoraLoaderMixin
+
+[[autodoc]] loaders.lora.lora_pipeline.HiDreamImageLoraLoaderMixin

 ## LoraBaseMixin

-[[autodoc]] loaders.lora_base.LoraBaseMixin
+[[autodoc]] loaders.lora.lora_base.LoraBaseMixin
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # SD3Transformer2D

-This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and SD3Transformer2DModel, check [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.
+This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and [`SD3Transformer2DModel`], check the [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.

 The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.

@@ -24,6 +24,6 @@ To learn more about how to load LoRA weights, see the [LoRA](../../using-diffuse

 ## SD3Transformer2DLoadersMixin

-[[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin
+[[autodoc]] loaders.ip_adapter.transformer_sd3.SD3Transformer2DLoadersMixin
    - all
    - _load_ip_adapter_weights
@@ -347,7 +347,7 @@ image = pipe(
    height=1024,
    prompt="wearing sunglasses",
    negative_prompt="",
-    true_cfg=4.0,
+    true_cfg_scale=4.0,
    generator=torch.Generator().manual_seed(4444),
    ip_adapter_image=image,
 ).images[0]
@@ -24,7 +24,7 @@

 ## Generating Videos with Wan 2.1

-We will first need to install some addtional dependencies.
+We will first need to install some additional dependencies.

 ```shell
 pip install -U ftfy imageio-ffmpeg imageio
@@ -216,7 +216,7 @@ Setting the `<ID_TOKEN>` is not necessary. From some limited experimentation, we
 > - The original repository uses a `lora_alpha` of `1`. We found this not suitable in many runs, possibly due to difference in modeling backends and training settings. Our recommendation is to set to the `lora_alpha` to either `rank` or `rank // 2`.
 > - If you're training on data whose captions generate bad results with the original model, a `rank` of 64 and above is good and also the recommendation by the team behind CogVideoX. If the generations are already moderately good on your training captions, a `rank` of 16/32 should work. We found that setting the rank too low, say `4`, is not ideal and doesn't produce promising results.
 > - The authors of CogVideoX recommend 4000 training steps and 100 training videos overall to achieve the best result. While that might yield the best results, we found from our limited experimentation that 2000 steps and 25 videos could also be sufficient.
-> - When using the Prodigy opitimizer for training, one can follow the recommendations from [this](https://huggingface.co/blog/sdxl_lora_advanced_script) blog. Prodigy tends to overfit quickly. From my very limited testing, I found a learning rate of `0.5` to be suitable in addition to `--prodigy_use_bias_correction`, `prodigy_safeguard_warmup` and `--prodigy_decouple`.
+> - When using the Prodigy optimizer for training, one can follow the recommendations from [this](https://huggingface.co/blog/sdxl_lora_advanced_script) blog. Prodigy tends to overfit quickly. From my very limited testing, I found a learning rate of `0.5` to be suitable in addition to `--prodigy_use_bias_correction`, `--prodigy_safeguard_warmup` and `--prodigy_decouple`.
 > - The recommended learning rate by the CogVideoX authors and from our experimentation with Adam/AdamW is between `1e-3` and `1e-4` for a dataset of 25+ videos.
 >
 > Note that our testing is not exhaustive due to limited time for exploration. Our recommendation would be to play around with the different knobs and dials to find the best settings for your data.
@@ -589,7 +589,7 @@ For stage 2 of DeepFloyd IF with DreamBooth, pay attention to these parameters:

 * `--learning_rate=5e-6`, use a lower learning rate with a smaller effective batch size
 * `--resolution=256`, the expected resolution for the upscaler
-* `--train_batch_size=2` and `--gradient_accumulation_steps=6`, to effectively train on images wiht faces requires larger batch sizes
+* `--train_batch_size=2` and `--gradient_accumulation_steps=6`, to effectively train on images with faces requires larger batch sizes

 ```bash
 export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
@@ -89,7 +89,7 @@ Many of the basic and important parameters are described in the [Text-to-image](

 As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the T2I-Adapter relevant parts of the script.

-The training script begins by preparing the dataset. This incudes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.
+The training script begins by preparing the dataset. This includes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.

 ```py
 conditioning_image_transforms = transforms.Compose(
@@ -2181,7 +2181,7 @@ def main(args):
                # Predict the noise residual
                model_pred = transformer(
                    hidden_states=packed_noisy_model_input,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
                    timestep=timesteps / 1000,
                    guidance=guidance,
                    pooled_projections=pooled_prompt_embeds,
@@ -5381,7 +5381,7 @@ pipe = DiffusionPipeline.from_pretrained(
 # Here we need use pipeline internal unet model
 pipe.unet = pipe.unet_model.from_pretrained(model_id, subfolder="unet", variant="fp16", use_safetensors=True)

-# Load aditional layers to the model
+# Load additional layers to the model
 pipe.unet.load_additional_layers(weight_path="proc_data/faithdiff/FaithDiff.bin", dtype=dtype)

 # Enable vae tiling
@@ -312,9 +312,9 @@ if __name__ == "__main__":
                    # These are the coordinates of the output image
                    out_coordinates = np.arange(1, out_length + 1)

-                    # since both scale-factor and output size can be provided simulatneously, perserving the center of the image requires shifting
-                    # the output coordinates. the deviation is because out_length doesn't necesary equal in_length*scale.
-                    # to keep the center we need to subtract half of this deivation so that we get equal margins for boths sides and center is preserved.
+                    # since both scale-factor and output size can be provided simultaneously, preserving the center of the image requires shifting
+                    # the output coordinates. the deviation is because out_length doesn't necessarily equal in_length*scale.
+                    # to keep the center we need to subtract half of this deviation so that we get equal margins for both sides and center is preserved.
                    shifted_out_coordinates = out_coordinates - (out_length - in_length * scale) / 2

                    # These are the matching positions of the output-coordinates on the input image coordinates.
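
To make the center-preserving shift described in the comment above concrete, here is a minimal pure-Python sketch. The helper name `shifted_coordinates` and the values (`in_length=10`, `scale=0.5`, `out_length=6`) are illustrative assumptions and do not come from the script, which operates on NumPy arrays.

```python
def shifted_coordinates(in_length, out_length, scale):
    """Return 1-based output coordinates, shifted so the image center is preserved.

    Hypothetical helper for illustration only.
    """
    out_coordinates = [float(i) for i in range(1, out_length + 1)]
    # out_length does not have to equal in_length * scale; half of that
    # deviation is subtracted so both sides keep equal margins.
    deviation = out_length - in_length * scale
    return [c - deviation / 2 for c in out_coordinates]


coords = shifted_coordinates(in_length=10, out_length=6, scale=0.5)
center = (coords[0] + coords[-1]) / 2
# Center of the grid we would get if out_length were exactly in_length * scale.
ideal_center = (10 * 0.5 + 1) / 2
print(coords, center, ideal_center)
```

With these values the requested output length (6) exceeds `in_length * scale` (5) by one pixel, so every coordinate moves by half a pixel and the midpoint of the shifted grid coincides with the midpoint of the exactly scaled grid.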
@@ -351,7 +351,7 @@ def my_forward(
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the [`AttnProcessor`].
            added_cond_kwargs: (`dict`, *optional*):
-                A kwargs dictionary containin additional embeddings that if specified are added to the embeddings that
+                A kwargs dictionary containing additional embeddings that if specified are added to the embeddings that
                are passed along to the UNet blocks.

        Returns:
@@ -864,9 +864,9 @@ def get_flow_and_interframe_paras(flow_model, imgs):
 class AttentionControl:
    """
    Control FRESCO-based attention
-    * enable/diable spatial-guided attention
-    * enable/diable temporal-guided attention
-    * enable/diable cross-frame attention
+    * enable/disable spatial-guided attention
+    * enable/disable temporal-guided attention
+    * enable/disable cross-frame attention
    * collect intermediate attention feature (for spatial-guided attention)
    """

@@ -34,7 +34,7 @@ class RASGAttnProcessor:
        temb: Optional[torch.Tensor] = None,
        scale: float = 1.0,
    ) -> torch.Tensor:
-        # Same as the default AttnProcessor up untill the part where similarity matrix gets saved
+        # Same as the default AttnProcessor up until the part where similarity matrix gets saved
        downscale_factor = self.mask_resoltuion // hidden_states.shape[1]
        residual = hidden_states

@@ -889,7 +889,7 @@ def main(args):
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
-        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be devide by the number of processes assuming batches are multiplied by the number of processes
+        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be divided by the number of processes assuming batches are multiplied by the number of processes
    )

    # Make one log on every process with the configuration for debugging.
@@ -721,7 +721,7 @@ def main(args):
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
-        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be devide by the number of processes assuming batches are multiplied by the number of processes
+        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be divided by the number of processes assuming batches are multiplied by the number of processes
    )

    # Make one log on every process with the configuration for debugging.
@@ -884,7 +884,7 @@ def main(args):
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
-        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be devide by the number of processes assuming batches are multiplied by the number of processes
+        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be divided by the number of processes assuming batches are multiplied by the number of processes
    )

    # Make one log on every process with the configuration for debugging.
@@ -854,7 +854,7 @@ def main(args):
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
-        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be devide by the number of processes assuming batches are multiplied by the number of processes
+        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be divided by the number of processes assuming batches are multiplied by the number of processes
    )

    # Make one log on every process with the configuration for debugging.
@@ -894,7 +894,7 @@ def main(args):
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
-        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be devide by the number of processes assuming batches are multiplied by the number of processes
+        split_batches=True,  # It's important to set this to True when using webdataset to get the right number of steps for lr scheduling. If set to False, the number of steps will be divided by the number of processes assuming batches are multiplied by the number of processes
    )

    # Make one log on every process with the configuration for debugging.
@@ -6,7 +6,19 @@ Training script provided by LibAI, which is an institution dedicated to the prog
 > [!NOTE]
 > **Memory consumption**
 >
-> Flux can be quite expensive to run on consumer hardware devices and as a result, ControlNet training of it comes with higher memory requirements than usual. 
+> Flux can be quite expensive to run on consumer hardware devices and as a result, ControlNet training of it comes with higher memory requirements than usual.
+
+Here is the GPU memory consumption for reference, measured on a single 80GB A100.
+
+| Stage | GPU memory |
+| - | - |
+| load as float32 | ~70 GB |
+| move transformer and VAE to bf16 | ~48 GB |
+| pre-compute text embeddings | ~62 GB |
+| **offload text encoders to CPU** | ~30 GB |
+| training | ~58 GB |
+| validation | ~71 GB |
+

 > **Gated access**
 >
@@ -98,8 +110,9 @@ accelerate launch train_controlnet_flux.py \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
-    --gradient_accumulation_steps=4 \
+    --gradient_accumulation_steps=16 \
    --report_to="wandb" \
+    --lr_scheduler="cosine" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42 \
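
As a rough illustration of what the change from `--gradient_accumulation_steps=4` to `16` means, the sketch below computes the number of samples that contribute to each optimizer step. The helper name `effective_batch_size` is hypothetical, and a single training process is assumed unless `num_processes` is passed.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to one optimizer step (hypothetical helper).

    Gradients from `gradient_accumulation_steps` micro-batches are summed
    before the optimizer updates the weights.
    """
    return train_batch_size * gradient_accumulation_steps * num_processes


before = effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4)
after = effective_batch_size(train_batch_size=1, gradient_accumulation_steps=16)
print(before, after)
```

Under these assumptions the per-device batch size, and therefore peak activation memory, stays the same while each optimizer step averages over four times as many samples.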
@@ -148,7 +148,7 @@ def log_validation(
                    pooled_prompt_embeds=pooled_prompt_embeds,
                    control_image=validation_image,
                    num_inference_steps=28,
-                    controlnet_conditioning_scale=0.7,
+                    controlnet_conditioning_scale=1,
                    guidance_scale=3.5,
                    generator=generator,
                ).images[0]
@@ -1085,8 +1085,6 @@ def main(args):
        return {"prompt_embeds": prompt_embeds, "pooled_prompt_embeds": pooled_prompt_embeds, "text_ids": text_ids}

    train_dataset = get_train_dataset(args, accelerator)
-    text_encoders = [text_encoder_one, text_encoder_two]
-    tokenizers = [tokenizer_one, tokenizer_two]
    compute_embeddings_fn = functools.partial(
        compute_embeddings,
        flux_controlnet_pipeline=flux_controlnet_pipeline,
@@ -1103,7 +1101,8 @@ def main(args):
            compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint, batch_size=50
        )

-    del text_encoders, tokenizers, text_encoder_one, text_encoder_two, tokenizer_one, tokenizer_two
+    text_encoder_one.to("cpu")
+    text_encoder_two.to("cpu")
    free_memory()

    # Then get the training dataset ready to be passed to the dataloader.
@@ -0,0 +1,119 @@
+# DreamBooth training example for HiDream Image
+
+[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text2image models like Stable Diffusion given just a few (3-5) images of a subject.
+
+The `train_dreambooth_lora_hidream.py` script shows how to implement the training procedure with [LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) and adapt it for [HiDream Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/). 
+
+
+This will also allow us to push the trained model parameters to the Hugging Face Hub platform.
+
+## Running locally with PyTorch
+
+### Installing the dependencies
+
+Before running the scripts, make sure to install the library's training dependencies:
+
+**Important**
+
+To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
+
+```bash
+git clone https://github.com/huggingface/diffusers
+cd diffusers
+pip install -e .
+```
+
+Then cd into the `examples/dreambooth` folder and run:
+```bash
+pip install -r requirements_hidream.txt
+```
+
+And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
+
+```bash
+accelerate config
+```
+
+Or for a default accelerate configuration without answering questions about your environment:
+
+```bash
+accelerate config default
+```
+
+Or if your environment doesn't support an interactive shell (e.g., a notebook):
+
+```python
+from accelerate.utils import write_basic_config
+write_basic_config()
+```
+
+When running `accelerate config`, setting torch compile mode to True can yield dramatic speedups.
+Note also that we use the PEFT library as the backend for LoRA training, so make sure to have `peft>=0.14.0` installed in your environment.
+
+
+### 3d icon example
+
+For this example we will use some 3d icon images: https://huggingface.co/datasets/linoyts/3d_icon.
+
+This will also allow us to push the trained LoRA parameters to the Hugging Face Hub platform.
+
+Now, we can launch training using:
+> [!NOTE]
+> The following training configuration prioritizes lower memory consumption by using gradient checkpointing,
+> the 8-bit Adam optimizer, latent caching, offloading, and no validation.
+> All text embeddings are pre-computed to save memory.
+```bash
+export MODEL_NAME="HiDream-ai/HiDream-I1-Dev"
+export INSTANCE_DIR="linoyts/3d_icon"
+export OUTPUT_DIR="trained-hidream-lora"
+
+accelerate launch train_dreambooth_lora_hidream.py \
+  --pretrained_model_name_or_path=$MODEL_NAME  \
+  --dataset_name=$INSTANCE_DIR \
+  --output_dir=$OUTPUT_DIR \
+  --mixed_precision="bf16" \
+  --instance_prompt="3d icon" \
+  --caption_column="prompt" \
+  --validation_prompt="a 3dicon, a llama eating ramen" \
+  --resolution=1024 \
+  --train_batch_size=1 \
+  --gradient_accumulation_steps=4 \
+  --use_8bit_adam \
+  --rank=8 \
+  --learning_rate=2e-4 \
+  --report_to="wandb" \
+  --lr_scheduler="constant_with_warmup" \
+  --lr_warmup_steps=100 \
+  --max_train_steps=1000 \
+  --cache_latents \
+  --gradient_checkpointing \
+  --validation_epochs=25 \
+  --seed="0" \
+  --push_to_hub
+```
+
+To use `push_to_hub`, make sure you're logged in to your Hugging Face account:
+
+```bash
+huggingface-cli login
+```
+
+To better track our training experiments, we're using the following flags in the command above:
+
+* `report_to="wandb"` will ensure the training runs are tracked on [Weights and Biases](https://wandb.ai/site). To use it, be sure to install `wandb` with `pip install wandb`. Don't forget to call `wandb login <your_api_key>` before training if you haven't done it before.
+* `validation_prompt` and `validation_epochs` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
+
+## Notes
+
+Additionally, we welcome you to explore the following CLI arguments:
+
+* `--lora_layers`: The transformer modules to apply LoRA training on. Please specify the layers as a comma-separated string. E.g., "to_k,to_q,to_v" will result in LoRA training of the attention layers only.
+* `--rank`: The rank of the LoRA layers. The higher the rank, the more parameters are trained. The default is 16.
+
+We provide several options for optimizing memory usage:
+
+* `--offload`: When enabled, we will offload the text encoder and VAE to CPU, when they are not used.
+* `--cache_latents`: When enabled, we will pre-compute the latents from the input images with the VAE and remove the VAE from memory once done.
+* `--use_8bit_adam`: When enabled, we will use the 8bit version of AdamW provided by the `bitsandbytes` library.
+
+Refer to the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/) of the `HiDreamImagePipeline` to know more about the model.
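
The `--lora_layers` option described in the notes above takes a comma-separated string of module names. A minimal sketch of how such a string could be turned into a list of target modules follows; the helper name `parse_lora_layers` is hypothetical, and the training script may parse the argument differently.

```python
def parse_lora_layers(lora_layers):
    """Split a comma-separated string into clean module names.

    Hypothetical helper: strips surrounding whitespace and drops empty
    entries, e.g. those produced by a trailing comma.
    """
    return [layer.strip() for layer in lora_layers.split(",") if layer.strip()]


# Attention projections only, as in the example from the notes.
target_modules = parse_lora_layers("to_k, to_q,to_v")
print(target_modules)
```

A list like this is the shape PEFT's `LoraConfig` accepts for its `target_modules` argument, which is presumably where the script forwards it.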
@@ -0,0 +1,8 @@
+accelerate>=1.4.0
+torchvision
+transformers>=4.50.0
+ftfy
+tensorboard
+Jinja2
+peft>=0.14.0
+sentencepiece
@@ -0,0 +1,220 @@
+# coding=utf-8
+# Copyright 2024 HuggingFace Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import os
+import sys
+import tempfile
+
+import safetensors
+
+
+sys.path.append("..")
+from test_examples_utils import ExamplesTestsAccelerate, run_command  # noqa: E402
+
+
+logging.basicConfig(level=logging.DEBUG)
+
+logger = logging.getLogger()
+stream_handler = logging.StreamHandler(sys.stdout)
+logger.addHandler(stream_handler)
+
+
+class DreamBoothLoRAHiDreamImage(ExamplesTestsAccelerate):
+    instance_data_dir = "docs/source/en/imgs"
+    pretrained_model_name_or_path = "hf-internal-testing/tiny-hidream-i1-pipe"
+    text_encoder_4_path = "hf-internal-testing/tiny-random-LlamaForCausalLM"
+    tokenizer_4_path = "hf-internal-testing/tiny-random-LlamaForCausalLM"
+    script_path = "examples/dreambooth/train_dreambooth_lora_hidream.py"
+    transformer_layer_type = "double_stream_blocks.0.block.attn1.to_k"
+
+    def test_dreambooth_lora_hidream(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            test_args = f"""
+                {self.script_path}
+                --pretrained_model_name_or_path {self.pretrained_model_name_or_path}
+                --pretrained_text_encoder_4_name_or_path {self.text_encoder_4_path}
+                --pretrained_tokenizer_4_name_or_path {self.tokenizer_4_path}
+                --instance_data_dir {self.instance_data_dir}
+                --resolution 32
+                --train_batch_size 1
+                --gradient_accumulation_steps 1
+                --max_train_steps 2
+                --learning_rate 5.0e-04
+                --scale_lr
+                --lr_scheduler constant
+                --lr_warmup_steps 0
+                --output_dir {tmpdir}
+                --max_sequence_length 16
+                """.split()
+
+            test_args.extend(["--instance_prompt", ""])
+            run_command(self._launch_args + test_args)
+            # save_pretrained smoke test
+            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
+
+            # make sure the state_dict has the correct naming in the parameters.
+            lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
+            is_lora = all("lora" in k for k in lora_state_dict.keys())
+            self.assertTrue(is_lora)
+
+            # when not training the text encoder, all the parameters in the state dict should start
+            # with `"transformer"` in their names.
+            starts_with_transformer = all(key.startswith("transformer") for key in lora_state_dict.keys())
+            self.assertTrue(starts_with_transformer)
+
+    def test_dreambooth_lora_latent_caching(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            test_args = f"""
+                {self.script_path}
+                --pretrained_model_name_or_path {self.pretrained_model_name_or_path}
+                --pretrained_text_encoder_4_name_or_path {self.text_encoder_4_path}
+                --pretrained_tokenizer_4_name_or_path {self.tokenizer_4_path}
+                --instance_data_dir {self.instance_data_dir}
+                --resolution 32
+                --train_batch_size 1
+                --gradient_accumulation_steps 1
+                --max_train_steps 2
+                --cache_latents
+                --learning_rate 5.0e-04
+                --scale_lr
+                --lr_scheduler constant
+                --lr_warmup_steps 0
+                --output_dir {tmpdir}
+                --max_sequence_length 16
+                """.split()
+
+            test_args.extend(["--instance_prompt", ""])
+            run_command(self._launch_args + test_args)
+            # save_pretrained smoke test
+            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
+
+            # make sure the state_dict has the correct naming in the parameters.
+            lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
+            is_lora = all("lora" in k for k in lora_state_dict.keys())
+            self.assertTrue(is_lora)
+
+            # when not training the text encoder, all the parameters in the state dict should start
+            # with `"transformer"` in their names.
+            starts_with_transformer = all(key.startswith("transformer") for key in lora_state_dict.keys())
+            self.assertTrue(starts_with_transformer)
+
+    def test_dreambooth_lora_layers(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            test_args = f"""
+                {self.script_path}
+                --pretrained_model_name_or_path {self.pretrained_model_name_or_path}
+                --pretrained_text_encoder_4_name_or_path {self.text_encoder_4_path}
+                --pretrained_tokenizer_4_name_or_path {self.tokenizer_4_path}
+                --instance_data_dir {self.instance_data_dir}
+                --resolution 32
+                --train_batch_size 1
+                --gradient_accumulation_steps 1
+                --max_train_steps 2
+                --cache_latents
+                --learning_rate 5.0e-04
+                --scale_lr
+                --lora_layers {self.transformer_layer_type}
+                --lr_scheduler constant
+                --lr_warmup_steps 0
+                --output_dir {tmpdir}
+                --max_sequence_length 16
+                """.split()
+
+            test_args.extend(["--instance_prompt", ""])
+            run_command(self._launch_args + test_args)
+            # save_pretrained smoke test
+            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
+
+            # make sure the state_dict has the correct naming in the parameters.
+            lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
+            is_lora = all("lora" in k for k in lora_state_dict.keys())
+            self.assertTrue(is_lora)
+
+            # when not training the text encoder, all the parameters in the state dict should start
+            # with `"transformer"` in their names. In this test, only params of
+            # `self.transformer_layer_type` should be in the state dict.
+            starts_with_transformer = all(self.transformer_layer_type in key for key in lora_state_dict)
+            self.assertTrue(starts_with_transformer)
+
+    def test_dreambooth_lora_hidream_checkpointing_checkpoints_total_limit(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            test_args = f"""
+            {self.script_path}
+            --pretrained_model_name_or_path={self.pretrained_model_name_or_path}
+            --pretrained_text_encoder_4_name_or_path {self.text_encoder_4_path}
+            --pretrained_tokenizer_4_name_or_path {self.tokenizer_4_path}
+            --instance_data_dir={self.instance_data_dir}
+            --output_dir={tmpdir}
+            --resolution=32
+            --train_batch_size=1
+            --gradient_accumulation_steps=1
+            --max_train_steps=6
+            --checkpoints_total_limit=2
+            --checkpointing_steps=2
+            --max_sequence_length 16
+            """.split()
+
+            test_args.extend(["--instance_prompt", ""])
+            run_command(self._launch_args + test_args)
+
+            self.assertEqual(
+                {x for x in os.listdir(tmpdir) if "checkpoint" in x},
+                {"checkpoint-4", "checkpoint-6"},
+            )
+
+    def test_dreambooth_lora_hidream_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            test_args = f"""
+            {self.script_path}
+            --pretrained_model_name_or_path={self.pretrained_model_name_or_path}
+            --pretrained_text_encoder_4_name_or_path {self.text_encoder_4_path}
+            --pretrained_tokenizer_4_name_or_path {self.tokenizer_4_path}
+            --instance_data_dir={self.instance_data_dir}
+            --output_dir={tmpdir}
+            --resolution=32
+            --train_batch_size=1
+            --gradient_accumulation_steps=1
+            --max_train_steps=4
+            --checkpointing_steps=2
+            --max_sequence_length 16
+            """.split()
+
+            test_args.extend(["--instance_prompt", ""])
+            run_command(self._launch_args + test_args)
+
+            self.assertEqual({x for x in os.listdir(tmpdir) if "checkpoint" in x}, {"checkpoint-2", "checkpoint-4"})
+
+            resume_run_args = f"""
+            {self.script_path}
+            --pretrained_model_name_or_path={self.pretrained_model_name_or_path}
+            --pretrained_text_encoder_4_name_or_path {self.text_encoder_4_path}
+            --pretrained_tokenizer_4_name_or_path {self.tokenizer_4_path}
+            --instance_data_dir={self.instance_data_dir}
+            --output_dir={tmpdir}
+            --resolution=32
+            --train_batch_size=1
+            --gradient_accumulation_steps=1
+            --max_train_steps=8
+            --checkpointing_steps=2
+            --resume_from_checkpoint=checkpoint-4
+            --checkpoints_total_limit=2
+            --max_sequence_length 16
+            """.split()
+
+            resume_run_args.extend(["--instance_prompt", ""])
+            run_command(self._launch_args + resume_run_args)
+
+            self.assertEqual({x for x in os.listdir(tmpdir) if "checkpoint" in x}, {"checkpoint-6", "checkpoint-8"})
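The two checkpoint-limit tests above depend on the training script pruning old checkpoints before each save. The sketch below is a pure-Python simulation of that behaviour. It assumes the script deletes just enough of the oldest `checkpoint-<step>` directories to stay within `--checkpoints_total_limit` once the new checkpoint is written. `simulate_checkpoints` is a hypothetical helper for illustration and is not part of the script.

```python
def simulate_checkpoints(existing, start_step, max_train_steps, checkpointing_steps, total_limit=None):
    """Return the checkpoint names left on disk after a simulated run."""
    # Order by step number so the oldest checkpoint comes first.
    checkpoints = sorted(existing, key=lambda name: int(name.split("-")[1]))
    for step in range(start_step + 1, max_train_steps + 1):
        if step % checkpointing_steps != 0:
            continue
        if total_limit is not None and len(checkpoints) >= total_limit:
            # Assumption: make room for the new checkpoint by dropping the oldest ones.
            num_to_remove = len(checkpoints) - total_limit + 1
            checkpoints = checkpoints[num_to_remove:]
        checkpoints.append(f"checkpoint-{step}")
    return set(checkpoints)


# Single run with a limit of 2 (first test).
first_run = simulate_checkpoints([], 0, 6, 2, total_limit=2)

# Run without a limit, then resume from step 4 with a limit of 2 (second test).
no_limit = simulate_checkpoints([], 0, 4, 2)
resumed = simulate_checkpoints(no_limit, 4, 8, 2, total_limit=2)
```

Under these assumptions the simulated directory sets match the ones the tests assert on, including the resumed run where the limit is only introduced on the second invocation.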
@@ -618,6 +618,15 @@ def parse_args(input_args=None):
        ),
    )
    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
+    parser.add_argument(
+        "--image_interpolation_mode",
+        type=str,
+        default="lanczos",
+        choices=[
+            f.lower() for f in dir(transforms.InterpolationMode) if not f.startswith("__") and not f.endswith("__")
+        ],
+        help="The image interpolation method to use for resizing images.",
+    )

    if input_args is not None:
        args = parser.parse_args(input_args)
@@ -737,7 +746,10 @@ class DreamBoothDataset(Dataset):
            self.instance_images.extend(itertools.repeat(img, repeats))

        self.pixel_values = []
-        train_resize = transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR)
+        interpolation = getattr(transforms.InterpolationMode, args.image_interpolation_mode.upper(), None)
+        if interpolation is None:
+            raise ValueError(f"Unsupported interpolation mode {args.image_interpolation_mode=}.")
+        train_resize = transforms.Resize(size, interpolation=interpolation)
        train_crop = transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size)
        train_flip = transforms.RandomHorizontalFlip(p=1.0)
        train_transforms = transforms.Compose(
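The `--image_interpolation_mode` handling in this hunk maps a lowercase CLI string to an enum member through `getattr`. A minimal sketch of that lookup follows. It uses a stand-in `InterpolationMode` enum so it runs without torchvision; the member names are assumed to mirror `torchvision.transforms.InterpolationMode`, and `resolve_interpolation` is a hypothetical helper name.

```python
from enum import Enum


class InterpolationMode(Enum):
    # Stand-in for torchvision.transforms.InterpolationMode (member names assumed).
    NEAREST = "nearest"
    NEAREST_EXACT = "nearest-exact"
    BILINEAR = "bilinear"
    BICUBIC = "bicubic"
    BOX = "box"
    HAMMING = "hamming"
    LANCZOS = "lanczos"


# Lowercased member names, usable as argparse `choices`.
choices = [name.lower() for name in InterpolationMode.__members__]


def resolve_interpolation(mode_name):
    """Map a CLI string such as "lanczos" to the matching enum member."""
    interpolation = getattr(InterpolationMode, mode_name.upper(), None)
    if interpolation is None:
        # Report the user-supplied string; `interpolation` itself is None here.
        raise ValueError(f"Unsupported interpolation mode {mode_name!r}.")
    return interpolation


resolved = resolve_interpolation("lanczos")
```

Because argparse already restricts the value through `choices`, the `ValueError` branch mainly guards callers that build `args` programmatically.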
@@ -1622,7 +1634,7 @@ def main(args):
                # Predict the noise residual
                model_pred = transformer(
                    hidden_states=packed_noisy_model_input,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
                    timestep=timesteps / 1000,
                    guidance=guidance,
                    pooled_projections=pooled_prompt_embeds,
@@ -524,6 +524,15 @@ def parse_args(input_args=None):
        default=4,
        help=("The dimension of the LoRA update matrices."),
    )
+    parser.add_argument(
+        "--image_interpolation_mode",
+        type=str,
+        default="lanczos",
+        choices=[
+            f.lower() for f in dir(transforms.InterpolationMode) if not f.startswith("__") and not f.endswith("__")
+        ],
+        help="The image interpolation method to use for resizing images.",
+    )

    if input_args is not None:
        args = parser.parse_args(input_args)
@@ -601,9 +610,13 @@ class DreamBoothDataset(Dataset):
        else:
            self.class_data_root = None

+        interpolation = getattr(transforms.InterpolationMode, args.image_interpolation_mode.upper(), None)
+        if interpolation is None:
+            raise ValueError(f"Unsupported interpolation mode {args.image_interpolation_mode=}.")
+
        self.image_transforms = transforms.Compose(
            [
-                transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR),
+                transforms.Resize(size, interpolation=interpolation),
                transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
@@ -1749,7 +1749,7 @@ def main(args):
                # Predict the noise residual
                model_pred = transformer(
                    hidden_states=packed_noisy_model_input,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
                    timestep=timesteps / 1000,
                    guidance=guidance,
                    pooled_projections=pooled_prompt_embeds,
@@ -1088,7 +1088,7 @@ def main(args):
                text_ids = batch["text_ids"].to(device=accelerator.device, dtype=weight_dtype)
                model_pred = transformer(
                    hidden_states=packed_noisy_model_input,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
+                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transformer model (we should not keep it but I want to keep the inputs same for the model for testing)
                    timestep=timesteps / 1000,
                    guidance=guidance,
                    pooled_projections=pooled_prompt_embeds,
@@ -499,6 +499,15 @@ def parse_args():
            " more information see https://huggingface.co/docs/accelerate/v0.17.0/en/package_reference/accelerator#accelerate.Accelerator"
        ),
    )
+    parser.add_argument(
+        "--image_interpolation_mode",
+        type=str,
+        default="lanczos",
+        choices=[
+            f.lower() for f in dir(transforms.InterpolationMode) if not f.startswith("__") and not f.endswith("__")
+        ],
+        help="The image interpolation method to use for resizing images.",
+    )

    args = parser.parse_args()
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
@@ -787,10 +796,17 @@ def main():
        )
        return inputs.input_ids

-    # Preprocessing the datasets.
+    # Get the specified interpolation method from the args
+    interpolation = getattr(transforms.InterpolationMode, args.image_interpolation_mode.upper(), None)
+
+    # Raise an error if the interpolation method is invalid
+    if interpolation is None:
+        raise ValueError(f"Unsupported interpolation mode {args.image_interpolation_mode}.")
+
+    # Data preprocessing transformations
    train_transforms = transforms.Compose(
        [
-            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
+            transforms.Resize(args.resolution, interpolation=interpolation),  # Use dynamic interpolation method
            transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
            transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
            transforms.ToTensor(),
@@ -418,6 +418,15 @@ def parse_args():
        default=4,
        help=("The dimension of the LoRA update matrices."),
    )
+    parser.add_argument(
+        "--image_interpolation_mode",
+        type=str,
+        default="lanczos",
+        choices=[
+            f.lower() for f in dir(transforms.InterpolationMode) if not f.startswith("__") and not f.endswith("__")
+        ],
+        help="The image interpolation method to use for resizing images.",
+    )

    args = parser.parse_args()
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
@@ -649,10 +658,17 @@ def main():
        )
        return inputs.input_ids

-    # Preprocessing the datasets.
+    # Get the specified interpolation method from the args
+    interpolation = getattr(transforms.InterpolationMode, args.image_interpolation_mode.upper(), None)
+
+    # Raise an error if the interpolation method is invalid
+    if interpolation is None:
+        raise ValueError(f"Unsupported interpolation mode {args.image_interpolation_mode}.")
+
+    # Data preprocessing transformations
    train_transforms = transforms.Compose(
        [
-            transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
+            transforms.Resize(args.resolution, interpolation=interpolation),  # Use dynamic interpolation method
            transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
            transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
            transforms.ToTensor(),
@@ -57,7 +57,7 @@ class ModuleGroup:
        non_blocking: bool = False,
        stream: Optional[torch.cuda.Stream] = None,
        record_stream: Optional[bool] = False,
-        low_cpu_mem_usage=False,
+        low_cpu_mem_usage: bool = False,
        onload_self: bool = True,
    ) -> None:
        self.modules = modules
@@ -498,6 +498,8 @@ def _apply_group_offloading_block_level(
            option only matters when using streamed CPU offloading (i.e. `use_stream=True`). This can be useful when
            the CPU memory is a bottleneck but may counteract the benefits of using streams.
    """
+    if stream is not None and num_blocks_per_group != 1:
+        raise ValueError(f"Using streams is only supported for num_blocks_per_group=1. Got {num_blocks_per_group=}.")

    # Create module groups for ModuleList and Sequential blocks
    modules_with_group_offloading = set()
@@ -521,7 +523,7 @@ def _apply_group_offloading_block_level(
                stream=stream,
                record_stream=record_stream,
                low_cpu_mem_usage=low_cpu_mem_usage,
-                onload_self=stream is None,
+                onload_self=True,
            )
            matched_module_groups.append(group)
            for j in range(i, i + len(current_modules)):
@@ -529,12 +531,8 @@ def _apply_group_offloading_block_level(

    # Apply group offloading hooks to the module groups
    for i, group in enumerate(matched_module_groups):
-        next_group = (
-            matched_module_groups[i + 1] if i + 1 < len(matched_module_groups) and stream is not None else None
-        )
-
        for group_module in group.modules:
-            _apply_group_offloading_hook(group_module, group, next_group)
+            _apply_group_offloading_hook(group_module, group, None)

    # Parameters and Buffers of the top-level module need to be offloaded/onloaded separately
    # when the forward pass of this module is called. This is because the top-level module is not
@@ -560,8 +558,10 @@ def _apply_group_offloading_block_level(
        record_stream=False,
        onload_self=True,
    )
-    next_group = matched_module_groups[0] if len(matched_module_groups) > 0 else None
-    _apply_group_offloading_hook(module, unmatched_group, next_group)
+    if stream is None:
+        _apply_group_offloading_hook(module, unmatched_group, None)
+    else:
+        _apply_lazy_group_offloading_hook(module, unmatched_group, None)


 def _apply_group_offloading_leaf_level(
@@ -116,6 +116,7 @@ class VaeImageProcessor(ConfigMixin):
        vae_scale_factor: int = 8,
        vae_latent_channels: int = 4,
        resample: str = "lanczos",
+        reducing_gap: Optional[int] = None,
        do_normalize: bool = True,
        do_binarize: bool = False,
        do_convert_rgb: bool = False,
@@ -498,7 +499,11 @@ class VaeImageProcessor(ConfigMixin):
            raise ValueError(f"Only PIL image input is supported for resize_mode {resize_mode}")
        if isinstance(image, PIL.Image.Image):
            if resize_mode == "default":
-                image = image.resize((width, height), resample=PIL_INTERPOLATION[self.config.resample])
+                image = image.resize(
+                    (width, height),
+                    resample=PIL_INTERPOLATION[self.config.resample],
+                    reducing_gap=self.config.reducing_gap,
+                )
            elif resize_mode == "fill":
                image = self._resize_and_fill(image, width, height)
            elif resize_mode == "crop":
@@ -54,14 +54,14 @@ if is_transformers_available():
 _import_structure = {}

 if is_torch_available():
-    _import_structure["single_file_model"] = ["FromOriginalModelMixin"]
-    _import_structure["transformer_flux"] = ["FluxTransformer2DLoadersMixin"]
-    _import_structure["transformer_sd3"] = ["SD3Transformer2DLoadersMixin"]
-    _import_structure["unet"] = ["UNet2DConditionLoadersMixin"]
+    _import_structure["ip_adapter.transformer_flux"] = ["FluxTransformer2DLoadersMixin"]
+    _import_structure["ip_adapter.transformer_sd3"] = ["SD3Transformer2DLoadersMixin"]
+    _import_structure["single_file.single_file_model"] = ["FromOriginalModelMixin"]
+    _import_structure["unet.unet"] = ["UNet2DConditionLoadersMixin"]
    _import_structure["utils"] = ["AttnProcsLayers"]
    if is_transformers_available():
-        _import_structure["single_file"] = ["FromSingleFileMixin"]
-        _import_structure["lora_pipeline"] = [
+        _import_structure["single_file.single_file"] = ["FromSingleFileMixin"]
+        _import_structure["lora.lora_pipeline"] = [
            "AmusedLoraLoaderMixin",
            "StableDiffusionLoraLoaderMixin",
            "SD3LoraLoaderMixin",
@@ -77,9 +77,10 @@ if is_torch_available():
            "SanaLoraLoaderMixin",
            "Lumina2LoraLoaderMixin",
            "WanLoraLoaderMixin",
+            "HiDreamImageLoraLoaderMixin",
        ]
        _import_structure["textual_inversion"] = ["TextualInversionLoaderMixin"]
-        _import_structure["ip_adapter"] = [
+        _import_structure["ip_adapter.ip_adapter"] = [
            "IPAdapterMixin",
            "FluxIPAdapterMixin",
            "SD3IPAdapterMixin",
@@ -90,25 +91,22 @@ _import_structure["peft"] = ["PeftAdapterMixin"]

 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
    if is_torch_available():
-        from .single_file_model import FromOriginalModelMixin
-        from .transformer_flux import FluxTransformer2DLoadersMixin
-        from .transformer_sd3 import SD3Transformer2DLoadersMixin
+        from .ip_adapter import FluxTransformer2DLoadersMixin, SD3Transformer2DLoadersMixin
+        from .single_file import FromOriginalModelMixin
        from .unet import UNet2DConditionLoadersMixin
        from .utils import AttnProcsLayers

        if is_transformers_available():
-            from .ip_adapter import (
-                FluxIPAdapterMixin,
-                IPAdapterMixin,
-                SD3IPAdapterMixin,
-            )
-            from .lora_pipeline import (
+            from .ip_adapter import FluxIPAdapterMixin, IPAdapterMixin, SD3IPAdapterMixin
+            from .lora import (
                AmusedLoraLoaderMixin,
                AuraFlowLoraLoaderMixin,
                CogVideoXLoraLoaderMixin,
                CogView4LoraLoaderMixin,
                FluxLoraLoaderMixin,
+                HiDreamImageLoraLoaderMixin,
                HunyuanVideoLoraLoaderMixin,
+                LoraBaseMixin,
                LoraLoaderMixin,
                LTXVideoLoraLoaderMixin,
                Lumina2LoraLoaderMixin,
@@ -12,868 +12,27 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from pathlib import Path
-from typing import Dict, List, Optional, Union

-import torch
-import torch.nn.functional as F
-from huggingface_hub.utils import validate_hf_hub_args
-from safetensors import safe_open
+from ..utils import deprecate
+from .ip_adapter import FluxIPAdapterMixin, IPAdapterMixin, SD3IPAdapterMixin

-from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_state_dict
-from ..utils import (
-    USE_PEFT_BACKEND,
-    _get_detailed_type,
-    _get_model_file,
-    _is_valid_type,
-    is_accelerate_available,
-    is_torch_version,
-    is_transformers_available,
-    logging,
-)
-from .unet_loader_utils import _maybe_expand_lora_scales

+class IPAdapterMixin(IPAdapterMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `IPAdapterMixin` from diffusers.loaders.ip_adapter has been deprecated. Please use `from diffusers.loaders.ip_adapter.ip_adapter import IPAdapterMixin` instead."
+        deprecate("diffusers.loaders.ip_adapter.IPAdapterMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)

-if is_transformers_available():
-    from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection, SiglipImageProcessor, SiglipVisionModel

-from ..models.attention_processor import (
-    AttnProcessor,
-    AttnProcessor2_0,
-    FluxAttnProcessor2_0,
-    FluxIPAdapterJointAttnProcessor2_0,
-    IPAdapterAttnProcessor,
-    IPAdapterAttnProcessor2_0,
-    IPAdapterXFormersAttnProcessor,
-    JointAttnProcessor2_0,
-    SD3IPAdapterJointAttnProcessor2_0,
-)
+class FluxIPAdapterMixin(FluxIPAdapterMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `FluxIPAdapterMixin` from diffusers.loaders.ip_adapter has been deprecated. Please use `from diffusers.loaders.ip_adapter.ip_adapter import FluxIPAdapterMixin` instead."
+        deprecate("diffusers.loaders.ip_adapter.FluxIPAdapterMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)


-logger = logging.get_logger(__name__)
-
-
-class IPAdapterMixin:
-    """Mixin for handling IP Adapters."""
-
-    @validate_hf_hub_args
-    def load_ip_adapter(
-        self,
-        pretrained_model_name_or_path_or_dict: Union[str, List[str], Dict[str, torch.Tensor]],
-        subfolder: Union[str, List[str]],
-        weight_name: Union[str, List[str]],
-        image_encoder_folder: Optional[str] = "image_encoder",
-        **kwargs,
-    ):
-        """
-        Parameters:
-            pretrained_model_name_or_path_or_dict (`str` or `List[str]` or `os.PathLike` or `List[os.PathLike]` or `dict` or `List[dict]`):
-                Can be either:
-
-                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
-                      the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-                    - A [torch state
-                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
-            subfolder (`str` or `List[str]`):
-                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
-                list is passed, it should have the same length as `weight_name`.
-            weight_name (`str` or `List[str]`):
-                The name of the weight file to load. If a list is passed, it should have the same length as
-                `subfolder`.
-            image_encoder_folder (`str`, *optional*, defaults to `image_encoder`):
-                The subfolder location of the image encoder within a larger model repository on the Hub or locally.
-                Pass `None` to not load the image encoder. If the image encoder is located in a folder inside
-                `subfolder`, you only need to pass the name of the folder that contains image encoder weights, e.g.
-                `image_encoder_folder="image_encoder"`. If the image encoder is located in a folder other than
-                `subfolder`, you should pass the path to the folder that contains image encoder weights, for example,
-                `image_encoder_folder="different_subfolder/image_encoder"`.
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to `True`, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-
-        # handle the list inputs for multiple IP Adapters
-        if not isinstance(weight_name, list):
-            weight_name = [weight_name]
-
-        if not isinstance(pretrained_model_name_or_path_or_dict, list):
-            pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
-        if len(pretrained_model_name_or_path_or_dict) == 1:
-            pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
-
-        if not isinstance(subfolder, list):
-            subfolder = [subfolder]
-        if len(subfolder) == 1:
-            subfolder = subfolder * len(weight_name)
-
-        if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
-            raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
-
-        if len(weight_name) != len(subfolder):
-            raise ValueError("`weight_name` and `subfolder` must have the same length.")
-
-        # Load the main state dict first.
-        cache_dir = kwargs.pop("cache_dir", None)
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        token = kwargs.pop("token", None)
-        revision = kwargs.pop("revision", None)
-        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
-
-        if low_cpu_mem_usage and not is_accelerate_available():
-            low_cpu_mem_usage = False
-            logger.warning(
-                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                " install accelerate\n```\n."
-            )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        user_agent = {
-            "file_type": "attn_procs_weights",
-            "framework": "pytorch",
-        }
-        state_dicts = []
-        for pretrained_model_name_or_path_or_dict, weight_name, subfolder in zip(
-            pretrained_model_name_or_path_or_dict, weight_name, subfolder
-        ):
-            if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                model_file = _get_model_file(
-                    pretrained_model_name_or_path_or_dict,
-                    weights_name=weight_name,
-                    cache_dir=cache_dir,
-                    force_download=force_download,
-                    proxies=proxies,
-                    local_files_only=local_files_only,
-                    token=token,
-                    revision=revision,
-                    subfolder=subfolder,
-                    user_agent=user_agent,
-                )
-                if weight_name.endswith(".safetensors"):
-                    state_dict = {"image_proj": {}, "ip_adapter": {}}
-                    with safe_open(model_file, framework="pt", device="cpu") as f:
-                        for key in f.keys():
-                            if key.startswith("image_proj."):
-                                state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
-                            elif key.startswith("ip_adapter."):
-                                state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
-                else:
-                    state_dict = load_state_dict(model_file)
-            else:
-                state_dict = pretrained_model_name_or_path_or_dict
-
-            keys = list(state_dict.keys())
-            if "image_proj" not in keys and "ip_adapter" not in keys:
-                raise ValueError("Required keys are (`image_proj` and `ip_adapter`) missing from the state dict.")
-
-            state_dicts.append(state_dict)
-
-            # load CLIP image encoder here if it has not been registered to the pipeline yet
-            if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
-                if image_encoder_folder is not None:
-                    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                        logger.info(f"loading image_encoder from {pretrained_model_name_or_path_or_dict}")
-                        if image_encoder_folder.count("/") == 0:
-                            image_encoder_subfolder = Path(subfolder, image_encoder_folder).as_posix()
-                        else:
-                            image_encoder_subfolder = Path(image_encoder_folder).as_posix()
-
-                        image_encoder = CLIPVisionModelWithProjection.from_pretrained(
-                            pretrained_model_name_or_path_or_dict,
-                            subfolder=image_encoder_subfolder,
-                            low_cpu_mem_usage=low_cpu_mem_usage,
-                            cache_dir=cache_dir,
-                            local_files_only=local_files_only,
-                            torch_dtype=self.dtype,
-                        ).to(self.device)
-                        self.register_modules(image_encoder=image_encoder)
-                    else:
-                        raise ValueError(
-                            "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
-                        )
-                else:
-                    logger.warning(
-                        "image_encoder is not loaded since `image_encoder_folder=None` was passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
-                        "Use `ip_adapter_image_embeds` to pass pre-generated image embeddings instead."
-                    )
-
-            # create feature extractor if it has not been registered to the pipeline yet
-            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
-                # FaceID IP adapters don't need the image encoder, so it may not be present; in that case we default to 224
-                default_clip_size = 224
-                clip_image_size = (
-                    self.image_encoder.config.image_size if self.image_encoder is not None else default_clip_size
-                )
-                feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
-                self.register_modules(feature_extractor=feature_extractor)
-
-        # load ip-adapter into unet
-        unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
-        unet._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
-
-        extra_loras = unet._load_ip_adapter_loras(state_dicts)
-        if extra_loras != {}:
-            if not USE_PEFT_BACKEND:
-                logger.warning("PEFT backend is required to load these weights.")
-            else:
-                # apply the IP Adapter Face ID LoRA weights
-                peft_config = getattr(unet, "peft_config", {})
-                for k, lora in extra_loras.items():
-                    if f"faceid_{k}" not in peft_config:
-                        self.load_lora_weights(lora, adapter_name=f"faceid_{k}")
-                        self.set_adapters([f"faceid_{k}"], adapter_weights=[1.0])
-
-    def set_ip_adapter_scale(self, scale):
-        """
-        Set IP-Adapter scales per-transformer block. Input `scale` could be a single config or a list of configs for
-        granular control over each IP-Adapter behavior. A config can be a float or a dictionary.
-
-        Example:
-
-        ```py
-        # To use original IP-Adapter
-        scale = 1.0
-        pipeline.set_ip_adapter_scale(scale)
-
-        # To use style block only
-        scale = {
-            "up": {"block_0": [0.0, 1.0, 0.0]},
-        }
-        pipeline.set_ip_adapter_scale(scale)
-
-        # To use style+layout blocks
-        scale = {
-            "down": {"block_2": [0.0, 1.0]},
-            "up": {"block_0": [0.0, 1.0, 0.0]},
-        }
-        pipeline.set_ip_adapter_scale(scale)
-
-        # To use style and layout from 2 reference images
-        scales = [{"down": {"block_2": [0.0, 1.0]}}, {"up": {"block_0": [0.0, 1.0, 0.0]}}]
-        pipeline.set_ip_adapter_scale(scales)
-        ```
-        """
-        unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
-        if not isinstance(scale, list):
-            scale = [scale]
-        scale_configs = _maybe_expand_lora_scales(unet, scale, default_scale=0.0)
-
-        for attn_name, attn_processor in unet.attn_processors.items():
-            if isinstance(
-                attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0, IPAdapterXFormersAttnProcessor)
-            ):
-                if len(scale_configs) != len(attn_processor.scale):
-                    raise ValueError(
-                        f"Cannot assign {len(scale_configs)} scale_configs to {len(attn_processor.scale)} IP-Adapter."
-                    )
-                elif len(scale_configs) == 1:
-                    scale_configs = scale_configs * len(attn_processor.scale)
-                for i, scale_config in enumerate(scale_configs):
-                    if isinstance(scale_config, dict):
-                        for k, s in scale_config.items():
-                            if attn_name.startswith(k):
-                                attn_processor.scale[i] = s
-                    else:
-                        attn_processor.scale[i] = scale_config
-
-    def unload_ip_adapter(self):
-        """
-        Unloads the IP Adapter weights
-
-        Examples:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.unload_ip_adapter()
-        >>> ...
-        ```
-        """
-        # remove CLIP image encoder
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
-            self.image_encoder = None
-            self.register_to_config(image_encoder=[None, None])
-
-        # remove feature extractor only when the pipeline has no safety_checker, as safety_checker uses
-        # the feature_extractor later
-        if not hasattr(self, "safety_checker"):
-            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
-                self.feature_extractor = None
-                self.register_to_config(feature_extractor=[None, None])
-
-        # remove hidden encoder
-        self.unet.encoder_hid_proj = None
-        self.unet.config.encoder_hid_dim_type = None
-
-        # Kolors: restore `encoder_hid_proj` with `text_encoder_hid_proj`
-        if hasattr(self.unet, "text_encoder_hid_proj") and self.unet.text_encoder_hid_proj is not None:
-            self.unet.encoder_hid_proj = self.unet.text_encoder_hid_proj
-            self.unet.text_encoder_hid_proj = None
-            self.unet.config.encoder_hid_dim_type = "text_proj"
-
-        # restore original Unet attention processors layers
-        attn_procs = {}
-        for name, value in self.unet.attn_processors.items():
-            attn_processor_class = (
-                AttnProcessor2_0() if hasattr(F, "scaled_dot_product_attention") else AttnProcessor()
-            )
-            attn_procs[name] = (
-                attn_processor_class
-                if isinstance(
-                    value, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0, IPAdapterXFormersAttnProcessor)
-                )
-                else value.__class__()
-            )
-        self.unet.set_attn_processor(attn_procs)
-
-
-class FluxIPAdapterMixin:
-    """Mixin for handling Flux IP Adapters."""
-
-    @validate_hf_hub_args
-    def load_ip_adapter(
-        self,
-        pretrained_model_name_or_path_or_dict: Union[str, List[str], Dict[str, torch.Tensor]],
-        weight_name: Union[str, List[str]],
-        subfolder: Optional[Union[str, List[str]]] = "",
-        image_encoder_pretrained_model_name_or_path: Optional[str] = "image_encoder",
-        image_encoder_subfolder: Optional[str] = "",
-        image_encoder_dtype: torch.dtype = torch.float16,
-        **kwargs,
-    ):
-        """
-        Parameters:
-            pretrained_model_name_or_path_or_dict (`str` or `List[str]` or `os.PathLike` or `List[os.PathLike]` or `dict` or `List[dict]`):
-                Can be either:
-
-                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
-                      the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-                    - A [torch state
-                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
-            subfolder (`str` or `List[str]`):
-                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
-                list is passed, it should have the same length as `weight_name`.
-            weight_name (`str` or `List[str]`):
-                The name of the weight file to load. If a list is passed, it should have the same length as
-                `subfolder`.
-            image_encoder_pretrained_model_name_or_path (`str`, *optional*, defaults to `image_encoder`):
-                Can be either:
-
-                    - A string, the *model id* (for example `openai/clip-vit-large-patch14`) of a pretrained model
-                      hosted on the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to `True`, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-
-        # handle the list inputs for multiple IP Adapters
-        if not isinstance(weight_name, list):
-            weight_name = [weight_name]
-
-        if not isinstance(pretrained_model_name_or_path_or_dict, list):
-            pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
-        if len(pretrained_model_name_or_path_or_dict) == 1:
-            pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
-
-        if not isinstance(subfolder, list):
-            subfolder = [subfolder]
-        if len(subfolder) == 1:
-            subfolder = subfolder * len(weight_name)
-
-        if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
-            raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
-
-        if len(weight_name) != len(subfolder):
-            raise ValueError("`weight_name` and `subfolder` must have the same length.")
-
-        # Load the main state dict first.
-        cache_dir = kwargs.pop("cache_dir", None)
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        token = kwargs.pop("token", None)
-        revision = kwargs.pop("revision", None)
-        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
-
-        if low_cpu_mem_usage and not is_accelerate_available():
-            low_cpu_mem_usage = False
-            logger.warning(
-                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                " install accelerate\n```\n."
-            )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        user_agent = {
-            "file_type": "attn_procs_weights",
-            "framework": "pytorch",
-        }
-        state_dicts = []
-        for pretrained_model_name_or_path_or_dict, weight_name, subfolder in zip(
-            pretrained_model_name_or_path_or_dict, weight_name, subfolder
-        ):
-            if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                model_file = _get_model_file(
-                    pretrained_model_name_or_path_or_dict,
-                    weights_name=weight_name,
-                    cache_dir=cache_dir,
-                    force_download=force_download,
-                    proxies=proxies,
-                    local_files_only=local_files_only,
-                    token=token,
-                    revision=revision,
-                    subfolder=subfolder,
-                    user_agent=user_agent,
-                )
-                if weight_name.endswith(".safetensors"):
-                    state_dict = {"image_proj": {}, "ip_adapter": {}}
-                    with safe_open(model_file, framework="pt", device="cpu") as f:
-                        image_proj_keys = ["ip_adapter_proj_model.", "image_proj."]
-                        ip_adapter_keys = ["double_blocks.", "ip_adapter."]
-                        for key in f.keys():
-                            if any(key.startswith(prefix) for prefix in image_proj_keys):
-                                diffusers_name = ".".join(key.split(".")[1:])
-                                state_dict["image_proj"][diffusers_name] = f.get_tensor(key)
-                            elif any(key.startswith(prefix) for prefix in ip_adapter_keys):
-                                diffusers_name = (
-                                    ".".join(key.split(".")[1:])
-                                    .replace("ip_adapter_double_stream_k_proj", "to_k_ip")
-                                    .replace("ip_adapter_double_stream_v_proj", "to_v_ip")
-                                    .replace("processor.", "")
-                                )
-                                state_dict["ip_adapter"][diffusers_name] = f.get_tensor(key)
-                else:
-                    state_dict = load_state_dict(model_file)
-            else:
-                state_dict = pretrained_model_name_or_path_or_dict
-
-            keys = list(state_dict.keys())
-            if keys != ["image_proj", "ip_adapter"]:
-                raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
-
-            state_dicts.append(state_dict)
-
-            # load CLIP image encoder here if it has not been registered to the pipeline yet
-            if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
-                if image_encoder_pretrained_model_name_or_path is not None:
-                    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                        logger.info(f"loading image_encoder from {image_encoder_pretrained_model_name_or_path}")
-                        image_encoder = (
-                            CLIPVisionModelWithProjection.from_pretrained(
-                                image_encoder_pretrained_model_name_or_path,
-                                subfolder=image_encoder_subfolder,
-                                low_cpu_mem_usage=low_cpu_mem_usage,
-                                cache_dir=cache_dir,
-                                local_files_only=local_files_only,
-                                torch_dtype=image_encoder_dtype,
-                            )
-                            .to(self.device)
-                            .eval()
-                        )
-                        self.register_modules(image_encoder=image_encoder)
-                    else:
-                        raise ValueError(
-                            "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
-                        )
-                else:
-                    logger.warning(
-                        "image_encoder is not loaded since `image_encoder_pretrained_model_name_or_path=None` was passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
-                        "Use `ip_adapter_image_embeds` to pass pre-generated image embeddings instead."
-                    )
-
-            # create feature extractor if it has not been registered to the pipeline yet
-            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
-                # FaceID IP adapters don't need the image encoder, so it may not be present; in that case we default to 224
-                default_clip_size = 224
-                clip_image_size = (
-                    self.image_encoder.config.image_size if self.image_encoder is not None else default_clip_size
-                )
-                feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
-                self.register_modules(feature_extractor=feature_extractor)
-
-        # load ip-adapter into transformer
-        self.transformer._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
-
-    def set_ip_adapter_scale(self, scale: Union[float, List[float], List[List[float]]]):
-        """
-        Set IP-Adapter scales per-transformer block. Input `scale` could be a single config or a list of configs for
-        granular control over each IP-Adapter behavior. A config can be a float or a list.
-
-        A `float` is converted to a list and repeated for the number of blocks and the number of IP adapters. A
-        `List[float]` must match the number of blocks in length and is repeated for each IP adapter. A
-        `List[List[float]]` must match the number of IP adapters, and each inner list must match the number of blocks.
-
-        Example:
-
-        ```py
-        # To use original IP-Adapter
-        scale = 1.0
-        pipeline.set_ip_adapter_scale(scale)
-
-
-        def LinearStrengthModel(start, finish, size):
-            return [(start + (finish - start) * (i / (size - 1))) for i in range(size)]
-
-
-        ip_strengths = LinearStrengthModel(0.3, 0.92, 19)
-        pipeline.set_ip_adapter_scale(ip_strengths)
-        ```
-        """
-
-        scale_type = Union[int, float]
-        num_ip_adapters = self.transformer.encoder_hid_proj.num_ip_adapters
-        num_layers = self.transformer.config.num_layers
-
-        # Single value for all layers of all IP-Adapters
-        if isinstance(scale, scale_type):
-            scale = [scale for _ in range(num_ip_adapters)]
-        # List of per-layer scales for a single IP-Adapter
-        elif _is_valid_type(scale, List[scale_type]) and num_ip_adapters == 1:
-            scale = [scale]
-        # Invalid scale type
-        elif not _is_valid_type(scale, List[Union[scale_type, List[scale_type]]]):
-            raise TypeError(f"Unexpected type {_get_detailed_type(scale)} for scale.")
-
-        if len(scale) != num_ip_adapters:
-            raise ValueError(f"Cannot assign {len(scale)} scales to {num_ip_adapters} IP-Adapters.")
-
-        if any(len(s) != num_layers for s in scale if isinstance(s, list)):
-            invalid_scale_sizes = {len(s) for s in scale if isinstance(s, list)} - {num_layers}
-            raise ValueError(
-                f"Expected list of {num_layers} scales, got {', '.join(str(x) for x in invalid_scale_sizes)}."
-            )
-
-        # Scalars are transformed to lists with length num_layers
-        scale_configs = [[s] * num_layers if isinstance(s, scale_type) else s for s in scale]
-
-        # Set scales. zip over scale_configs prevents going into single transformer layers
-        for attn_processor, *scale in zip(self.transformer.attn_processors.values(), *scale_configs):
-            attn_processor.scale = scale
-
-    def unload_ip_adapter(self):
-        """
-        Unloads the IP Adapter weights
-
-        Examples:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.unload_ip_adapter()
-        >>> ...
-        ```
-        """
-        # remove CLIP image encoder
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
-            self.image_encoder = None
-            self.register_to_config(image_encoder=[None, None])
-
-        # remove feature extractor only when the pipeline has no safety_checker, as safety_checker uses
-        # the feature_extractor later
-        if not hasattr(self, "safety_checker"):
-            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
-                self.feature_extractor = None
-                self.register_to_config(feature_extractor=[None, None])
-
-        # remove hidden encoder
-        self.transformer.encoder_hid_proj = None
-        self.transformer.config.encoder_hid_dim_type = None
-
-        # restore original Transformer attention processors layers
-        attn_procs = {}
-        for name, value in self.transformer.attn_processors.items():
-            attn_processor_class = FluxAttnProcessor2_0()
-            attn_procs[name] = (
-                attn_processor_class if isinstance(value, (FluxIPAdapterJointAttnProcessor2_0)) else value.__class__()
-            )
-        self.transformer.set_attn_processor(attn_procs)
-
-
-class SD3IPAdapterMixin:
-    """Mixin for handling StableDiffusion 3 IP Adapters."""
-
-    @property
-    def is_ip_adapter_active(self) -> bool:
-        """Checks if IP-Adapter is loaded and scale > 0.
-
-        IP-Adapter scale controls the influence of the image prompt versus text prompt. When this value is set to 0,
-        the image context is irrelevant.
-
-        Returns:
-            `bool`: True when IP-Adapter is loaded and any layer has scale > 0.
-        """
-        scales = [
-            attn_proc.scale
-            for attn_proc in self.transformer.attn_processors.values()
-            if isinstance(attn_proc, SD3IPAdapterJointAttnProcessor2_0)
-        ]
-
-        return len(scales) > 0 and any(scale > 0 for scale in scales)
-
-    @validate_hf_hub_args
-    def load_ip_adapter(
-        self,
-        pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]],
-        weight_name: str = "ip-adapter.safetensors",
-        subfolder: Optional[str] = None,
-        image_encoder_folder: Optional[str] = "image_encoder",
-        **kwargs,
-    ) -> None:
-        """
-        Parameters:
-            pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`):
-                Can be either:
-                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
-                      the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-                    - A [torch state
-                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
-            weight_name (`str`, defaults to "ip-adapter.safetensors"):
-                The name of the weight file to load. If a list is passed, it should have the same length as
-                `subfolder`.
-            subfolder (`str`, *optional*):
-                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
-                list is passed, it should have the same length as `weight_name`.
-            image_encoder_folder (`str`, *optional*, defaults to `image_encoder`):
-                The subfolder location of the image encoder within a larger model repository on the Hub or locally.
-                Pass `None` to not load the image encoder. If the image encoder is located in a folder inside
-                `subfolder`, you only need to pass the name of the folder that contains image encoder weights, e.g.
-                `image_encoder_folder="image_encoder"`. If the image encoder is located in a folder other than
-                `subfolder`, you should pass the path to the folder that contains image encoder weights, for example,
-                `image_encoder_folder="different_subfolder/image_encoder"`.
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to `True`, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-        # Load the main state dict first
-        cache_dir = kwargs.pop("cache_dir", None)
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        token = kwargs.pop("token", None)
-        revision = kwargs.pop("revision", None)
-        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
-
-        if low_cpu_mem_usage and not is_accelerate_available():
-            low_cpu_mem_usage = False
-            logger.warning(
-                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                " install accelerate\n```\n."
-            )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        user_agent = {
-            "file_type": "attn_procs_weights",
-            "framework": "pytorch",
-        }
-
-        if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-            model_file = _get_model_file(
-                pretrained_model_name_or_path_or_dict,
-                weights_name=weight_name,
-                cache_dir=cache_dir,
-                force_download=force_download,
-                proxies=proxies,
-                local_files_only=local_files_only,
-                token=token,
-                revision=revision,
-                subfolder=subfolder,
-                user_agent=user_agent,
-            )
-            if weight_name.endswith(".safetensors"):
-                state_dict = {"image_proj": {}, "ip_adapter": {}}
-                with safe_open(model_file, framework="pt", device="cpu") as f:
-                    for key in f.keys():
-                        if key.startswith("image_proj."):
-                            state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
-                        elif key.startswith("ip_adapter."):
-                            state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
-            else:
-                state_dict = load_state_dict(model_file)
-        else:
-            state_dict = pretrained_model_name_or_path_or_dict
-
-        keys = list(state_dict.keys())
-        if "image_proj" not in keys and "ip_adapter" not in keys:
-            raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
-
-        # Load image_encoder and feature_extractor here if they haven't been registered to the pipeline yet
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
-            if image_encoder_folder is not None:
-                if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                    logger.info(f"loading image_encoder from {pretrained_model_name_or_path_or_dict}")
-                    if image_encoder_folder.count("/") == 0:
-                        image_encoder_subfolder = Path(subfolder, image_encoder_folder).as_posix()
-                    else:
-                        image_encoder_subfolder = Path(image_encoder_folder).as_posix()
-
-                    # Common args for loading image encoder and image processor
-                    kwargs = {
-                        "low_cpu_mem_usage": low_cpu_mem_usage,
-                        "cache_dir": cache_dir,
-                        "local_files_only": local_files_only,
-                    }
-
-                    self.register_modules(
-                        feature_extractor=SiglipImageProcessor.from_pretrained(image_encoder_subfolder, **kwargs),
-                        image_encoder=SiglipVisionModel.from_pretrained(
-                            image_encoder_subfolder, torch_dtype=self.dtype, **kwargs
-                        ).to(self.device),
-                    )
-                else:
-                    raise ValueError(
-                        "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
-                    )
-            else:
-                logger.warning(
-                    "image_encoder is not loaded since `image_encoder_folder=None` passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter."
-                    "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
-                )
-
-        # Load IP-Adapter into transformer
-        self.transformer._load_ip_adapter_weights(state_dict, low_cpu_mem_usage=low_cpu_mem_usage)
-
-    def set_ip_adapter_scale(self, scale: float) -> None:
-        """
-        Set IP-Adapter scale, which controls image prompt conditioning. A value of 1.0 means the model is only
-        conditioned on the image prompt, and 0.0 only conditioned by the text prompt. Lowering this value encourages
-        the model to produce more diverse images, but they may not be as aligned with the image prompt.
-
-        Example:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.set_ip_adapter_scale(0.6)
-        >>> ...
-        ```
-
-        Args:
-            scale (float):
-                IP-Adapter scale to be set.
-
-        """
-        for attn_processor in self.transformer.attn_processors.values():
-            if isinstance(attn_processor, SD3IPAdapterJointAttnProcessor2_0):
-                attn_processor.scale = scale
-
-    def unload_ip_adapter(self) -> None:
-        """
-        Unloads the IP Adapter weights.
-
-        Example:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.unload_ip_adapter()
-        >>> ...
-        ```
-        """
-        # Remove image encoder
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
-            self.image_encoder = None
-            self.register_to_config(image_encoder=None)
-
-        # Remove feature extractor
-        if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
-            self.feature_extractor = None
-            self.register_to_config(feature_extractor=None)
-
-        # Remove image projection
-        self.transformer.image_proj = None
-
-        # Restore original attention processors layers
-        attn_procs = {
-            name: (
-                JointAttnProcessor2_0() if isinstance(value, SD3IPAdapterJointAttnProcessor2_0) else value.__class__()
-            )
-            for name, value in self.transformer.attn_processors.items()
-        }
-        self.transformer.set_attn_processor(attn_procs)
+class SD3IPAdapterMixin(SD3IPAdapterMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `SD3IPAdapterMixin` from diffusers.loaders.ip_adapter has been deprecated. Please use `from diffusers.loaders.ip_adapter.ip_adapter import SD3IPAdapterMixin` instead."
+        deprecate("diffusers.loaders.ip_adapter.SD3IPAdapterMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)
@@ -0,0 +1,9 @@
+from ...utils.import_utils import is_torch_available, is_transformers_available
+
+
+if is_torch_available():
+    from .transformer_flux import FluxTransformer2DLoadersMixin
+    from .transformer_sd3 import SD3Transformer2DLoadersMixin
+
+    if is_transformers_available():
+        from .ip_adapter import FluxIPAdapterMixin, IPAdapterMixin, SD3IPAdapterMixin
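Review note: the new `__init__.py` above exposes the IP-Adapter mixins only when both torch and transformers are installed. Here is a rough standalone sketch of that availability gating. It assumes a hypothetical `is_available` helper built on `importlib.util.find_spec`; diffusers' own `is_torch_available` and `is_transformers_available` may be implemented differently. The module and class names in the sketch are placeholders.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


exported = []  # names this package would expose

if is_available("json"):  # stands in for the torch check
    exported.append("TransformerLoadersMixin")

    if is_available("definitely_not_installed_pkg"):  # stands in for transformers
        exported.append("IPAdapterMixin")
```

Nesting the second check inside the first mirrors the hunk: the IP-Adapter mixins depend on both packages, so they are skipped whenever either one is missing.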
@@ -0,0 +1,879 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from pathlib import Path
+from typing import Dict, List, Optional, Union
+
+import torch
+import torch.nn.functional as F
+from huggingface_hub.utils import validate_hf_hub_args
+from safetensors import safe_open
+
+from ...models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_state_dict
+from ...utils import (
+    USE_PEFT_BACKEND,
+    _get_detailed_type,
+    _get_model_file,
+    _is_valid_type,
+    is_accelerate_available,
+    is_torch_version,
+    is_transformers_available,
+    logging,
+)
+from ..unet.unet_loader_utils import _maybe_expand_lora_scales
+
+
+if is_transformers_available():
+    from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection, SiglipImageProcessor, SiglipVisionModel
+
+from ...models.attention_processor import (
+    AttnProcessor,
+    AttnProcessor2_0,
+    FluxAttnProcessor2_0,
+    FluxIPAdapterJointAttnProcessor2_0,
+    IPAdapterAttnProcessor,
+    IPAdapterAttnProcessor2_0,
+    IPAdapterXFormersAttnProcessor,
+    JointAttnProcessor2_0,
+    SD3IPAdapterJointAttnProcessor2_0,
+)
+
+
+logger = logging.get_logger(__name__)
+
+
+class IPAdapterMixin:
+    """Mixin for handling IP Adapters."""
+
+    @validate_hf_hub_args
+    def load_ip_adapter(
+        self,
+        pretrained_model_name_or_path_or_dict: Union[str, List[str], Dict[str, torch.Tensor]],
+        subfolder: Union[str, List[str]],
+        weight_name: Union[str, List[str]],
+        image_encoder_folder: Optional[str] = "image_encoder",
+        **kwargs,
+    ):
+        """
+        Parameters:
+            pretrained_model_name_or_path_or_dict (`str` or `List[str]` or `os.PathLike` or `List[os.PathLike]` or `dict` or `List[dict]`):
+                Can be either:
+
+                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
+                      the Hub.
+                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
+                      with [`ModelMixin.save_pretrained`].
+                    - A [torch state
+                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
+            subfolder (`str` or `List[str]`):
+                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
+                list is passed, it should have the same length as `weight_name`.
+            weight_name (`str` or `List[str]`):
+                The name of the weight file to load. If a list is passed, it should have the same length as
+                `subfolder`.
+            image_encoder_folder (`str`, *optional*, defaults to `image_encoder`):
+                The subfolder location of the image encoder within a larger model repository on the Hub or locally.
+                Pass `None` to not load the image encoder. If the image encoder is located in a folder inside
+                `subfolder`, you only need to pass the name of the folder that contains image encoder weights, e.g.
+                `image_encoder_folder="image_encoder"`. If the image encoder is located in a folder other than
+                `subfolder`, you should pass the path to the folder that contains image encoder weights, for example,
+                `image_encoder_folder="different_subfolder/image_encoder"`.
+            cache_dir (`Union[str, os.PathLike]`, *optional*):
+                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
+                is not used.
+            force_download (`bool`, *optional*, defaults to `False`):
+                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
+                cached versions if they exist.
+
+            proxies (`Dict[str, str]`, *optional*):
+                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
+            local_files_only (`bool`, *optional*, defaults to `False`):
+                Whether to only load local model weights and configuration files or not. If set to `True`, the model
+                won't be downloaded from the Hub.
+            token (`str` or *bool*, *optional*):
+                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
+                `diffusers-cli login` (stored in `~/.huggingface`) is used.
+            revision (`str`, *optional*, defaults to `"main"`):
+                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
+                allowed by Git.
+            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
+                Speed up model loading by only loading the pretrained weights and not initializing the weights. This
+                also tries to not use more than 1x model size in CPU memory (including peak memory) while loading the
+                model. Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
+                argument to `True` will raise an error.
+        """
+
+        # handle the list inputs for multiple IP Adapters
+        if not isinstance(weight_name, list):
+            weight_name = [weight_name]
+
+        if not isinstance(pretrained_model_name_or_path_or_dict, list):
+            pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
+        if len(pretrained_model_name_or_path_or_dict) == 1:
+            pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
+
+        if not isinstance(subfolder, list):
+            subfolder = [subfolder]
+        if len(subfolder) == 1:
+            subfolder = subfolder * len(weight_name)
+
+        if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
+            raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
+
+        if len(weight_name) != len(subfolder):
+            raise ValueError("`weight_name` and `subfolder` must have the same length.")
+
+        # Load the main state dict first.
+        cache_dir = kwargs.pop("cache_dir", None)
+        force_download = kwargs.pop("force_download", False)
+        proxies = kwargs.pop("proxies", None)
+        local_files_only = kwargs.pop("local_files_only", None)
+        token = kwargs.pop("token", None)
+        revision = kwargs.pop("revision", None)
+        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
+
+        if low_cpu_mem_usage and not is_accelerate_available():
+            low_cpu_mem_usage = False
+            logger.warning(
+                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                " install accelerate\n```\n."
+            )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        user_agent = {
+            "file_type": "attn_procs_weights",
+            "framework": "pytorch",
+        }
+        state_dicts = []
+        for pretrained_model_name_or_path_or_dict, weight_name, subfolder in zip(
+            pretrained_model_name_or_path_or_dict, weight_name, subfolder
+        ):
+            if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+                model_file = _get_model_file(
+                    pretrained_model_name_or_path_or_dict,
+                    weights_name=weight_name,
+                    cache_dir=cache_dir,
+                    force_download=force_download,
+                    proxies=proxies,
+                    local_files_only=local_files_only,
+                    token=token,
+                    revision=revision,
+                    subfolder=subfolder,
+                    user_agent=user_agent,
+                )
+                if weight_name.endswith(".safetensors"):
+                    state_dict = {"image_proj": {}, "ip_adapter": {}}
+                    with safe_open(model_file, framework="pt", device="cpu") as f:
+                        for key in f.keys():
+                            if key.startswith("image_proj."):
+                                state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
+                            elif key.startswith("ip_adapter."):
+                                state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
+                else:
+                    state_dict = load_state_dict(model_file)
+            else:
+                state_dict = pretrained_model_name_or_path_or_dict
+
+            keys = list(state_dict.keys())
+            if "image_proj" not in keys or "ip_adapter" not in keys:
+                raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
+
+            state_dicts.append(state_dict)
+
+            # load CLIP image encoder here if it has not been registered to the pipeline yet
+            if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
+                if image_encoder_folder is not None:
+                    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+                        logger.info(f"loading image_encoder from {pretrained_model_name_or_path_or_dict}")
+                        if image_encoder_folder.count("/") == 0:
+                            image_encoder_subfolder = Path(subfolder, image_encoder_folder).as_posix()
+                        else:
+                            image_encoder_subfolder = Path(image_encoder_folder).as_posix()
+
+                        image_encoder = CLIPVisionModelWithProjection.from_pretrained(
+                            pretrained_model_name_or_path_or_dict,
+                            subfolder=image_encoder_subfolder,
+                            low_cpu_mem_usage=low_cpu_mem_usage,
+                            cache_dir=cache_dir,
+                            local_files_only=local_files_only,
+                            torch_dtype=self.dtype,
+                        ).to(self.device)
+                        self.register_modules(image_encoder=image_encoder)
+                    else:
+                        raise ValueError(
+                            "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
+                        )
+                else:
+                    logger.warning(
+                        "image_encoder is not loaded since `image_encoder_folder=None` was passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
+                        "Use `ip_adapter_image_embeds` to pass pre-generated image embeddings instead."
+                    )
+
+            # create feature extractor if it has not been registered to the pipeline yet
+            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
+                # FaceID IP adapters don't need the image encoder so it's not present, in this case we default to 224
+                default_clip_size = 224
+                clip_image_size = (
+                    self.image_encoder.config.image_size if self.image_encoder is not None else default_clip_size
+                )
+                feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
+                self.register_modules(feature_extractor=feature_extractor)
+
+        # load ip-adapter into unet
+        unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
+        unet._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
+
+        extra_loras = unet._load_ip_adapter_loras(state_dicts)
+        if extra_loras != {}:
+            if not USE_PEFT_BACKEND:
+                logger.warning("PEFT backend is required to load these weights.")
+            else:
+                # apply the IP Adapter Face ID LoRA weights
+                peft_config = getattr(unet, "peft_config", {})
+                for k, lora in extra_loras.items():
+                    if f"faceid_{k}" not in peft_config:
+                        self.load_lora_weights(lora, adapter_name=f"faceid_{k}")
+                        self.set_adapters([f"faceid_{k}"], adapter_weights=[1.0])
+
+    def set_ip_adapter_scale(self, scale):
+        """
+        Set IP-Adapter scales per-transformer block. Input `scale` could be a single config or a list of configs for
+        granular control over each IP-Adapter behavior. A config can be a float or a dictionary.
+
+        Example:
+
+        ```py
+        # To use original IP-Adapter
+        scale = 1.0
+        pipeline.set_ip_adapter_scale(scale)
+
+        # To use style block only
+        scale = {
+            "up": {"block_0": [0.0, 1.0, 0.0]},
+        }
+        pipeline.set_ip_adapter_scale(scale)
+
+        # To use style+layout blocks
+        scale = {
+            "down": {"block_2": [0.0, 1.0]},
+            "up": {"block_0": [0.0, 1.0, 0.0]},
+        }
+        pipeline.set_ip_adapter_scale(scale)
+
+        # To use style and layout from 2 reference images
+        scales = [{"down": {"block_2": [0.0, 1.0]}}, {"up": {"block_0": [0.0, 1.0, 0.0]}}]
+        pipeline.set_ip_adapter_scale(scales)
+        ```
+        """
+        unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
+        if not isinstance(scale, list):
+            scale = [scale]
+        scale_configs = _maybe_expand_lora_scales(unet, scale, default_scale=0.0)
+
+        for attn_name, attn_processor in unet.attn_processors.items():
+            if isinstance(
+                attn_processor, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0, IPAdapterXFormersAttnProcessor)
+            ):
+                if len(scale_configs) != len(attn_processor.scale):
+                    raise ValueError(
+                        f"Cannot assign {len(scale_configs)} scale_configs to {len(attn_processor.scale)} IP-Adapter."
+                    )
+                elif len(scale_configs) == 1:
+                    scale_configs = scale_configs * len(attn_processor.scale)
+                for i, scale_config in enumerate(scale_configs):
+                    if isinstance(scale_config, dict):
+                        for k, s in scale_config.items():
+                            if attn_name.startswith(k):
+                                attn_processor.scale[i] = s
+                    else:
+                        attn_processor.scale[i] = scale_config
+
+    def unload_ip_adapter(self):
+        """
+        Unloads the IP Adapter weights.
+
+        Examples:
+
+        ```python
+        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
+        >>> pipeline.unload_ip_adapter()
+        >>> ...
+        ```
+        """
+        # remove CLIP image encoder
+        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
+            self.image_encoder = None
+            self.register_to_config(image_encoder=[None, None])
+
+        # remove feature extractor only when the pipeline has no `safety_checker` attribute, as the
+        # safety checker uses the feature_extractor later
+        if not hasattr(self, "safety_checker"):
+            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
+                self.feature_extractor = None
+                self.register_to_config(feature_extractor=[None, None])
+
+        # remove hidden encoder
+        self.unet.encoder_hid_proj = None
+        self.unet.config.encoder_hid_dim_type = None
+
+        # Kolors: restore `encoder_hid_proj` with `text_encoder_hid_proj`
+        if hasattr(self.unet, "text_encoder_hid_proj") and self.unet.text_encoder_hid_proj is not None:
+            self.unet.encoder_hid_proj = self.unet.text_encoder_hid_proj
+            self.unet.text_encoder_hid_proj = None
+            self.unet.config.encoder_hid_dim_type = "text_proj"
+
+        # restore original Unet attention processors layers
+        attn_procs = {}
+        for name, value in self.unet.attn_processors.items():
+            attn_processor_class = (
+                AttnProcessor2_0() if hasattr(F, "scaled_dot_product_attention") else AttnProcessor()
+            )
+            attn_procs[name] = (
+                attn_processor_class
+                if isinstance(
+                    value, (IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0, IPAdapterXFormersAttnProcessor)
+                )
+                else value.__class__()
+            )
+        self.unet.set_attn_processor(attn_procs)
+
+
+class FluxIPAdapterMixin:
+    """Mixin for handling Flux IP Adapters."""
+
+    @validate_hf_hub_args
+    def load_ip_adapter(
+        self,
+        pretrained_model_name_or_path_or_dict: Union[str, List[str], Dict[str, torch.Tensor]],
+        weight_name: Union[str, List[str]],
+        subfolder: Optional[Union[str, List[str]]] = "",
+        image_encoder_pretrained_model_name_or_path: Optional[str] = "image_encoder",
+        image_encoder_subfolder: Optional[str] = "",
+        image_encoder_dtype: torch.dtype = torch.float16,
+        **kwargs,
+    ):
+        """
+        Parameters:
+            pretrained_model_name_or_path_or_dict (`str` or `List[str]` or `os.PathLike` or `List[os.PathLike]` or `dict` or `List[dict]`):
+                Can be either:
+
+                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
+                      the Hub.
+                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
+                      with [`ModelMixin.save_pretrained`].
+                    - A [torch state
+                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
+            subfolder (`str` or `List[str]`):
+                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
+                list is passed, it should have the same length as `weight_name`.
+            weight_name (`str` or `List[str]`):
+                The name of the weight file to load. If a list is passed, it should have the same length as
+                `subfolder`.
+            image_encoder_pretrained_model_name_or_path (`str`, *optional*, defaults to `image_encoder`):
+                Can be either:
+
+                    - A string, the *model id* (for example `openai/clip-vit-large-patch14`) of a pretrained model
+                      hosted on the Hub.
+                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
+                      with [`ModelMixin.save_pretrained`].
+            cache_dir (`Union[str, os.PathLike]`, *optional*):
+                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
+                is not used.
+            force_download (`bool`, *optional*, defaults to `False`):
+                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
+                cached versions if they exist.
+
+            proxies (`Dict[str, str]`, *optional*):
+                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
+            local_files_only (`bool`, *optional*, defaults to `False`):
+                Whether to only load local model weights and configuration files or not. If set to `True`, the model
+                won't be downloaded from the Hub.
+            token (`str` or *bool*, *optional*):
+                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
+                `diffusers-cli login` (stored in `~/.huggingface`) is used.
+            revision (`str`, *optional*, defaults to `"main"`):
+                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
+                allowed by Git.
+            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
+                Speed up model loading by only loading the pretrained weights and not initializing the weights. This
+                also tries to not use more than 1x model size in CPU memory (including peak memory) while loading the
+                model. Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
+                argument to `True` will raise an error.
+        """
+
+        # handle the list inputs for multiple IP Adapters
+        if not isinstance(weight_name, list):
+            weight_name = [weight_name]
+
+        if not isinstance(pretrained_model_name_or_path_or_dict, list):
+            pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
+        if len(pretrained_model_name_or_path_or_dict) == 1:
+            pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
+
+        if not isinstance(subfolder, list):
+            subfolder = [subfolder]
+        if len(subfolder) == 1:
+            subfolder = subfolder * len(weight_name)
+
+        if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
+            raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
+
+        if len(weight_name) != len(subfolder):
+            raise ValueError("`weight_name` and `subfolder` must have the same length.")
+
+        # Load the main state dict first.
+        cache_dir = kwargs.pop("cache_dir", None)
+        force_download = kwargs.pop("force_download", False)
+        proxies = kwargs.pop("proxies", None)
+        local_files_only = kwargs.pop("local_files_only", None)
+        token = kwargs.pop("token", None)
+        revision = kwargs.pop("revision", None)
+        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
+
+        if low_cpu_mem_usage and not is_accelerate_available():
+            low_cpu_mem_usage = False
+            logger.warning(
+                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                " install accelerate\n```\n."
+            )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        user_agent = {
+            "file_type": "attn_procs_weights",
+            "framework": "pytorch",
+        }
+        state_dicts = []
+        for pretrained_model_name_or_path_or_dict, weight_name, subfolder in zip(
+            pretrained_model_name_or_path_or_dict, weight_name, subfolder
+        ):
+            if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+                model_file = _get_model_file(
+                    pretrained_model_name_or_path_or_dict,
+                    weights_name=weight_name,
+                    cache_dir=cache_dir,
+                    force_download=force_download,
+                    proxies=proxies,
+                    local_files_only=local_files_only,
+                    token=token,
+                    revision=revision,
+                    subfolder=subfolder,
+                    user_agent=user_agent,
+                )
+                if weight_name.endswith(".safetensors"):
+                    state_dict = {"image_proj": {}, "ip_adapter": {}}
+                    with safe_open(model_file, framework="pt", device="cpu") as f:
+                        image_proj_keys = ["ip_adapter_proj_model.", "image_proj."]
+                        ip_adapter_keys = ["double_blocks.", "ip_adapter."]
+                        for key in f.keys():
+                            if any(key.startswith(prefix) for prefix in image_proj_keys):
+                                diffusers_name = ".".join(key.split(".")[1:])
+                                state_dict["image_proj"][diffusers_name] = f.get_tensor(key)
+                            elif any(key.startswith(prefix) for prefix in ip_adapter_keys):
+                                diffusers_name = (
+                                    ".".join(key.split(".")[1:])
+                                    .replace("ip_adapter_double_stream_k_proj", "to_k_ip")
+                                    .replace("ip_adapter_double_stream_v_proj", "to_v_ip")
+                                    .replace("processor.", "")
+                                )
+                                state_dict["ip_adapter"][diffusers_name] = f.get_tensor(key)
+                else:
+                    state_dict = load_state_dict(model_file)
+            else:
+                state_dict = pretrained_model_name_or_path_or_dict
+
+            keys = list(state_dict.keys())
+            if keys != ["image_proj", "ip_adapter"]:
+                raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
+
+            state_dicts.append(state_dict)
+
+            # load CLIP image encoder here if it has not been registered to the pipeline yet
+            if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
+                if image_encoder_pretrained_model_name_or_path is not None:
+                    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+                        logger.info(f"loading image_encoder from {image_encoder_pretrained_model_name_or_path}")
+                        image_encoder = (
+                            CLIPVisionModelWithProjection.from_pretrained(
+                                image_encoder_pretrained_model_name_or_path,
+                                subfolder=image_encoder_subfolder,
+                                low_cpu_mem_usage=low_cpu_mem_usage,
+                                cache_dir=cache_dir,
+                                local_files_only=local_files_only,
+                                dtype=image_encoder_dtype,
+                            )
+                            .to(self.device)
+                            .eval()
+                        )
+                        self.register_modules(image_encoder=image_encoder)
+                    else:
+                        raise ValueError(
+                            "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
+                        )
+                else:
+                    logger.warning(
+                        "image_encoder is not loaded since `image_encoder_pretrained_model_name_or_path=None` was passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
+                        "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
+                    )
+
+            # create feature extractor if it has not been registered to the pipeline yet
+            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
+                # FaceID IP-Adapters don't need the image encoder, so it may not be present; in that case default to 224
+                default_clip_size = 224
+                clip_image_size = (
+                    self.image_encoder.config.image_size if self.image_encoder is not None else default_clip_size
+                )
+                feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
+                self.register_modules(feature_extractor=feature_extractor)
+
+        # load ip-adapter into transformer
+        self.transformer._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
+
+    def set_ip_adapter_scale(self, scale: Union[float, List[float], List[List[float]]]):
+        """
+        Set IP-Adapter scales per transformer block. The input `scale` can be a single config or a list of configs for
+        granular control over each IP-Adapter's behavior. A config can be a float or a list.
+
+        A `float` is converted to a list and repeated for the number of blocks and the number of IP-Adapters. A
+        `List[float]` is read as per-block scales when a single IP-Adapter is loaded (its length must match the number
+        of blocks); with several IP-Adapters it is read as one scale per IP-Adapter. A `List[List[float]]` must match
+        the number of IP-Adapters, and each inner list must match the number of blocks.
+
+        Example:
+
+        ```py
+        # To use original IP-Adapter
+        scale = 1.0
+        pipeline.set_ip_adapter_scale(scale)
+
+
+        def LinearStrengthModel(start, finish, size):
+            return [(start + (finish - start) * (i / (size - 1))) for i in range(size)]
+
+
+        ip_strengths = LinearStrengthModel(0.3, 0.92, 19)
+        pipeline.set_ip_adapter_scale(ip_strengths)
+        ```
+        """
+
+        scale_type = Union[int, float]
+        num_ip_adapters = self.transformer.encoder_hid_proj.num_ip_adapters
+        num_layers = self.transformer.config.num_layers
+
+        # Single value for all layers of all IP-Adapters
+        if isinstance(scale, scale_type):
+            scale = [scale for _ in range(num_ip_adapters)]
+        # List of per-layer scales for a single IP-Adapter
+        elif _is_valid_type(scale, List[scale_type]) and num_ip_adapters == 1:
+            scale = [scale]
+        # Invalid scale type
+        elif not _is_valid_type(scale, List[Union[scale_type, List[scale_type]]]):
+            raise TypeError(f"Unexpected type {_get_detailed_type(scale)} for scale.")
+
+        if len(scale) != num_ip_adapters:
+            raise ValueError(f"Cannot assign {len(scale)} scales to {num_ip_adapters} IP-Adapters.")
+
+        if any(len(s) != num_layers for s in scale if isinstance(s, list)):
+            invalid_scale_sizes = {len(s) for s in scale if isinstance(s, list)} - {num_layers}
+            raise ValueError(
+                f"Expected list of {num_layers} scales, got {', '.join(str(x) for x in invalid_scale_sizes)}."
+            )
+
+        # Scalars are transformed to lists with length num_layers
+        scale_configs = [[s] * num_layers if isinstance(s, scale_type) else s for s in scale]
+
+        # Set scales. zip over scale_configs prevents going into single transformer layers
+        for attn_processor, *scale in zip(self.transformer.attn_processors.values(), *scale_configs):
+            attn_processor.scale = scale
+
+    def unload_ip_adapter(self):
+        """
+        Unloads the IP Adapter weights.
+
+        Examples:
+
+        ```python
+        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
+        >>> pipeline.unload_ip_adapter()
+        >>> ...
+        ```
+        """
+        # remove CLIP image encoder
+        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
+            self.image_encoder = None
+            self.register_to_config(image_encoder=[None, None])
+
+        # remove feature extractor only when the pipeline has no safety_checker attribute,
+        # as safety_checker uses the feature_extractor later
+        if not hasattr(self, "safety_checker"):
+            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
+                self.feature_extractor = None
+                self.register_to_config(feature_extractor=[None, None])
+
+        # remove hidden encoder
+        self.transformer.encoder_hid_proj = None
+        self.transformer.config.encoder_hid_dim_type = None
+
+        # restore original Transformer attention processors layers
+        attn_procs = {}
+        for name, value in self.transformer.attn_processors.items():
+            attn_processor_class = FluxAttnProcessor2_0()
+            attn_procs[name] = (
+                attn_processor_class if isinstance(value, (FluxIPAdapterJointAttnProcessor2_0)) else value.__class__()
+            )
+        self.transformer.set_attn_processor(attn_procs)
+
+
+class SD3IPAdapterMixin:
+    """Mixin for handling StableDiffusion 3 IP Adapters."""
+
+    @property
+    def is_ip_adapter_active(self) -> bool:
+        """Checks if IP-Adapter is loaded and scale > 0.
+
+        IP-Adapter scale controls the influence of the image prompt versus text prompt. When this value is set to 0,
+        the image context is irrelevant.
+
+        Returns:
+            `bool`: True when IP-Adapter is loaded and any layer has scale > 0.
+        """
+        scales = [
+            attn_proc.scale
+            for attn_proc in self.transformer.attn_processors.values()
+            if isinstance(attn_proc, SD3IPAdapterJointAttnProcessor2_0)
+        ]
+
+        return len(scales) > 0 and any(scale > 0 for scale in scales)
+
+    @validate_hf_hub_args
+    def load_ip_adapter(
+        self,
+        pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]],
+        weight_name: str = "ip-adapter.safetensors",
+        subfolder: Optional[str] = None,
+        image_encoder_folder: Optional[str] = "image_encoder",
+        **kwargs,
+    ) -> None:
+        """
+        Parameters:
+            pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`):
+                Can be either:
+                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
+                      the Hub.
+                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
+                      with [`ModelMixin.save_pretrained`].
+                    - A [torch state
+                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
+            weight_name (`str`, defaults to "ip-adapter.safetensors"):
+                The name of the weight file to load. If a list is passed, it should have the same length as
+                `subfolder`.
+            subfolder (`str`, *optional*):
+                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
+                list is passed, it should have the same length as `weight_name`.
+            image_encoder_folder (`str`, *optional*, defaults to `image_encoder`):
+                The subfolder location of the image encoder within a larger model repository on the Hub or locally.
+                Pass `None` to not load the image encoder. If the image encoder is located in a folder inside
+                `subfolder`, you only need to pass the name of the folder that contains image encoder weights, e.g.
+                `image_encoder_folder="image_encoder"`. If the image encoder is located in a folder other than
+                `subfolder`, you should pass the path to the folder that contains image encoder weights, for example,
+                `image_encoder_folder="different_subfolder/image_encoder"`.
+            cache_dir (`Union[str, os.PathLike]`, *optional*):
+                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
+                is not used.
+            force_download (`bool`, *optional*, defaults to `False`):
+                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
+                cached versions if they exist.
+            proxies (`Dict[str, str]`, *optional*):
+                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
+            local_files_only (`bool`, *optional*, defaults to `False`):
+                Whether to only load local model weights and configuration files or not. If set to `True`, the model
+                won't be downloaded from the Hub.
+            token (`str` or *bool*, *optional*):
+                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
+                `diffusers-cli login` (stored in `~/.huggingface`) is used.
+            revision (`str`, *optional*, defaults to `"main"`):
+                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
+                allowed by Git.
+            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
+                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
+                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
+                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
+                argument to `True` will raise an error.
+        """
+        # Load the main state dict first
+        cache_dir = kwargs.pop("cache_dir", None)
+        force_download = kwargs.pop("force_download", False)
+        proxies = kwargs.pop("proxies", None)
+        local_files_only = kwargs.pop("local_files_only", None)
+        token = kwargs.pop("token", None)
+        revision = kwargs.pop("revision", None)
+        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
+
+        if low_cpu_mem_usage and not is_accelerate_available():
+            low_cpu_mem_usage = False
+            logger.warning(
+                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                " install accelerate\n```\n."
+            )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        user_agent = {
+            "file_type": "attn_procs_weights",
+            "framework": "pytorch",
+        }
+
+        if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+            model_file = _get_model_file(
+                pretrained_model_name_or_path_or_dict,
+                weights_name=weight_name,
+                cache_dir=cache_dir,
+                force_download=force_download,
+                proxies=proxies,
+                local_files_only=local_files_only,
+                token=token,
+                revision=revision,
+                subfolder=subfolder,
+                user_agent=user_agent,
+            )
+            if weight_name.endswith(".safetensors"):
+                state_dict = {"image_proj": {}, "ip_adapter": {}}
+                with safe_open(model_file, framework="pt", device="cpu") as f:
+                    for key in f.keys():
+                        if key.startswith("image_proj."):
+                            state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
+                        elif key.startswith("ip_adapter."):
+                            state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
+            else:
+                state_dict = load_state_dict(model_file)
+        else:
+            state_dict = pretrained_model_name_or_path_or_dict
+
+        keys = list(state_dict.keys())
+        if "image_proj" not in keys or "ip_adapter" not in keys:
+            raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
+
+        # Load image_encoder and feature_extractor here if they haven't been registered to the pipeline yet
+        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
+            if image_encoder_folder is not None:
+                if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+                    logger.info(f"loading image_encoder from {pretrained_model_name_or_path_or_dict}")
+                    if image_encoder_folder.count("/") == 0:
+                        image_encoder_subfolder = Path(subfolder, image_encoder_folder).as_posix()
+                    else:
+                        image_encoder_subfolder = Path(image_encoder_folder).as_posix()
+
+                    # Common args for loading image encoder and image processor
+                    kwargs = {
+                        "low_cpu_mem_usage": low_cpu_mem_usage,
+                        "cache_dir": cache_dir,
+                        "local_files_only": local_files_only,
+                    }
+
+                    self.register_modules(
+                        feature_extractor=SiglipImageProcessor.from_pretrained(image_encoder_subfolder, **kwargs),
+                        image_encoder=SiglipVisionModel.from_pretrained(
+                            image_encoder_subfolder, torch_dtype=self.dtype, **kwargs
+                        ).to(self.device),
+                    )
+                else:
+                    raise ValueError(
+                        "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
+                    )
+            else:
+                logger.warning(
+                    "image_encoder is not loaded since `image_encoder_folder=None` was passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
+                    "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
+                )
+
+        # Load IP-Adapter into transformer
+        self.transformer._load_ip_adapter_weights(state_dict, low_cpu_mem_usage=low_cpu_mem_usage)
+
+    def set_ip_adapter_scale(self, scale: float) -> None:
+        """
+        Set IP-Adapter scale, which controls image prompt conditioning. A value of 1.0 means the model is only
+        conditioned on the image prompt, and 0.0 only conditioned by the text prompt. Lowering this value encourages
+        the model to produce more diverse images, but they may not be as aligned with the image prompt.
+
+        Example:
+
+        ```python
+        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
+        >>> pipeline.set_ip_adapter_scale(0.6)
+        >>> ...
+        ```
+
+        Args:
+            scale (float):
+                IP-Adapter scale to be set.
+
+        """
+        for attn_processor in self.transformer.attn_processors.values():
+            if isinstance(attn_processor, SD3IPAdapterJointAttnProcessor2_0):
+                attn_processor.scale = scale
+
+    def unload_ip_adapter(self) -> None:
+        """
+        Unloads the IP Adapter weights.
+
+        Example:
+
+        ```python
+        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
+        >>> pipeline.unload_ip_adapter()
+        >>> ...
+        ```
+        """
+        # Remove image encoder
+        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
+            self.image_encoder = None
+            self.register_to_config(image_encoder=None)
+
+        # Remove feature extractor
+        if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
+            self.feature_extractor = None
+            self.register_to_config(feature_extractor=None)
+
+        # Remove image projection
+        self.transformer.image_proj = None
+
+        # Restore original attention processors layers
+        attn_procs = {
+            name: (
+                JointAttnProcessor2_0() if isinstance(value, SD3IPAdapterJointAttnProcessor2_0) else value.__class__()
+            )
+            for name, value in self.transformer.attn_processors.items()
+        }
+        self.transformer.set_attn_processor(attn_procs)
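Note on the Flux `set_ip_adapter_scale` in the file above: the way a scalar, a flat list, or a nested list ends up as one list of scales per transformer block is easy to misread, so here is a minimal standalone sketch of that normalization. The function name `expand_ip_adapter_scales` and its explicit `num_ip_adapters` / `num_layers` arguments are hypothetical stand-ins for the values the method reads from `self.transformer`; the type checks are simplified and are not the `_is_valid_type` helper used in the diff.

```python
from typing import List, Union

Number = Union[int, float]


def expand_ip_adapter_scales(scale, num_ip_adapters: int, num_layers: int) -> List[List[Number]]:
    """Return one list per transformer block, each holding one scale per IP-Adapter.

    Hypothetical helper for illustration only; it mirrors the branching in the diff
    but is not part of diffusers.
    """
    if isinstance(scale, (int, float)):
        # A single number applies to every block of every adapter.
        scale = [scale] * num_ip_adapters
    elif num_ip_adapters == 1 and all(isinstance(s, (int, float)) for s in scale):
        # With one adapter, a flat list is read as per-block scales.
        scale = [scale]

    if len(scale) != num_ip_adapters:
        raise ValueError(f"Cannot assign {len(scale)} scales to {num_ip_adapters} IP-Adapters.")

    # Scalars become lists of length num_layers; lists are kept as given.
    per_adapter = [[s] * num_layers if isinstance(s, (int, float)) else list(s) for s in scale]
    if any(len(s) != num_layers for s in per_adapter):
        raise ValueError(f"Expected list of {num_layers} scales per IP-Adapter.")

    # Transpose: block i receives [adapter_0[i], adapter_1[i], ...].
    return [list(per_block) for per_block in zip(*per_adapter)]
```

The final transpose is the step that corresponds to `zip(self.transformer.attn_processors.values(), *scale_configs)` in the diff: each attention processor is handed a list with one entry per loaded IP-Adapter.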
@@ -0,0 +1,168 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from contextlib import nullcontext
+
+from ...models.embeddings import ImageProjection, MultiIPAdapterImageProjection
+from ...models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
+from ...utils import is_accelerate_available, is_torch_version, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class FluxTransformer2DLoadersMixin:
+    """
+    Load layers into a [`FluxTransformer2DModel`].
+    """
+
+    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
+        if low_cpu_mem_usage:
+            if is_accelerate_available():
+                from accelerate import init_empty_weights
+
+            else:
+                low_cpu_mem_usage = False
+                logger.warning(
+                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                    " install accelerate\n```\n."
+                )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        updated_state_dict = {}
+        image_projection = None
+        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
+
+        if "proj.weight" in state_dict:
+            # IP-Adapter
+            num_image_text_embeds = 4
+            if state_dict["proj.weight"].shape[0] == 65536:
+                num_image_text_embeds = 16
+            clip_embeddings_dim = state_dict["proj.weight"].shape[-1]
+            cross_attention_dim = state_dict["proj.weight"].shape[0] // num_image_text_embeds
+
+            with init_context():
+                image_projection = ImageProjection(
+                    cross_attention_dim=cross_attention_dim,
+                    image_embed_dim=clip_embeddings_dim,
+                    num_image_text_embeds=num_image_text_embeds,
+                )
+
+            for key, value in state_dict.items():
+                diffusers_name = key.replace("proj", "image_embeds")
+                updated_state_dict[diffusers_name] = value
+
+        if not low_cpu_mem_usage:
+            image_projection.load_state_dict(updated_state_dict, strict=True)
+        else:
+            device_map = {"": self.device}
+            load_model_dict_into_meta(image_projection, updated_state_dict, device_map=device_map, dtype=self.dtype)
+
+        return image_projection
+
+    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
+        from ...models.attention_processor import FluxIPAdapterJointAttnProcessor2_0
+
+        if low_cpu_mem_usage:
+            if is_accelerate_available():
+                from accelerate import init_empty_weights
+
+            else:
+                low_cpu_mem_usage = False
+                logger.warning(
+                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                    " install accelerate\n```\n."
+                )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        # set ip-adapter cross-attention processors & load state_dict
+        attn_procs = {}
+        key_id = 0
+        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
+        for name in self.attn_processors.keys():
+            if name.startswith("single_transformer_blocks"):
+                attn_processor_class = self.attn_processors[name].__class__
+                attn_procs[name] = attn_processor_class()
+            else:
+                cross_attention_dim = self.config.joint_attention_dim
+                hidden_size = self.inner_dim
+                attn_processor_class = FluxIPAdapterJointAttnProcessor2_0
+                num_image_text_embeds = []
+                for state_dict in state_dicts:
+                    if "proj.weight" in state_dict["image_proj"]:
+                        num_image_text_embed = 4
+                        if state_dict["image_proj"]["proj.weight"].shape[0] == 65536:
+                            num_image_text_embed = 16
+                        # IP-Adapter
+                        num_image_text_embeds += [num_image_text_embed]
+
+                with init_context():
+                    attn_procs[name] = attn_processor_class(
+                        hidden_size=hidden_size,
+                        cross_attention_dim=cross_attention_dim,
+                        scale=1.0,
+                        num_tokens=num_image_text_embeds,
+                        dtype=self.dtype,
+                        device=self.device,
+                    )
+
+                value_dict = {}
+                for i, state_dict in enumerate(state_dicts):
+                    value_dict.update({f"to_k_ip.{i}.weight": state_dict["ip_adapter"][f"{key_id}.to_k_ip.weight"]})
+                    value_dict.update({f"to_v_ip.{i}.weight": state_dict["ip_adapter"][f"{key_id}.to_v_ip.weight"]})
+                    value_dict.update({f"to_k_ip.{i}.bias": state_dict["ip_adapter"][f"{key_id}.to_k_ip.bias"]})
+                    value_dict.update({f"to_v_ip.{i}.bias": state_dict["ip_adapter"][f"{key_id}.to_v_ip.bias"]})
+
+                if not low_cpu_mem_usage:
+                    attn_procs[name].load_state_dict(value_dict)
+                else:
+                    device_map = {"": self.device}
+                    dtype = self.dtype
+                    load_model_dict_into_meta(attn_procs[name], value_dict, device_map=device_map, dtype=dtype)
+
+                key_id += 1
+
+        return attn_procs
+
+    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
+        if not isinstance(state_dicts, list):
+            state_dicts = [state_dicts]
+
+        self.encoder_hid_proj = None
+
+        attn_procs = self._convert_ip_adapter_attn_to_diffusers(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
+        self.set_attn_processor(attn_procs)
+
+        image_projection_layers = []
+        for state_dict in state_dicts:
+            image_projection_layer = self._convert_ip_adapter_image_proj_to_diffusers(
+                state_dict["image_proj"], low_cpu_mem_usage=low_cpu_mem_usage
+            )
+            image_projection_layers.append(image_projection_layer)
+
+        self.encoder_hid_proj = MultiIPAdapterImageProjection(image_projection_layers)
+        self.config.encoder_hid_dim_type = "ip_image_proj"
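Note on `_convert_ip_adapter_image_proj_to_diffusers` above: the projection sizes are inferred purely from the shape of `proj.weight`. A minimal sketch of that arithmetic follows, using plain tuples instead of tensors. The function name `infer_image_proj_dims` and the example shapes in the usage note are assumptions for illustration, not values taken from a real checkpoint.

```python
from typing import Tuple


def infer_image_proj_dims(proj_weight_shape: Tuple[int, int]) -> Tuple[int, int, int]:
    """Return (num_image_text_embeds, image_embed_dim, cross_attention_dim).

    Hypothetical helper; `proj_weight_shape` is the (out_features, in_features)
    shape of the `proj.weight` entry in the `image_proj` state dict.
    """
    out_features, in_features = proj_weight_shape
    # The diff treats 65536 output features as the 16-token variant, anything else as 4 tokens.
    num_image_text_embeds = 16 if out_features == 65536 else 4
    # The projection emits all image tokens at once, so divide to get the per-token width.
    cross_attention_dim = out_features // num_image_text_embeds
    return num_image_text_embeds, in_features, cross_attention_dim
```

For example, an assumed shape of `(16384, 768)` would be read as 4 image tokens of width 4096 projected from a 768-dimensional image embedding, while `(65536, 768)` would be read as 16 tokens of the same width.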
@@ -0,0 +1,170 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from contextlib import nullcontext
+from typing import Dict
+
+from ...models.attention_processor import SD3IPAdapterJointAttnProcessor2_0
+from ...models.embeddings import IPAdapterTimeImageProjection
+from ...models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
+from ...utils import is_accelerate_available, is_torch_version, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class SD3Transformer2DLoadersMixin:
+    """Load IP-Adapters and LoRA layers into a [`SD3Transformer2DModel`]."""
+
+    def _convert_ip_adapter_attn_to_diffusers(
+        self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT
+    ) -> Dict:
+        if low_cpu_mem_usage:
+            if is_accelerate_available():
+                from accelerate import init_empty_weights
+
+            else:
+                low_cpu_mem_usage = False
+                logger.warning(
+                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                    " install accelerate\n```\n."
+                )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        # IP-Adapter cross attention parameters
+        hidden_size = self.config.attention_head_dim * self.config.num_attention_heads
+        ip_hidden_states_dim = self.config.attention_head_dim * self.config.num_attention_heads
+        timesteps_emb_dim = state_dict["0.norm_ip.linear.weight"].shape[1]
+
+        # Dict where key is transformer layer index, value is attention processor's state dict
+        # ip_adapter state dict keys example: "0.norm_ip.linear.weight"
+        layer_state_dict = {idx: {} for idx in range(len(self.attn_processors))}
+        for key, weights in state_dict.items():
+            idx, name = key.split(".", maxsplit=1)
+            layer_state_dict[int(idx)][name] = weights
+
+        # Create IP-Adapter attention processor & load state_dict
+        attn_procs = {}
+        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
+        for idx, name in enumerate(self.attn_processors.keys()):
+            with init_context():
+                attn_procs[name] = SD3IPAdapterJointAttnProcessor2_0(
+                    hidden_size=hidden_size,
+                    ip_hidden_states_dim=ip_hidden_states_dim,
+                    head_dim=self.config.attention_head_dim,
+                    timesteps_emb_dim=timesteps_emb_dim,
+                )
+
+            if not low_cpu_mem_usage:
+                attn_procs[name].load_state_dict(layer_state_dict[idx], strict=True)
+            else:
+                device_map = {"": self.device}
+                load_model_dict_into_meta(
+                    attn_procs[name], layer_state_dict[idx], device_map=device_map, dtype=self.dtype
+                )
+
+        return attn_procs
+
+    def _convert_ip_adapter_image_proj_to_diffusers(
+        self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT
+    ) -> IPAdapterTimeImageProjection:
+        if low_cpu_mem_usage:
+            if is_accelerate_available():
+                from accelerate import init_empty_weights
+
+            else:
+                low_cpu_mem_usage = False
+                logger.warning(
+                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
+                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
+                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
+                    " install accelerate\n```\n."
+                )
+
+        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
+            raise NotImplementedError(
+                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
+                " `low_cpu_mem_usage=False`."
+            )
+
+        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
+
+        # Convert to diffusers
+        updated_state_dict = {}
+        for key, value in state_dict.items():
+            # InstantX/SD3.5-Large-IP-Adapter
+            if key.startswith("layers."):
+                idx = key.split(".")[1]
+                key = key.replace(f"layers.{idx}.0.norm1", f"layers.{idx}.ln0")
+                key = key.replace(f"layers.{idx}.0.norm2", f"layers.{idx}.ln1")
+                key = key.replace(f"layers.{idx}.0.to_q", f"layers.{idx}.attn.to_q")
+                key = key.replace(f"layers.{idx}.0.to_kv", f"layers.{idx}.attn.to_kv")
+                key = key.replace(f"layers.{idx}.0.to_out", f"layers.{idx}.attn.to_out.0")
+                key = key.replace(f"layers.{idx}.1.0", f"layers.{idx}.adaln_norm")
+                key = key.replace(f"layers.{idx}.1.1", f"layers.{idx}.ff.net.0.proj")
+                key = key.replace(f"layers.{idx}.1.3", f"layers.{idx}.ff.net.2")
+                key = key.replace(f"layers.{idx}.2.1", f"layers.{idx}.adaln_proj")
+            updated_state_dict[key] = value
+
+        # Image projection parameters
+        embed_dim = updated_state_dict["proj_in.weight"].shape[1]
+        output_dim = updated_state_dict["proj_out.weight"].shape[0]
+        hidden_dim = updated_state_dict["proj_in.weight"].shape[0]
+        heads = updated_state_dict["layers.0.attn.to_q.weight"].shape[0] // 64
+        num_queries = updated_state_dict["latents"].shape[1]
+        timestep_in_dim = updated_state_dict["time_embedding.linear_1.weight"].shape[1]
+
+        # Image projection
+        with init_context():
+            image_proj = IPAdapterTimeImageProjection(
+                embed_dim=embed_dim,
+                output_dim=output_dim,
+                hidden_dim=hidden_dim,
+                heads=heads,
+                num_queries=num_queries,
+                timestep_in_dim=timestep_in_dim,
+            )
+
+        if not low_cpu_mem_usage:
+            image_proj.load_state_dict(updated_state_dict, strict=True)
+        else:
+            device_map = {"": self.device}
+            load_model_dict_into_meta(image_proj, updated_state_dict, device_map=device_map, dtype=self.dtype)
+
+        return image_proj
+
+    def _load_ip_adapter_weights(self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT) -> None:
+        """Sets IP-Adapter attention processors, image projection, and loads state_dict.
+
+        Args:
+            state_dict (`Dict`):
+                State dict with keys "ip_adapter", which contains parameters for attention processors, and
+                "image_proj", which contains parameters for image projection net.
+            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
+                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
+                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
+                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
+                argument to `True` will raise an error.
+        """
+
+        attn_procs = self._convert_ip_adapter_attn_to_diffusers(state_dict["ip_adapter"], low_cpu_mem_usage)
+        self.set_attn_processor(attn_procs)
+
+        self.image_proj = self._convert_ip_adapter_image_proj_to_diffusers(state_dict["image_proj"], low_cpu_mem_usage)
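
The key conversion in `_convert_ip_adapter_image_proj_to_diffusers` above can be tried on plain strings, without tensors. This is a minimal sketch: `remap_key` and `RENAMES` are hypothetical names introduced only for illustration, and the rename table is copied from the `key.replace(...)` chain in the diff.

```python
# Hypothetical standalone helper; the (old, new) pairs mirror the
# `key.replace(...)` calls in the diff above, applied in the same order.
RENAMES = [
    ("0.norm1", "ln0"),
    ("0.norm2", "ln1"),
    ("0.to_q", "attn.to_q"),
    ("0.to_kv", "attn.to_kv"),
    ("0.to_out", "attn.to_out.0"),
    ("1.0", "adaln_norm"),
    ("1.1", "ff.net.0.proj"),
    ("1.3", "ff.net.2"),
    ("2.1", "adaln_proj"),
]


def remap_key(key: str) -> str:
    """Rename one checkpoint key; keys outside `layers.` pass through unchanged."""
    if not key.startswith("layers."):
        return key
    idx = key.split(".")[1]  # layer index, kept as a string
    for old, new in RENAMES:
        key = key.replace(f"layers.{idx}.{old}", f"layers.{idx}.{new}")
    return key


for original in ("layers.3.0.to_q.weight", "layers.1.1.1.weight", "proj_in.weight"):
    print(original, "->", remap_key(original))
```

Only keys under `layers.` are rewritten; entries such as `proj_in.weight` are copied as they are, which is why the shape lookups that follow in the diff can read them directly.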
@@ -0,0 +1,25 @@
+from ...utils import is_peft_available, is_torch_available, is_transformers_available
+
+
+if is_torch_available():
+    from .lora_base import LoraBaseMixin
+
+    if is_transformers_available():
+        from .lora_pipeline import (
+            AmusedLoraLoaderMixin,
+            AuraFlowLoraLoaderMixin,
+            CogVideoXLoraLoaderMixin,
+            CogView4LoraLoaderMixin,
+            FluxLoraLoaderMixin,
+            HiDreamImageLoraLoaderMixin,
+            HunyuanVideoLoraLoaderMixin,
+            LoraLoaderMixin,
+            LTXVideoLoraLoaderMixin,
+            Lumina2LoraLoaderMixin,
+            Mochi1LoraLoaderMixin,
+            SanaLoraLoaderMixin,
+            SD3LoraLoaderMixin,
+            StableDiffusionLoraLoaderMixin,
+            StableDiffusionXLLoraLoaderMixin,
+            WanLoraLoaderMixin,
+        )
@@ -0,0 +1,935 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import copy
+import inspect
+import os
+from pathlib import Path
+from typing import Callable, Dict, List, Optional, Union
+
+import safetensors
+import torch
+import torch.nn as nn
+from huggingface_hub import model_info
+from huggingface_hub.constants import HF_HUB_OFFLINE
+
+from ...models.modeling_utils import ModelMixin, load_state_dict
+from ...utils import (
+    USE_PEFT_BACKEND,
+    _get_model_file,
+    convert_state_dict_to_diffusers,
+    convert_state_dict_to_peft,
+    delete_adapter_layers,
+    deprecate,
+    get_adapter_name,
+    get_peft_kwargs,
+    is_accelerate_available,
+    is_peft_available,
+    is_peft_version,
+    is_transformers_available,
+    is_transformers_version,
+    logging,
+    recurse_remove_peft_layers,
+    scale_lora_layers,
+    set_adapter_layers,
+    set_weights_and_activate_adapters,
+)
+
+
+if is_transformers_available():
+    from transformers import PreTrainedModel
+
+    from ...models.lora import text_encoder_attn_modules, text_encoder_mlp_modules
+
+if is_peft_available():
+    from peft.tuners.tuners_utils import BaseTunerLayer
+
+if is_accelerate_available():
+    from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module
+
+logger = logging.get_logger(__name__)
+
+LORA_WEIGHT_NAME = "pytorch_lora_weights.bin"
+LORA_WEIGHT_NAME_SAFE = "pytorch_lora_weights.safetensors"
+
+
+def fuse_text_encoder_lora(text_encoder, lora_scale=1.0, safe_fusing=False, adapter_names=None):
+    """
+    Fuses LoRAs for the text encoder.
+
+    Args:
+        text_encoder (`torch.nn.Module`):
+            The text encoder module whose LoRA layers are fused into the base weights.
+        lora_scale (`float`, defaults to 1.0):
+            Controls how much to influence the outputs with the LoRA parameters.
+        safe_fusing (`bool`, defaults to `False`):
+            Whether to check the fused weights for NaN values before fusing, and to not fuse them if NaNs are found.
+        adapter_names (`List[str]` or `str`):
+            The names of the adapters to use.
+    """
+    merge_kwargs = {"safe_merge": safe_fusing}
+
+    for module in text_encoder.modules():
+        if isinstance(module, BaseTunerLayer):
+            if lora_scale != 1.0:
+                module.scale_layer(lora_scale)
+
+            # For BC with previous PEFT versions, we need to check the signature
+            # of the `merge` method to see if it supports the `adapter_names` argument.
+            supported_merge_kwargs = list(inspect.signature(module.merge).parameters)
+            if "adapter_names" in supported_merge_kwargs:
+                merge_kwargs["adapter_names"] = adapter_names
+            elif "adapter_names" not in supported_merge_kwargs and adapter_names is not None:
+                raise ValueError(
+                    "The `adapter_names` argument is not supported with your PEFT version. "
+                    "Please upgrade to the latest version of PEFT. `pip install -U peft`"
+                )
+
+            module.merge(**merge_kwargs)
+
+
+def unfuse_text_encoder_lora(text_encoder):
+    """
+    Unfuses LoRAs for the text encoder.
+
+    Args:
+        text_encoder (`torch.nn.Module`):
+            The text encoder module whose LoRA layers are unfused from the base weights.
+    """
+    for module in text_encoder.modules():
+        if isinstance(module, BaseTunerLayer):
+            module.unmerge()
+
+
+def set_adapters_for_text_encoder(
+    adapter_names: Union[List[str], str],
+    text_encoder: Optional["PreTrainedModel"] = None,  # noqa: F821
+    text_encoder_weights: Optional[Union[float, List[float], List[None]]] = None,
+):
+    """
+    Sets the adapter layers for the text encoder.
+
+    Args:
+        adapter_names (`List[str]` or `str`):
+            The names of the adapters to use.
+        text_encoder (`torch.nn.Module`, *optional*):
+            The text encoder module to set the adapter layers for. If `None`, it will try to get the `text_encoder`
+            attribute.
+        text_encoder_weights (`List[float]`, *optional*):
+            The weights to use for the text encoder. If `None`, the weights are set to `1.0` for all the adapters.
+    """
+    if text_encoder is None:
+        raise ValueError(
+            "The pipeline does not have a default `pipe.text_encoder` class. Please make sure to pass a `text_encoder` instead."
+        )
+
+    def process_weights(adapter_names, weights):
+        # Expand weights into a list, one entry per adapter
+        # e.g. for 2 adapters:  7 -> [7,7] ; [3, None] -> [3, None]
+        if not isinstance(weights, list):
+            weights = [weights] * len(adapter_names)
+
+        if len(adapter_names) != len(weights):
+            raise ValueError(
+                f"Length of adapter names {len(adapter_names)} is not equal to the length of the weights {len(weights)}"
+            )
+
+        # Set None values to default of 1.0
+        # e.g. [7,7] -> [7,7] ; [3, None] -> [3,1]
+        weights = [w if w is not None else 1.0 for w in weights]
+
+        return weights
+
+    adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names
+    text_encoder_weights = process_weights(adapter_names, text_encoder_weights)
+    set_weights_and_activate_adapters(text_encoder, adapter_names, text_encoder_weights)
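
The weight handling in `process_weights` above can be sketched on its own. This is an illustrative copy under the assumption that weights are plain floats or `None`; `expand_weights` is a hypothetical name, not part of the diff.

```python
def expand_weights(adapter_names, weights):
    """Return one weight per adapter, replacing None with the default of 1.0."""
    # A scalar (or None) is broadcast to every adapter.
    if not isinstance(weights, list):
        weights = [weights] * len(adapter_names)

    if len(adapter_names) != len(weights):
        raise ValueError(
            f"Length of adapter names {len(adapter_names)} is not equal to "
            f"the length of the weights {len(weights)}"
        )

    # None means "use the default scale".
    return [1.0 if w is None else w for w in weights]


print(expand_weights(["a", "b"], 7))          # scalar broadcast
print(expand_weights(["a", "b"], [3, None]))  # None filled with 1.0
```

A scalar is repeated for every adapter, a list must match the adapter count exactly, and `None` entries fall back to `1.0`.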
+
+
+def disable_lora_for_text_encoder(text_encoder: Optional["PreTrainedModel"] = None):
+    """
+    Disables the LoRA layers for the text encoder.
+
+    Args:
+        text_encoder (`torch.nn.Module`, *optional*):
+            The text encoder module to disable the LoRA layers for. If `None`, it will try to get the `text_encoder`
+            attribute.
+    """
+    if text_encoder is None:
+        raise ValueError("Text Encoder not found.")
+    set_adapter_layers(text_encoder, enabled=False)
+
+
+def enable_lora_for_text_encoder(text_encoder: Optional["PreTrainedModel"] = None):
+    """
+    Enables the LoRA layers for the text encoder.
+
+    Args:
+        text_encoder (`torch.nn.Module`, *optional*):
+            The text encoder module to enable the LoRA layers for. If `None`, it will try to get the `text_encoder`
+            attribute.
+    """
+    if text_encoder is None:
+        raise ValueError("Text Encoder not found.")
+    set_adapter_layers(text_encoder, enabled=True)
+
+
+def _remove_text_encoder_monkey_patch(text_encoder):
+    recurse_remove_peft_layers(text_encoder)
+    if getattr(text_encoder, "peft_config", None) is not None:
+        del text_encoder.peft_config
+        text_encoder._hf_peft_config_loaded = None
+
+
+def _fetch_state_dict(
+    pretrained_model_name_or_path_or_dict,
+    weight_name,
+    use_safetensors,
+    local_files_only,
+    cache_dir,
+    force_download,
+    proxies,
+    token,
+    revision,
+    subfolder,
+    user_agent,
+    allow_pickle,
+):
+    model_file = None
+    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
+        # Let's first try to load .safetensors weights
+        if (use_safetensors and weight_name is None) or (
+            weight_name is not None and weight_name.endswith(".safetensors")
+        ):
+            try:
+                # Here we're relaxing the loading check to enable more Inference API
+                # friendliness where sometimes, it's not at all possible to automatically
+                # determine `weight_name`.
+                if weight_name is None:
+                    weight_name = _best_guess_weight_name(
+                        pretrained_model_name_or_path_or_dict,
+                        file_extension=".safetensors",
+                        local_files_only=local_files_only,
+                    )
+                model_file = _get_model_file(
+                    pretrained_model_name_or_path_or_dict,
+                    weights_name=weight_name or LORA_WEIGHT_NAME_SAFE,
+                    cache_dir=cache_dir,
+                    force_download=force_download,
+                    proxies=proxies,
+                    local_files_only=local_files_only,
+                    token=token,
+                    revision=revision,
+                    subfolder=subfolder,
+                    user_agent=user_agent,
+                )
+                state_dict = safetensors.torch.load_file(model_file, device="cpu")
+            except (IOError, safetensors.SafetensorError) as e:
+                if not allow_pickle:
+                    raise e
+                # Fall back to loading non-safetensors weights.
+                model_file = None
+
+        if model_file is None:
+            if weight_name is None:
+                weight_name = _best_guess_weight_name(
+                    pretrained_model_name_or_path_or_dict, file_extension=".bin", local_files_only=local_files_only
+                )
+            model_file = _get_model_file(
+                pretrained_model_name_or_path_or_dict,
+                weights_name=weight_name or LORA_WEIGHT_NAME,
+                cache_dir=cache_dir,
+                force_download=force_download,
+                proxies=proxies,
+                local_files_only=local_files_only,
+                token=token,
+                revision=revision,
+                subfolder=subfolder,
+                user_agent=user_agent,
+            )
+            state_dict = load_state_dict(model_file)
+    else:
+        state_dict = pretrained_model_name_or_path_or_dict
+
+    return state_dict
+
+
+def _best_guess_weight_name(
+    pretrained_model_name_or_path_or_dict, file_extension=".safetensors", local_files_only=False
+):
+    if local_files_only or HF_HUB_OFFLINE:
+        raise ValueError("When using the offline mode, you must specify a `weight_name`.")
+
+    targeted_files = []
+
+    if os.path.isfile(pretrained_model_name_or_path_or_dict):
+        return
+    elif os.path.isdir(pretrained_model_name_or_path_or_dict):
+        targeted_files = [f for f in os.listdir(pretrained_model_name_or_path_or_dict) if f.endswith(file_extension)]
+    else:
+        files_in_repo = model_info(pretrained_model_name_or_path_or_dict).siblings
+        targeted_files = [f.rfilename for f in files_in_repo if f.rfilename.endswith(file_extension)]
+    if len(targeted_files) == 0:
+        return
+
+    # Files with "scheduler" or "optimizer" in their names are not LoRA checkpoints.
+    # Only top-level checkpoints are considered, so names containing "checkpoint"
+    # (intermediate training checkpoints) are excluded as well.
+    unallowed_substrings = {"scheduler", "optimizer", "checkpoint"}
+    targeted_files = list(
+        filter(lambda x: all(substring not in x for substring in unallowed_substrings), targeted_files)
+    )
+
+    if any(f.endswith(LORA_WEIGHT_NAME) for f in targeted_files):
+        targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME), targeted_files))
+    elif any(f.endswith(LORA_WEIGHT_NAME_SAFE) for f in targeted_files):
+        targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME_SAFE), targeted_files))
+
+    if len(targeted_files) > 1:
+        raise ValueError(
+            f"Provided path contains more than one weights file in the {file_extension} format. Either specify `weight_name` in `load_lora_weights` or make sure there's only one `.safetensors` or `.bin` file in {pretrained_model_name_or_path_or_dict}."
+        )
+    weight_name = targeted_files[0]
+    return weight_name
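
The file selection in `_best_guess_weight_name` above can be sketched over a plain list of file names, with no Hub or filesystem access. `pick_weight_file` is a hypothetical name. Unlike the diff, this sketch also returns `None` when filtering leaves no candidates, which is a choice made here for the example.

```python
LORA_WEIGHT_NAME = "pytorch_lora_weights.bin"
LORA_WEIGHT_NAME_SAFE = "pytorch_lora_weights.safetensors"
UNALLOWED_SUBSTRINGS = ("scheduler", "optimizer", "checkpoint")


def pick_weight_file(filenames, file_extension=".safetensors"):
    """Pick the single LoRA weight file from `filenames`, or None if there is none."""
    candidates = [f for f in filenames if f.endswith(file_extension)]
    # Drop training-state files and intermediate checkpoints.
    candidates = [f for f in candidates if all(s not in f for s in UNALLOWED_SUBSTRINGS)]
    if not candidates:
        return None  # assumption of this sketch, see lead-in

    # Prefer the default file names when they are present.
    for default_name in (LORA_WEIGHT_NAME, LORA_WEIGHT_NAME_SAFE):
        preferred = [f for f in candidates if f.endswith(default_name)]
        if preferred:
            candidates = preferred
            break

    if len(candidates) > 1:
        raise ValueError(f"More than one {file_extension} weights file found: {candidates}")
    return candidates[0]


print(pick_weight_file(["pixel-art-xl.safetensors", "scheduler.safetensors"]))
```

If several files remain after filtering and none carries a default name, the caller has to pass `weight_name` explicitly, matching the error raised in the diff.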
+
+
+def _load_lora_into_text_encoder(
+    state_dict,
+    network_alphas,
+    text_encoder,
+    prefix=None,
+    lora_scale=1.0,
+    text_encoder_name="text_encoder",
+    adapter_name=None,
+    _pipeline=None,
+    low_cpu_mem_usage=False,
+    hotswap: bool = False,
+):
+    if not USE_PEFT_BACKEND:
+        raise ValueError("PEFT backend is required for this method.")
+
+    peft_kwargs = {}
+    if low_cpu_mem_usage:
+        if not is_peft_version(">=", "0.13.1"):
+            raise ValueError(
+                "`low_cpu_mem_usage=True` is not compatible with this `peft` version. Please update it with `pip install -U peft`."
+            )
+        if not is_transformers_version(">", "4.45.2"):
+            # Note from sayakpaul: It's not in `transformers` stable yet.
+            # https://github.com/huggingface/transformers/pull/33725/
+            raise ValueError(
+                "`low_cpu_mem_usage=True` is not compatible with this `transformers` version. Please update it with `pip install -U transformers`."
+            )
+        peft_kwargs["low_cpu_mem_usage"] = low_cpu_mem_usage
+
+    from peft import LoraConfig
+
+    # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918),
+    # then the `state_dict` keys should have `unet_name` and/or `text_encoder_name` as
+    # their prefixes.
+    prefix = text_encoder_name if prefix is None else prefix
+
+    # Safe prefix to check with.
+    if hotswap and any(text_encoder_name in key for key in state_dict.keys()):
+        raise ValueError("At the moment, hotswapping is not supported for text encoders, please pass `hotswap=False`.")
+
+    # Load the layers corresponding to text encoder and make necessary adjustments.
+    if prefix is not None:
+        state_dict = {k[len(f"{prefix}.") :]: v for k, v in state_dict.items() if k.startswith(f"{prefix}.")}
+
+    if len(state_dict) > 0:
+        logger.info(f"Loading {prefix}.")
+        rank = {}
+        state_dict = convert_state_dict_to_diffusers(state_dict)
+
+        # convert state dict
+        state_dict = convert_state_dict_to_peft(state_dict)
+
+        for name, _ in text_encoder_attn_modules(text_encoder):
+            for module in ("out_proj", "q_proj", "k_proj", "v_proj"):
+                rank_key = f"{name}.{module}.lora_B.weight"
+                if rank_key not in state_dict:
+                    continue
+                rank[rank_key] = state_dict[rank_key].shape[1]
+
+        for name, _ in text_encoder_mlp_modules(text_encoder):
+            for module in ("fc1", "fc2"):
+                rank_key = f"{name}.{module}.lora_B.weight"
+                if rank_key not in state_dict:
+                    continue
+                rank[rank_key] = state_dict[rank_key].shape[1]
+
+        if network_alphas is not None:
+            alpha_keys = [k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix]
+            network_alphas = {k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys}
+
+        lora_config_kwargs = get_peft_kwargs(rank, network_alphas, state_dict, is_unet=False)
+
+        if "use_dora" in lora_config_kwargs:
+            if lora_config_kwargs["use_dora"]:
+                if is_peft_version("<", "0.9.0"):
+                    raise ValueError(
+                        "You need `peft` 0.9.0 at least to use DoRA-enabled LoRAs. Please upgrade your installation of `peft`."
+                    )
+            else:
+                if is_peft_version("<", "0.9.0"):
+                    lora_config_kwargs.pop("use_dora")
+
+        if "lora_bias" in lora_config_kwargs:
+            if lora_config_kwargs["lora_bias"]:
+                if is_peft_version("<=", "0.13.2"):
+                    raise ValueError(
+                        "You need `peft` 0.14.0 at least to use `bias` in LoRAs. Please upgrade your installation of `peft`."
+                    )
+            else:
+                if is_peft_version("<=", "0.13.2"):
+                    lora_config_kwargs.pop("lora_bias")
+
+        lora_config = LoraConfig(**lora_config_kwargs)
+
+        # adapter_name
+        if adapter_name is None:
+            adapter_name = get_adapter_name(text_encoder)
+
+        is_model_cpu_offload, is_sequential_cpu_offload = _func_optionally_disable_offloading(_pipeline)
+
+        # inject LoRA layers and load the state dict
+        # in transformers we automatically check whether the adapter name is already in use or not
+        text_encoder.load_adapter(
+            adapter_name=adapter_name,
+            adapter_state_dict=state_dict,
+            peft_config=lora_config,
+            **peft_kwargs,
+        )
+
+        # scale LoRA layers with `lora_scale`
+        scale_lora_layers(text_encoder, weight=lora_scale)
+
+        text_encoder.to(device=text_encoder.device, dtype=text_encoder.dtype)
+
+        # Offload back.
+        if is_model_cpu_offload:
+            _pipeline.enable_model_cpu_offload()
+        elif is_sequential_cpu_offload:
+            _pipeline.enable_sequential_cpu_offload()
+        # Unsafe code />
+
+    if prefix is not None and not state_dict:
+        logger.warning(
+            f"No LoRA keys associated to {text_encoder.__class__.__name__} found with the {prefix=}. "
+            "This is safe to ignore if LoRA state dict didn't originally have any "
+            f"{text_encoder.__class__.__name__} related params. You can also try specifying `prefix=None` "
+            "to resolve the warning. Otherwise, open an issue if you think it's unexpected: "
+            "https://github.com/huggingface/diffusers/issues/new"
+        )
+
+
+def _func_optionally_disable_offloading(_pipeline):
+    is_model_cpu_offload = False
+    is_sequential_cpu_offload = False
+
+    if _pipeline is not None and _pipeline.hf_device_map is None:
+        for _, component in _pipeline.components.items():
+            if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
+                if not is_model_cpu_offload:
+                    is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
+                if not is_sequential_cpu_offload:
+                    is_sequential_cpu_offload = (
+                        isinstance(component._hf_hook, AlignDevicesHook)
+                        or hasattr(component._hf_hook, "hooks")
+                        and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
+                    )
+
+                logger.info(
+                    "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again."
+                )
+                remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
+
+    return (is_model_cpu_offload, is_sequential_cpu_offload)
+
+
+class LoraBaseMixin:
+    """Utility class for handling LoRAs."""
+
+    _lora_loadable_modules = []
+    num_fused_loras = 0
+
+    def load_lora_weights(self, **kwargs):
+        raise NotImplementedError("`load_lora_weights()` is not implemented.")
+
+    @classmethod
+    def save_lora_weights(cls, **kwargs):
+        raise NotImplementedError("`save_lora_weights()` is not implemented.")
+
+    @classmethod
+    def lora_state_dict(cls, **kwargs):
+        raise NotImplementedError("`lora_state_dict()` is not implemented.")
+
+    @classmethod
+    def _optionally_disable_offloading(cls, _pipeline):
+        """
+        Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU.
+
+        Args:
+            _pipeline (`DiffusionPipeline`):
+                The pipeline to disable offloading for.
+
+        Returns:
+            tuple:
+                A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True.
+        """
+        return _func_optionally_disable_offloading(_pipeline=_pipeline)
+
+    @classmethod
+    def _fetch_state_dict(cls, *args, **kwargs):
+        deprecation_message = f"Using the `_fetch_state_dict()` method from {cls} has been deprecated and will be removed in a future version. Please use `from diffusers.loaders.lora_base import _fetch_state_dict`."
+        deprecate("_fetch_state_dict", "0.35.0", deprecation_message)
+        return _fetch_state_dict(*args, **kwargs)
+
+    @classmethod
+    def _best_guess_weight_name(cls, *args, **kwargs):
+        deprecation_message = f"Using the `_best_guess_weight_name()` method from {cls} has been deprecated and will be removed in a future version. Please use `from diffusers.loaders.lora_base import _best_guess_weight_name`."
+        deprecate("_best_guess_weight_name", "0.35.0", deprecation_message)
+        return _best_guess_weight_name(*args, **kwargs)
+
+    def unload_lora_weights(self):
+        """
+        Unloads the LoRA parameters.
+
+        Examples:
+
+        ```python
+        >>> # Assuming `pipeline` is already loaded with the LoRA parameters.
+        >>> pipeline.unload_lora_weights()
+        >>> ...
+        ```
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if model is not None:
+                if issubclass(model.__class__, ModelMixin):
+                    model.unload_lora()
+                elif issubclass(model.__class__, PreTrainedModel):
+                    _remove_text_encoder_monkey_patch(model)
+
+    def fuse_lora(
+        self,
+        components: List[str] = [],
+        lora_scale: float = 1.0,
+        safe_fusing: bool = False,
+        adapter_names: Optional[List[str]] = None,
+        **kwargs,
+    ):
+        r"""
+        Fuses the LoRA parameters into the original parameters of the corresponding blocks.
+
+        <Tip warning={true}>
+
+        This is an experimental API.
+
+        </Tip>
+
+        Args:
+            components (`List[str]`): List of LoRA-injectable components to fuse the LoRAs into.
+            lora_scale (`float`, defaults to 1.0):
+                Controls how much to influence the outputs with the LoRA parameters.
+            safe_fusing (`bool`, defaults to `False`):
+                Whether to check the fused weights for NaN values before fusing, and to not fuse them if NaNs are found.
+            adapter_names (`List[str]`, *optional*):
+                Adapter names to be used for fusing. If nothing is passed, all active adapters will be fused.
+
+        Example:
+
+        ```py
+        from diffusers import DiffusionPipeline
+        import torch
+
+        pipeline = DiffusionPipeline.from_pretrained(
+            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+        ).to("cuda")
+        pipeline.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
+        pipeline.fuse_lora(lora_scale=0.7)
+        ```
+        """
+        if "fuse_unet" in kwargs:
+            depr_message = "Passing `fuse_unet` to `fuse_lora()` is deprecated and will be ignored. Please use the `components` argument and provide a list of the components whose LoRAs are to be fused. `fuse_unet` will be removed in a future version."
+            deprecate(
+                "fuse_unet",
+                "1.0.0",
+                depr_message,
+            )
+        if "fuse_transformer" in kwargs:
+            depr_message = "Passing `fuse_transformer` to `fuse_lora()` is deprecated and will be ignored. Please use the `components` argument and provide a list of the components whose LoRAs are to be fused. `fuse_transformer` will be removed in a future version."
+            deprecate(
+                "fuse_transformer",
+                "1.0.0",
+                depr_message,
+            )
+        if "fuse_text_encoder" in kwargs:
+            depr_message = "Passing `fuse_text_encoder` to `fuse_lora()` is deprecated and will be ignored. Please use the `components` argument and provide a list of the components whose LoRAs are to be fused. `fuse_text_encoder` will be removed in a future version."
+            deprecate(
+                "fuse_text_encoder",
+                "1.0.0",
+                depr_message,
+            )
+
+        if len(components) == 0:
+            raise ValueError("`components` cannot be an empty list.")
+
+        for fuse_component in components:
+            if fuse_component not in self._lora_loadable_modules:
+                raise ValueError(f"{fuse_component} is not found in {self._lora_loadable_modules=}.")
+
+            model = getattr(self, fuse_component, None)
+            if model is not None:
+                # check if diffusers model
+                if issubclass(model.__class__, ModelMixin):
+                    model.fuse_lora(lora_scale, safe_fusing=safe_fusing, adapter_names=adapter_names)
+                # handle transformers models.
+                if issubclass(model.__class__, PreTrainedModel):
+                    fuse_text_encoder_lora(
+                        model, lora_scale=lora_scale, safe_fusing=safe_fusing, adapter_names=adapter_names
+                    )
+
+        self.num_fused_loras += 1
+
+    def unfuse_lora(self, components: List[str] = [], **kwargs):
+        r"""
+        Reverses the effect of
+        [`pipe.fuse_lora()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraBaseMixin.fuse_lora).
+
+        <Tip warning={true}>
+
+        This is an experimental API.
+
+        </Tip>
+
+        Args:
+            components (`List[str]`): List of LoRA-injectable components to unfuse LoRA from.
+            unfuse_unet (`bool`, defaults to `True`): Whether to unfuse the UNet LoRA parameters.
+            unfuse_text_encoder (`bool`, defaults to `True`):
+                Whether to unfuse the text encoder LoRA parameters. If the text encoder wasn't monkey-patched with the
+                LoRA parameters then it won't have any effect.
+        """
+        if "unfuse_unet" in kwargs:
+            depr_message = "Passing `unfuse_unet` to `unfuse_lora()` is deprecated and will be ignored. Please use the `components` argument. `unfuse_unet` will be removed in a future version."
+            deprecate(
+                "unfuse_unet",
+                "1.0.0",
+                depr_message,
+            )
+        if "unfuse_transformer" in kwargs:
+            depr_message = "Passing `unfuse_transformer` to `unfuse_lora()` is deprecated and will be ignored. Please use the `components` argument. `unfuse_transformer` will be removed in a future version."
+            deprecate(
+                "unfuse_transformer",
+                "1.0.0",
+                depr_message,
+            )
+        if "unfuse_text_encoder" in kwargs:
+            depr_message = "Passing `unfuse_text_encoder` to `unfuse_lora()` is deprecated and will be ignored. Please use the `components` argument. `unfuse_text_encoder` will be removed in a future version."
+            deprecate(
+                "unfuse_text_encoder",
+                "1.0.0",
+                depr_message,
+            )
+
+        if len(components) == 0:
+            raise ValueError("`components` cannot be an empty list.")
+
+        for fuse_component in components:
+            if fuse_component not in self._lora_loadable_modules:
+                raise ValueError(f"{fuse_component} is not found in {self._lora_loadable_modules=}.")
+
+            model = getattr(self, fuse_component, None)
+            if model is not None:
+                if issubclass(model.__class__, (ModelMixin, PreTrainedModel)):
+                    for module in model.modules():
+                        if isinstance(module, BaseTunerLayer):
+                            module.unmerge()
+
+        self.num_fused_loras -= 1
+
+    def set_adapters(
+        self,
+        adapter_names: Union[List[str], str],
+        adapter_weights: Optional[Union[float, Dict, List[float], List[Dict]]] = None,
+    ):
+        if isinstance(adapter_weights, dict):
+            components_passed = set(adapter_weights.keys())
+            lora_components = set(self._lora_loadable_modules)
+
+            invalid_components = sorted(components_passed - lora_components)
+            if invalid_components:
+                logger.warning(
+                    f"The following components in `adapter_weights` are not part of the pipeline: {invalid_components}. "
+                    f"Available components that are LoRA-compatible: {self._lora_loadable_modules}. So, weights belonging "
+                    "to the invalid components will be removed and ignored."
+                )
+                adapter_weights = {k: v for k, v in adapter_weights.items() if k not in invalid_components}
+
+        adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names
+        adapter_weights = copy.deepcopy(adapter_weights)
+
+        # Expand weights into a list, one entry per adapter
+        if not isinstance(adapter_weights, list):
+            adapter_weights = [adapter_weights] * len(adapter_names)
+
+        if len(adapter_names) != len(adapter_weights):
+            raise ValueError(
+                f"Length of adapter names {len(adapter_names)} is not equal to the length of the weights {len(adapter_weights)}"
+            )
+
+        list_adapters = self.get_list_adapters()  # eg {"unet": ["adapter1", "adapter2"], "text_encoder": ["adapter2"]}
+        # eg {"adapter1", "adapter2"}
+        all_adapters = {adapter for adapters in list_adapters.values() for adapter in adapters}
+        missing_adapters = set(adapter_names) - all_adapters
+        if len(missing_adapters) > 0:
+            raise ValueError(
+                f"Adapter name(s) {missing_adapters} not in the list of present adapters: {all_adapters}."
+            )
+
+        # eg {"adapter1": ["unet"], "adapter2": ["unet", "text_encoder"]}
+        invert_list_adapters = {
+            adapter: [part for part, adapters in list_adapters.items() if adapter in adapters]
+            for adapter in all_adapters
+        }
+
+        # Decompose weights into weights for denoiser and text encoders.
+        _component_adapter_weights = {}
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component)
+
+            for adapter_name, weights in zip(adapter_names, adapter_weights):
+                if isinstance(weights, dict):
+                    component_adapter_weights = weights.pop(component, None)
+                    if component_adapter_weights is not None and component not in invert_list_adapters[adapter_name]:
+                        logger.warning(
+                            (
+                                f"Lora weight dict for adapter '{adapter_name}' contains {component}, "
+                                f"but this will be ignored because {adapter_name} does not contain weights for {component}. "
+                                f"Valid parts for {adapter_name} are: {invert_list_adapters[adapter_name]}."
+                            )
+                        )
+
+                else:
+                    component_adapter_weights = weights
+
+                _component_adapter_weights.setdefault(component, [])
+                _component_adapter_weights[component].append(component_adapter_weights)
+
+            if issubclass(model.__class__, ModelMixin):
+                model.set_adapters(adapter_names, _component_adapter_weights[component])
+            elif issubclass(model.__class__, PreTrainedModel):
+                set_adapters_for_text_encoder(adapter_names, model, _component_adapter_weights[component])
+
+    def disable_lora(self):
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if model is not None:
+                if issubclass(model.__class__, ModelMixin):
+                    model.disable_lora()
+                elif issubclass(model.__class__, PreTrainedModel):
+                    disable_lora_for_text_encoder(model)
+
+    def enable_lora(self):
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if model is not None:
+                if issubclass(model.__class__, ModelMixin):
+                    model.enable_lora()
+                elif issubclass(model.__class__, PreTrainedModel):
+                    enable_lora_for_text_encoder(model)
+
+    def delete_adapters(self, adapter_names: Union[List[str], str]):
+        """
+        Deletes the LoRA layers of `adapter_names` for the unet and text-encoder(s).
+
+        Args:
+            adapter_names (`Union[List[str], str]`):
+                The names of the adapters to delete. Can be a single string or a list of strings.
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        if isinstance(adapter_names, str):
+            adapter_names = [adapter_names]
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if model is not None:
+                if issubclass(model.__class__, ModelMixin):
+                    model.delete_adapters(adapter_names)
+                elif issubclass(model.__class__, PreTrainedModel):
+                    for adapter_name in adapter_names:
+                        delete_adapter_layers(model, adapter_name)
+
+    def get_active_adapters(self) -> List[str]:
+        """
+        Gets the list of the current active adapters.
+
+        Example:
+
+        ```python
+        from diffusers import DiffusionPipeline
+
+        pipeline = DiffusionPipeline.from_pretrained(
+            "stabilityai/stable-diffusion-xl-base-1.0",
+        ).to("cuda")
+        pipeline.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
+        pipeline.get_active_adapters()
+        ```
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError(
+                "PEFT backend is required for this method. Please install the latest version of PEFT `pip install -U peft`"
+            )
+
+        active_adapters = []
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if model is not None and issubclass(model.__class__, ModelMixin):
+                for module in model.modules():
+                    if isinstance(module, BaseTunerLayer):
+                        active_adapters = module.active_adapters
+                        break
+
+        return active_adapters
+
+    def get_list_adapters(self) -> Dict[str, List[str]]:
+        """
+        Gets the current list of all available adapters in the pipeline.
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError(
+                "PEFT backend is required for this method. Please install the latest version of PEFT `pip install -U peft`"
+            )
+
+        set_adapters = {}
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if (
+                model is not None
+                and issubclass(model.__class__, (ModelMixin, PreTrainedModel))
+                and hasattr(model, "peft_config")
+            ):
+                set_adapters[component] = list(model.peft_config.keys())
+
+        return set_adapters
+
+    def set_lora_device(self, adapter_names: List[str], device: Union[torch.device, str, int]) -> None:
+        """
+        Moves the LoRAs listed in `adapter_names` to a target device. Useful for offloading the LoRA to the CPU in case
+        you want to load multiple adapters and free some GPU memory.
+
+        Args:
+            adapter_names (`List[str]`):
+                List of adapters to send to the device.
+            device (`Union[torch.device, str, int]`):
+                Device to send the adapters to. Can be either a torch device, a str or an integer.
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        for component in self._lora_loadable_modules:
+            model = getattr(self, component, None)
+            if model is not None:
+                for module in model.modules():
+                    if isinstance(module, BaseTunerLayer):
+                        for adapter_name in adapter_names:
+                            module.lora_A[adapter_name].to(device)
+                            module.lora_B[adapter_name].to(device)
+                            # this is a param, not a module, so device placement is not in-place -> re-assign
+                            if hasattr(module, "lora_magnitude_vector") and module.lora_magnitude_vector is not None:
+                                if adapter_name in module.lora_magnitude_vector:
+                                    module.lora_magnitude_vector[adapter_name] = module.lora_magnitude_vector[
+                                        adapter_name
+                                    ].to(device)
+
+    @staticmethod
+    def pack_weights(layers, prefix):
+        layers_weights = layers.state_dict() if isinstance(layers, torch.nn.Module) else layers
+        layers_state_dict = {f"{prefix}.{module_name}": param for module_name, param in layers_weights.items()}
+        return layers_state_dict
+
+    @staticmethod
+    def write_lora_layers(
+        state_dict: Dict[str, torch.Tensor],
+        save_directory: str,
+        is_main_process: bool,
+        weight_name: str,
+        save_function: Callable,
+        safe_serialization: bool,
+    ):
+        if os.path.isfile(save_directory):
+            logger.error(f"Provided path ({save_directory}) should be a directory, not a file")
+            return
+
+        if save_function is None:
+            if safe_serialization:
+
+                def save_function(weights, filename):
+                    return safetensors.torch.save_file(weights, filename, metadata={"format": "pt"})
+
+            else:
+                save_function = torch.save
+
+        os.makedirs(save_directory, exist_ok=True)
+
+        if weight_name is None:
+            if safe_serialization:
+                weight_name = LORA_WEIGHT_NAME_SAFE
+            else:
+                weight_name = LORA_WEIGHT_NAME
+
+        save_path = Path(save_directory, weight_name).as_posix()
+        save_function(state_dict, save_path)
+        logger.info(f"Model weights saved in {save_path}")
+
+    @property
+    def lora_scale(self) -> float:
+        # property function that returns the lora scale which can be set at run time by the pipeline.
+        # if _lora_scale has not been set, return 1
+        return self._lora_scale if hasattr(self, "_lora_scale") else 1.0
+
+    def enable_lora_hotswap(self, **kwargs) -> None:
+        """Enables the possibility to hotswap LoRA adapters.
+
+        Calling this method is only required when hotswapping adapters and either the model is compiled or the ranks
+        of the loaded adapters differ.
+
+        Args:
+            target_rank (`int`):
+                The highest rank among all the adapters that will be loaded.
+            check_compiled (`str`, *optional*, defaults to `"error"`):
+                How to handle the case when the model is already compiled, which should generally be avoided. The
+                options are:
+                  - "error" (default): raise an error
+                  - "warn": issue a warning
+                  - "ignore": do nothing
+        """
+        for key, component in self.components.items():
+            if hasattr(component, "enable_lora_hotswap") and (key in self._lora_loadable_modules):
+                component.enable_lora_hotswap(**kwargs)
@@ -17,7 +17,7 @@ from typing import List

 import torch

-from ..utils import is_peft_version, logging, state_dict_all_zero
+from ...utils import is_peft_version, logging, state_dict_all_zero


 logger = logging.get_logger(__name__)
@@ -433,7 +433,7 @@ def _convert_kohya_flux_lora_to_diffusers(state_dict):
        ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]
        if not is_sparse:
            # down_weight is copied to each split
-            ait_sd.update({k: down_weight for k in ait_down_keys})
+            ait_sd.update(dict.fromkeys(ait_down_keys, down_weight))

            # up_weight is split to each split
            ait_sd.update({k: v for k, v in zip(ait_up_keys, torch.split(up_weight, dims, dim=0))})  # noqa: C416
@@ -923,7 +923,7 @@ def _convert_xlabs_flux_lora_to_diffusers(old_state_dict):
        ait_up_keys = [k + ".lora_B.weight" for k in ait_keys]

        # down_weight is copied to each split
-        ait_sd.update({k: down_weight for k in ait_down_keys})
+        ait_sd.update(dict.fromkeys(ait_down_keys, down_weight))

        # up_weight is split to each split
        ait_sd.update({k: v for k, v in zip(ait_up_keys, torch.split(up_weight, dims, dim=0))})  # noqa: C416
@@ -12,924 +12,66 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import copy
-import inspect
-import os
-from pathlib import Path
-from typing import Callable, Dict, List, Optional, Union

-import safetensors
-import torch
-import torch.nn as nn
-from huggingface_hub import model_info
-from huggingface_hub.constants import HF_HUB_OFFLINE
-
-from ..models.modeling_utils import ModelMixin, load_state_dict
-from ..utils import (
-    USE_PEFT_BACKEND,
-    _get_model_file,
-    convert_state_dict_to_diffusers,
-    convert_state_dict_to_peft,
-    delete_adapter_layers,
-    deprecate,
-    get_adapter_name,
-    get_peft_kwargs,
-    is_accelerate_available,
-    is_peft_available,
-    is_peft_version,
-    is_transformers_available,
-    is_transformers_version,
-    logging,
-    recurse_remove_peft_layers,
-    scale_lora_layers,
-    set_adapter_layers,
-    set_weights_and_activate_adapters,
-)
-
-
-if is_transformers_available():
-    from transformers import PreTrainedModel
-
-    from ..models.lora import text_encoder_attn_modules, text_encoder_mlp_modules
-
-if is_peft_available():
-    from peft.tuners.tuners_utils import BaseTunerLayer
-
-if is_accelerate_available():
-    from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module
-
-logger = logging.get_logger(__name__)
-
-LORA_WEIGHT_NAME = "pytorch_lora_weights.bin"
-LORA_WEIGHT_NAME_SAFE = "pytorch_lora_weights.safetensors"
+from ..utils import deprecate
+from .lora.lora_base import LORA_WEIGHT_NAME, LORA_WEIGHT_NAME_SAFE, LoraBaseMixin  # noqa: F401


 def fuse_text_encoder_lora(text_encoder, lora_scale=1.0, safe_fusing=False, adapter_names=None):
-    """
-    Fuses LoRAs for the text encoder.
+    from .lora.lora_base import fuse_text_encoder_lora

-    Args:
-        text_encoder (`torch.nn.Module`):
-            The text encoder module to set the adapter layers for. If `None`, it will try to get the `text_encoder`
-            attribute.
-        lora_scale (`float`, defaults to 1.0):
-            Controls how much to influence the outputs with the LoRA parameters.
-        safe_fusing (`bool`, defaults to `False`):
-            Whether to check fused weights for NaN values before fusing and if values are NaN not fusing them.
-        adapter_names (`List[str]` or `str`):
-            The names of the adapters to use.
-    """
-    merge_kwargs = {"safe_merge": safe_fusing}
+    deprecation_message = "Importing `fuse_text_encoder_lora()` from diffusers.loaders.lora_base has been deprecated. Please use `from diffusers.loaders.lora.lora_base import fuse_text_encoder_lora` instead."
+    deprecate("diffusers.loaders.lora_base.fuse_text_encoder_lora", "0.36", deprecation_message)

-    for module in text_encoder.modules():
-        if isinstance(module, BaseTunerLayer):
-            if lora_scale != 1.0:
-                module.scale_layer(lora_scale)
-
-            # For BC with previous PEFT versions, we need to check the signature
-            # of the `merge` method to see if it supports the `adapter_names` argument.
-            supported_merge_kwargs = list(inspect.signature(module.merge).parameters)
-            if "adapter_names" in supported_merge_kwargs:
-                merge_kwargs["adapter_names"] = adapter_names
-            elif "adapter_names" not in supported_merge_kwargs and adapter_names is not None:
-                raise ValueError(
-                    "The `adapter_names` argument is not supported with your PEFT version. "
-                    "Please upgrade to the latest version of PEFT. `pip install -U peft`"
-                )
-
-            module.merge(**merge_kwargs)
+    return fuse_text_encoder_lora(
+        text_encoder, lora_scale=lora_scale, safe_fusing=safe_fusing, adapter_names=adapter_names
+    )


 def unfuse_text_encoder_lora(text_encoder):
-    """
-    Unfuses LoRAs for the text encoder.
+    from .lora.lora_base import unfuse_text_encoder_lora

-    Args:
-        text_encoder (`torch.nn.Module`):
-            The text encoder module to set the adapter layers for. If `None`, it will try to get the `text_encoder`
-            attribute.
-    """
-    for module in text_encoder.modules():
-        if isinstance(module, BaseTunerLayer):
-            module.unmerge()
+    deprecation_message = "Importing `unfuse_text_encoder_lora()` from diffusers.loaders.lora_base has been deprecated. Please use `from diffusers.loaders.lora.lora_base import unfuse_text_encoder_lora` instead."
+    deprecate("diffusers.loaders.lora_base.unfuse_text_encoder_lora", "0.36", deprecation_message)
+
+    return unfuse_text_encoder_lora(text_encoder)


 def set_adapters_for_text_encoder(
-    adapter_names: Union[List[str], str],
-    text_encoder: Optional["PreTrainedModel"] = None,  # noqa: F821
-    text_encoder_weights: Optional[Union[float, List[float], List[None]]] = None,
+    adapter_names,
+    text_encoder=None,
+    text_encoder_weights=None,
 ):
-    """
-    Sets the adapter layers for the text encoder.
+    from .lora.lora_base import set_adapters_for_text_encoder

-    Args:
-        adapter_names (`List[str]` or `str`):
-            The names of the adapters to use.
-        text_encoder (`torch.nn.Module`, *optional*):
-            The text encoder module to set the adapter layers for. If `None`, it will try to get the `text_encoder`
-            attribute.
-        text_encoder_weights (`List[float]`, *optional*):
-            The weights to use for the text encoder. If `None`, the weights are set to `1.0` for all the adapters.
-    """
-    if text_encoder is None:
-        raise ValueError(
-            "The pipeline does not have a default `pipe.text_encoder` class. Please make sure to pass a `text_encoder` instead."
-        )
+    deprecation_message = "Importing `set_adapters_for_text_encoder()` from diffusers.loaders.lora_base has been deprecated. Please use `from diffusers.loaders.lora.lora_base import set_adapters_for_text_encoder` instead."
+    deprecate("diffusers.loaders.lora_base.set_adapters_for_text_encoder", "0.36", deprecation_message)

-    def process_weights(adapter_names, weights):
-        # Expand weights into a list, one entry per adapter
-        # e.g. for 2 adapters:  7 -> [7,7] ; [3, None] -> [3, None]
-        if not isinstance(weights, list):
-            weights = [weights] * len(adapter_names)
-
-        if len(adapter_names) != len(weights):
-            raise ValueError(
-                f"Length of adapter names {len(adapter_names)} is not equal to the length of the weights {len(weights)}"
-            )
-
-        # Set None values to default of 1.0
-        # e.g. [7,7] -> [7,7] ; [3, None] -> [3,1]
-        weights = [w if w is not None else 1.0 for w in weights]
-
-        return weights
-
-    adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names
-    text_encoder_weights = process_weights(adapter_names, text_encoder_weights)
-    set_weights_and_activate_adapters(text_encoder, adapter_names, text_encoder_weights)
-
-
-def disable_lora_for_text_encoder(text_encoder: Optional["PreTrainedModel"] = None):
-    """
-    Disables the LoRA layers for the text encoder.
-
-    Args:
-        text_encoder (`torch.nn.Module`, *optional*):
-            The text encoder module to disable the LoRA layers for. If `None`, it will try to get the `text_encoder`
-            attribute.
-    """
-    if text_encoder is None:
-        raise ValueError("Text Encoder not found.")
-    set_adapter_layers(text_encoder, enabled=False)
-
-
-def enable_lora_for_text_encoder(text_encoder: Optional["PreTrainedModel"] = None):
-    """
-    Enables the LoRA layers for the text encoder.
-
-    Args:
-        text_encoder (`torch.nn.Module`, *optional*):
-            The text encoder module to enable the LoRA layers for. If `None`, it will try to get the `text_encoder`
-            attribute.
-    """
-    if text_encoder is None:
-        raise ValueError("Text Encoder not found.")
-    set_adapter_layers(text_encoder, enabled=True)
-
-
-def _remove_text_encoder_monkey_patch(text_encoder):
-    recurse_remove_peft_layers(text_encoder)
-    if getattr(text_encoder, "peft_config", None) is not None:
-        del text_encoder.peft_config
-        text_encoder._hf_peft_config_loaded = None
-
-
-def _fetch_state_dict(
-    pretrained_model_name_or_path_or_dict,
-    weight_name,
-    use_safetensors,
-    local_files_only,
-    cache_dir,
-    force_download,
-    proxies,
-    token,
-    revision,
-    subfolder,
-    user_agent,
-    allow_pickle,
-):
-    model_file = None
-    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-        # Let's first try to load .safetensors weights
-        if (use_safetensors and weight_name is None) or (
-            weight_name is not None and weight_name.endswith(".safetensors")
-        ):
-            try:
-                # Here we're relaxing the loading check to enable more Inference API
-                # friendliness where sometimes, it's not at all possible to automatically
-                # determine `weight_name`.
-                if weight_name is None:
-                    weight_name = _best_guess_weight_name(
-                        pretrained_model_name_or_path_or_dict,
-                        file_extension=".safetensors",
-                        local_files_only=local_files_only,
-                    )
-                model_file = _get_model_file(
-                    pretrained_model_name_or_path_or_dict,
-                    weights_name=weight_name or LORA_WEIGHT_NAME_SAFE,
-                    cache_dir=cache_dir,
-                    force_download=force_download,
-                    proxies=proxies,
-                    local_files_only=local_files_only,
-                    token=token,
-                    revision=revision,
-                    subfolder=subfolder,
-                    user_agent=user_agent,
-                )
-                state_dict = safetensors.torch.load_file(model_file, device="cpu")
-            except (IOError, safetensors.SafetensorError) as e:
-                if not allow_pickle:
-                    raise e
-                # try loading non-safetensors weights
-                model_file = None
-                pass
-
-        if model_file is None:
-            if weight_name is None:
-                weight_name = _best_guess_weight_name(
-                    pretrained_model_name_or_path_or_dict, file_extension=".bin", local_files_only=local_files_only
-                )
-            model_file = _get_model_file(
-                pretrained_model_name_or_path_or_dict,
-                weights_name=weight_name or LORA_WEIGHT_NAME,
-                cache_dir=cache_dir,
-                force_download=force_download,
-                proxies=proxies,
-                local_files_only=local_files_only,
-                token=token,
-                revision=revision,
-                subfolder=subfolder,
-                user_agent=user_agent,
-            )
-            state_dict = load_state_dict(model_file)
-    else:
-        state_dict = pretrained_model_name_or_path_or_dict
-
-    return state_dict
-
-
-def _best_guess_weight_name(
-    pretrained_model_name_or_path_or_dict, file_extension=".safetensors", local_files_only=False
-):
-    if local_files_only or HF_HUB_OFFLINE:
-        raise ValueError("When using the offline mode, you must specify a `weight_name`.")
-
-    targeted_files = []
-
-    if os.path.isfile(pretrained_model_name_or_path_or_dict):
-        return
-    elif os.path.isdir(pretrained_model_name_or_path_or_dict):
-        targeted_files = [f for f in os.listdir(pretrained_model_name_or_path_or_dict) if f.endswith(file_extension)]
-    else:
-        files_in_repo = model_info(pretrained_model_name_or_path_or_dict).siblings
-        targeted_files = [f.rfilename for f in files_in_repo if f.rfilename.endswith(file_extension)]
-    if len(targeted_files) == 0:
-        return
-
-    # "scheduler" does not correspond to a LoRA checkpoint.
-    # "optimizer" does not correspond to a LoRA checkpoint
-    # only top-level checkpoints are considered and not the other ones, hence "checkpoint".
-    unallowed_substrings = {"scheduler", "optimizer", "checkpoint"}
-    targeted_files = list(
-        filter(lambda x: all(substring not in x for substring in unallowed_substrings), targeted_files)
+    return set_adapters_for_text_encoder(
+        adapter_names=adapter_names, text_encoder=text_encoder, text_encoder_weights=text_encoder_weights
    )

-    if any(f.endswith(LORA_WEIGHT_NAME) for f in targeted_files):
-        targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME), targeted_files))
-    elif any(f.endswith(LORA_WEIGHT_NAME_SAFE) for f in targeted_files):
-        targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME_SAFE), targeted_files))

-    if len(targeted_files) > 1:
-        raise ValueError(
-            f"Provided path contains more than one weights file in the {file_extension} format. Either specify `weight_name` in `load_lora_weights` or make sure there's only one  `.safetensors` or `.bin` file in  {pretrained_model_name_or_path_or_dict}."
-        )
-    weight_name = targeted_files[0]
-    return weight_name
+def disable_lora_for_text_encoder(text_encoder=None):
+    from .lora.lora_base import disable_lora_for_text_encoder

+    deprecation_message = "Importing `disable_lora_for_text_encoder()` from diffusers.loaders.lora_base has been deprecated. Please use `from diffusers.loaders.lora.lora_base import disable_lora_for_text_encoder` instead."
+    deprecate("diffusers.loaders.lora_base.disable_lora_for_text_encoder", "0.36", deprecation_message)

-def _load_lora_into_text_encoder(
-    state_dict,
-    network_alphas,
-    text_encoder,
-    prefix=None,
-    lora_scale=1.0,
-    text_encoder_name="text_encoder",
-    adapter_name=None,
-    _pipeline=None,
-    low_cpu_mem_usage=False,
-    hotswap: bool = False,
-):
-    if not USE_PEFT_BACKEND:
-        raise ValueError("PEFT backend is required for this method.")
+    return disable_lora_for_text_encoder(text_encoder=text_encoder)

-    peft_kwargs = {}
-    if low_cpu_mem_usage:
-        if not is_peft_version(">=", "0.13.1"):
-            raise ValueError(
-                "`low_cpu_mem_usage=True` is not compatible with this `peft` version. Please update it with `pip install -U peft`."
-            )
-        if not is_transformers_version(">", "4.45.2"):
-            # Note from sayakpaul: It's not in `transformers` stable yet.
-            # https://github.com/huggingface/transformers/pull/33725/
-            raise ValueError(
-                "`low_cpu_mem_usage=True` is not compatible with this `transformers` version. Please update it with `pip install -U transformers`."
-            )
-        peft_kwargs["low_cpu_mem_usage"] = low_cpu_mem_usage

-    from peft import LoraConfig
+def enable_lora_for_text_encoder(text_encoder=None):
+    from .lora.lora_base import enable_lora_for_text_encoder

-    # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918),
-    # then the `state_dict` keys should have `unet_name` and/or `text_encoder_name` as
-    # their prefixes.
-    prefix = text_encoder_name if prefix is None else prefix
+    deprecation_message = "Importing `enable_lora_for_text_encoder()` from diffusers.loaders.lora_base has been deprecated. Please use `from diffusers.loaders.lora.lora_base import enable_lora_for_text_encoder` instead."
+    deprecate("diffusers.loaders.lora_base.enable_lora_for_text_encoder", "0.36", deprecation_message)

-    # Safe prefix to check with.
-    if hotswap and any(text_encoder_name in key for key in state_dict.keys()):
-        raise ValueError("At the moment, hotswapping is not supported for text encoders, please pass `hotswap=False`.")
+    return enable_lora_for_text_encoder(text_encoder=text_encoder)

-    # Load the layers corresponding to text encoder and make necessary adjustments.
-    if prefix is not None:
-        state_dict = {k[len(f"{prefix}.") :]: v for k, v in state_dict.items() if k.startswith(f"{prefix}.")}

-    if len(state_dict) > 0:
-        logger.info(f"Loading {prefix}.")
-        rank = {}
-        state_dict = convert_state_dict_to_diffusers(state_dict)
-
-        # convert state dict
-        state_dict = convert_state_dict_to_peft(state_dict)
-
-        for name, _ in text_encoder_attn_modules(text_encoder):
-            for module in ("out_proj", "q_proj", "k_proj", "v_proj"):
-                rank_key = f"{name}.{module}.lora_B.weight"
-                if rank_key not in state_dict:
-                    continue
-                rank[rank_key] = state_dict[rank_key].shape[1]
-
-        for name, _ in text_encoder_mlp_modules(text_encoder):
-            for module in ("fc1", "fc2"):
-                rank_key = f"{name}.{module}.lora_B.weight"
-                if rank_key not in state_dict:
-                    continue
-                rank[rank_key] = state_dict[rank_key].shape[1]
-
-        if network_alphas is not None:
-            alpha_keys = [k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix]
-            network_alphas = {k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys}
-
-        lora_config_kwargs = get_peft_kwargs(rank, network_alphas, state_dict, is_unet=False)
-
-        if "use_dora" in lora_config_kwargs:
-            if lora_config_kwargs["use_dora"]:
-                if is_peft_version("<", "0.9.0"):
-                    raise ValueError(
-                        "You need `peft` 0.9.0 at least to use DoRA-enabled LoRAs. Please upgrade your installation of `peft`."
-                    )
-            else:
-                if is_peft_version("<", "0.9.0"):
-                    lora_config_kwargs.pop("use_dora")
-
-        if "lora_bias" in lora_config_kwargs:
-            if lora_config_kwargs["lora_bias"]:
-                if is_peft_version("<=", "0.13.2"):
-                    raise ValueError(
-                        "You need `peft` 0.14.0 at least to use `bias` in LoRAs. Please upgrade your installation of `peft`."
-                    )
-            else:
-                if is_peft_version("<=", "0.13.2"):
-                    lora_config_kwargs.pop("lora_bias")
-
-        lora_config = LoraConfig(**lora_config_kwargs)
-
-        # adapter_name
-        if adapter_name is None:
-            adapter_name = get_adapter_name(text_encoder)
-
-        is_model_cpu_offload, is_sequential_cpu_offload = _func_optionally_disable_offloading(_pipeline)
-
-        # inject LoRA layers and load the state dict
-        # in transformers we automatically check whether the adapter name is already in use or not
-        text_encoder.load_adapter(
-            adapter_name=adapter_name,
-            adapter_state_dict=state_dict,
-            peft_config=lora_config,
-            **peft_kwargs,
-        )
-
-        # scale LoRA layers with `lora_scale`
-        scale_lora_layers(text_encoder, weight=lora_scale)
-
-        text_encoder.to(device=text_encoder.device, dtype=text_encoder.dtype)
-
-        # Offload back.
-        if is_model_cpu_offload:
-            _pipeline.enable_model_cpu_offload()
-        elif is_sequential_cpu_offload:
-            _pipeline.enable_sequential_cpu_offload()
-        # Unsafe code />
-
-    if prefix is not None and not state_dict:
-        logger.warning(
-            f"No LoRA keys associated to {text_encoder.__class__.__name__} found with the {prefix=}. "
-            "This is safe to ignore if LoRA state dict didn't originally have any "
-            f"{text_encoder.__class__.__name__} related params. You can also try specifying `prefix=None` "
-            "to resolve the warning. Otherwise, open an issue if you think it's unexpected: "
-            "https://github.com/huggingface/diffusers/issues/new"
-        )
-
-
-def _func_optionally_disable_offloading(_pipeline):
-    is_model_cpu_offload = False
-    is_sequential_cpu_offload = False
-
-    if _pipeline is not None and _pipeline.hf_device_map is None:
-        for _, component in _pipeline.components.items():
-            if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
-                if not is_model_cpu_offload:
-                    is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
-                if not is_sequential_cpu_offload:
-                    is_sequential_cpu_offload = (
-                        isinstance(component._hf_hook, AlignDevicesHook)
-                        or hasattr(component._hf_hook, "hooks")
-                        and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
-                    )
-
-                logger.info(
-                    "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again."
-                )
-                remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
-
-    return (is_model_cpu_offload, is_sequential_cpu_offload)
-
-
-class LoraBaseMixin:
-    """Utility class for handling LoRAs."""
-
-    _lora_loadable_modules = []
-    num_fused_loras = 0
-
-    def load_lora_weights(self, **kwargs):
-        raise NotImplementedError("`load_lora_weights()` is not implemented.")
-
-    @classmethod
-    def save_lora_weights(cls, **kwargs):
-        raise NotImplementedError("`save_lora_weights()` is not implemented.")
-
-    @classmethod
-    def lora_state_dict(cls, **kwargs):
-        raise NotImplementedError("`lora_state_dict()` is not implemented.")
-
-    @classmethod
-    def _optionally_disable_offloading(cls, _pipeline):
-        """
-        Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU.
-
-        Args:
-            _pipeline (`DiffusionPipeline`):
-                The pipeline to disable offloading for.
-
-        Returns:
-            tuple:
-                A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True.
-        """
-        return _func_optionally_disable_offloading(_pipeline=_pipeline)
-
-    @classmethod
-    def _fetch_state_dict(cls, *args, **kwargs):
-        deprecation_message = f"Using the `_fetch_state_dict()` method from {cls} has been deprecated and will be removed in a future version. Please use `from diffusers.loaders.lora_base import _fetch_state_dict`."
-        deprecate("_fetch_state_dict", "0.35.0", deprecation_message)
-        return _fetch_state_dict(*args, **kwargs)
-
-    @classmethod
-    def _best_guess_weight_name(cls, *args, **kwargs):
-        deprecation_message = f"Using the `_best_guess_weight_name()` method from {cls} has been deprecated and will be removed in a future version. Please use `from diffusers.loaders.lora_base import _best_guess_weight_name`."
-        deprecate("_best_guess_weight_name", "0.35.0", deprecation_message)
-        return _best_guess_weight_name(*args, **kwargs)
-
-    def unload_lora_weights(self):
-        """
-        Unloads the LoRA parameters.
-
-        Examples:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the LoRA parameters.
-        >>> pipeline.unload_lora_weights()
-        >>> ...
-        ```
-        """
-        if not USE_PEFT_BACKEND:
-            raise ValueError("PEFT backend is required for this method.")
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if model is not None:
-                if issubclass(model.__class__, ModelMixin):
-                    model.unload_lora()
-                elif issubclass(model.__class__, PreTrainedModel):
-                    _remove_text_encoder_monkey_patch(model)
-
-    def fuse_lora(
-        self,
-        components: List[str] = [],
-        lora_scale: float = 1.0,
-        safe_fusing: bool = False,
-        adapter_names: Optional[List[str]] = None,
-        **kwargs,
-    ):
-        r"""
-        Fuses the LoRA parameters into the original parameters of the corresponding blocks.
-
-        <Tip warning={true}>
-
-        This is an experimental API.
-
-        </Tip>
-
-        Args:
-            components (`List[str]`): List of LoRA-injectable components to fuse the LoRAs into.
-            lora_scale (`float`, defaults to 1.0):
-                Controls how much to influence the outputs with the LoRA parameters.
-            safe_fusing (`bool`, defaults to `False`):
-                Whether to check the fused weights for NaN values before fusing, and to skip fusing if any are found.
-            adapter_names (`List[str]`, *optional*):
-                Adapter names to be used for fusing. If nothing is passed, all active adapters will be fused.
-
-        Example:
-
-        ```py
-        from diffusers import DiffusionPipeline
-        import torch
-
-        pipeline = DiffusionPipeline.from_pretrained(
-            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
-        ).to("cuda")
-        pipeline.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")
-        pipeline.fuse_lora(lora_scale=0.7)
-        ```
-        """
-        if "fuse_unet" in kwargs:
-            depr_message = "Passing `fuse_unet` to `fuse_lora()` is deprecated and will be ignored. Please use the `components` argument and provide a list of the components whose LoRAs are to be fused. `fuse_unet` will be removed in a future version."
-            deprecate(
-                "fuse_unet",
-                "1.0.0",
-                depr_message,
-            )
-        if "fuse_transformer" in kwargs:
-            depr_message = "Passing `fuse_transformer` to `fuse_lora()` is deprecated and will be ignored. Please use the `components` argument and provide a list of the components whose LoRAs are to be fused. `fuse_transformer` will be removed in a future version."
-            deprecate(
-                "fuse_transformer",
-                "1.0.0",
-                depr_message,
-            )
-        if "fuse_text_encoder" in kwargs:
-            depr_message = "Passing `fuse_text_encoder` to `fuse_lora()` is deprecated and will be ignored. Please use the `components` argument and provide a list of the components whose LoRAs are to be fused. `fuse_text_encoder` will be removed in a future version."
-            deprecate(
-                "fuse_text_encoder",
-                "1.0.0",
-                depr_message,
-            )
-
-        if len(components) == 0:
-            raise ValueError("`components` cannot be an empty list.")
-
-        for fuse_component in components:
-            if fuse_component not in self._lora_loadable_modules:
-                raise ValueError(f"{fuse_component} is not found in {self._lora_loadable_modules=}.")
-
-            model = getattr(self, fuse_component, None)
-            if model is not None:
-                # check if diffusers model
-                if issubclass(model.__class__, ModelMixin):
-                    model.fuse_lora(lora_scale, safe_fusing=safe_fusing, adapter_names=adapter_names)
-                # handle transformers models.
-                if issubclass(model.__class__, PreTrainedModel):
-                    fuse_text_encoder_lora(
-                        model, lora_scale=lora_scale, safe_fusing=safe_fusing, adapter_names=adapter_names
-                    )
-
-        self.num_fused_loras += 1
-
-    def unfuse_lora(self, components: List[str] = [], **kwargs):
-        r"""
-        Reverses the effect of
-        [`pipe.fuse_lora()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraBaseMixin.fuse_lora).
-
-        <Tip warning={true}>
-
-        This is an experimental API.
-
-        </Tip>
-
-        Args:
-            components (`List[str]`): List of LoRA-injectable components to unfuse LoRA from.
-            unfuse_unet (`bool`, defaults to `True`): Whether to unfuse the UNet LoRA parameters.
-            unfuse_text_encoder (`bool`, defaults to `True`):
-                Whether to unfuse the text encoder LoRA parameters. If the text encoder wasn't monkey-patched with the
-                LoRA parameters then it won't have any effect.
-        """
-        if "unfuse_unet" in kwargs:
-            depr_message = "Passing `unfuse_unet` to `unfuse_lora()` is deprecated and will be ignored. Please use the `components` argument. `unfuse_unet` will be removed in a future version."
-            deprecate(
-                "unfuse_unet",
-                "1.0.0",
-                depr_message,
-            )
-        if "unfuse_transformer" in kwargs:
-            depr_message = "Passing `unfuse_transformer` to `unfuse_lora()` is deprecated and will be ignored. Please use the `components` argument. `unfuse_transformer` will be removed in a future version."
-            deprecate(
-                "unfuse_transformer",
-                "1.0.0",
-                depr_message,
-            )
-        if "unfuse_text_encoder" in kwargs:
-            depr_message = "Passing `unfuse_text_encoder` to `unfuse_lora()` is deprecated and will be ignored. Please use the `components` argument. `unfuse_text_encoder` will be removed in a future version."
-            deprecate(
-                "unfuse_text_encoder",
-                "1.0.0",
-                depr_message,
-            )
-
-        if len(components) == 0:
-            raise ValueError("`components` cannot be an empty list.")
-
-        for fuse_component in components:
-            if fuse_component not in self._lora_loadable_modules:
-                raise ValueError(f"{fuse_component} is not found in {self._lora_loadable_modules=}.")
-
-            model = getattr(self, fuse_component, None)
-            if model is not None:
-                if issubclass(model.__class__, (ModelMixin, PreTrainedModel)):
-                    for module in model.modules():
-                        if isinstance(module, BaseTunerLayer):
-                            module.unmerge()
-
-        self.num_fused_loras -= 1
-
-    def set_adapters(
-        self,
-        adapter_names: Union[List[str], str],
-        adapter_weights: Optional[Union[float, Dict, List[float], List[Dict]]] = None,
-    ):
-        if isinstance(adapter_weights, dict):
-            components_passed = set(adapter_weights.keys())
-            lora_components = set(self._lora_loadable_modules)
-
-            invalid_components = sorted(components_passed - lora_components)
-            if invalid_components:
-                logger.warning(
-                    f"The following components in `adapter_weights` are not part of the pipeline: {invalid_components}. "
-                    f"Available components that are LoRA-compatible: {self._lora_loadable_modules}. So, weights belonging "
-                    "to the invalid components will be removed and ignored."
-                )
-                adapter_weights = {k: v for k, v in adapter_weights.items() if k not in invalid_components}
-
-        adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names
-        adapter_weights = copy.deepcopy(adapter_weights)
-
-        # Expand weights into a list, one entry per adapter
-        if not isinstance(adapter_weights, list):
-            adapter_weights = [adapter_weights] * len(adapter_names)
-
-        if len(adapter_names) != len(adapter_weights):
-            raise ValueError(
-                f"Length of adapter names {len(adapter_names)} is not equal to the length of the weights {len(adapter_weights)}"
-            )
-
-        list_adapters = self.get_list_adapters()  # eg {"unet": ["adapter1", "adapter2"], "text_encoder": ["adapter2"]}
-        # eg ["adapter1", "adapter2"]
-        all_adapters = {adapter for adapters in list_adapters.values() for adapter in adapters}
-        missing_adapters = set(adapter_names) - all_adapters
-        if len(missing_adapters) > 0:
-            raise ValueError(
-                f"Adapter name(s) {missing_adapters} not in the list of present adapters: {all_adapters}."
-            )
-
-        # eg {"adapter1": ["unet"], "adapter2": ["unet", "text_encoder"]}
-        invert_list_adapters = {
-            adapter: [part for part, adapters in list_adapters.items() if adapter in adapters]
-            for adapter in all_adapters
-        }
-
-        # Decompose weights into weights for denoiser and text encoders.
-        _component_adapter_weights = {}
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component)
-
-            for adapter_name, weights in zip(adapter_names, adapter_weights):
-                if isinstance(weights, dict):
-                    component_adapter_weights = weights.pop(component, None)
-                    if component_adapter_weights is not None and component not in invert_list_adapters[adapter_name]:
-                        logger.warning(
-                            (
-                                f"LoRA weight dict for adapter '{adapter_name}' contains {component}, "
-                                f"but this will be ignored because {adapter_name} does not contain weights for {component}. "
-                                f"Valid parts for {adapter_name} are: {invert_list_adapters[adapter_name]}."
-                            )
-                        )
-
-                else:
-                    component_adapter_weights = weights
-
-                _component_adapter_weights.setdefault(component, [])
-                _component_adapter_weights[component].append(component_adapter_weights)
-
-            if issubclass(model.__class__, ModelMixin):
-                model.set_adapters(adapter_names, _component_adapter_weights[component])
-            elif issubclass(model.__class__, PreTrainedModel):
-                set_adapters_for_text_encoder(adapter_names, model, _component_adapter_weights[component])
-
-    def disable_lora(self):
-        if not USE_PEFT_BACKEND:
-            raise ValueError("PEFT backend is required for this method.")
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if model is not None:
-                if issubclass(model.__class__, ModelMixin):
-                    model.disable_lora()
-                elif issubclass(model.__class__, PreTrainedModel):
-                    disable_lora_for_text_encoder(model)
-
-    def enable_lora(self):
-        if not USE_PEFT_BACKEND:
-            raise ValueError("PEFT backend is required for this method.")
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if model is not None:
-                if issubclass(model.__class__, ModelMixin):
-                    model.enable_lora()
-                elif issubclass(model.__class__, PreTrainedModel):
-                    enable_lora_for_text_encoder(model)
-
-    def delete_adapters(self, adapter_names: Union[List[str], str]):
-        """
-        Deletes the LoRA layers of `adapter_names` for the unet and text-encoder(s).
-
-        Args:
-            adapter_names (`Union[List[str], str]`):
-                The names of the adapters to delete. Can be a single string or a list of strings.
-        """
-        if not USE_PEFT_BACKEND:
-            raise ValueError("PEFT backend is required for this method.")
-
-        if isinstance(adapter_names, str):
-            adapter_names = [adapter_names]
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if model is not None:
-                if issubclass(model.__class__, ModelMixin):
-                    model.delete_adapters(adapter_names)
-                elif issubclass(model.__class__, PreTrainedModel):
-                    for adapter_name in adapter_names:
-                        delete_adapter_layers(model, adapter_name)
-
-    def get_active_adapters(self) -> List[str]:
-        """
-        Gets the list of the current active adapters.
-
-        Example:
-
-        ```python
-        from diffusers import DiffusionPipeline
-
-        pipeline = DiffusionPipeline.from_pretrained(
-            "stabilityai/stable-diffusion-xl-base-1.0",
-        ).to("cuda")
-        pipeline.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
-        pipeline.get_active_adapters()
-        ```
-        """
-        if not USE_PEFT_BACKEND:
-            raise ValueError(
-                "PEFT backend is required for this method. Please install the latest version of PEFT `pip install -U peft`"
-            )
-
-        active_adapters = []
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if model is not None and issubclass(model.__class__, ModelMixin):
-                for module in model.modules():
-                    if isinstance(module, BaseTunerLayer):
-                        active_adapters = module.active_adapters
-                        break
-
-        return active_adapters
-
-    def get_list_adapters(self) -> Dict[str, List[str]]:
-        """
-        Gets the current list of all available adapters in the pipeline.
-        """
-        if not USE_PEFT_BACKEND:
-            raise ValueError(
-                "PEFT backend is required for this method. Please install the latest version of PEFT `pip install -U peft`"
-            )
-
-        set_adapters = {}
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if (
-                model is not None
-                and issubclass(model.__class__, (ModelMixin, PreTrainedModel))
-                and hasattr(model, "peft_config")
-            ):
-                set_adapters[component] = list(model.peft_config.keys())
-
-        return set_adapters
-
-    def set_lora_device(self, adapter_names: List[str], device: Union[torch.device, str, int]) -> None:
-        """
-        Moves the LoRAs listed in `adapter_names` to a target device. Useful for offloading the LoRA to the CPU in case
-        you want to load multiple adapters and free some GPU memory.
-
-        Args:
-            adapter_names (`List[str]`):
-                List of adapters to move to the device.
-            device (`Union[torch.device, str, int]`):
-                Device to send the adapters to. Can be either a torch device, a str or an integer.
-        """
-        if not USE_PEFT_BACKEND:
-            raise ValueError("PEFT backend is required for this method.")
-
-        for component in self._lora_loadable_modules:
-            model = getattr(self, component, None)
-            if model is not None:
-                for module in model.modules():
-                    if isinstance(module, BaseTunerLayer):
-                        for adapter_name in adapter_names:
-                            module.lora_A[adapter_name].to(device)
-                            module.lora_B[adapter_name].to(device)
-                            # this is a param, not a module, so device placement is not in-place -> re-assign
-                            if hasattr(module, "lora_magnitude_vector") and module.lora_magnitude_vector is not None:
-                                if adapter_name in module.lora_magnitude_vector:
-                                    module.lora_magnitude_vector[adapter_name] = module.lora_magnitude_vector[
-                                        adapter_name
-                                    ].to(device)
-
-    @staticmethod
-    def pack_weights(layers, prefix):
-        layers_weights = layers.state_dict() if isinstance(layers, torch.nn.Module) else layers
-        layers_state_dict = {f"{prefix}.{module_name}": param for module_name, param in layers_weights.items()}
-        return layers_state_dict
-
-    @staticmethod
-    def write_lora_layers(
-        state_dict: Dict[str, torch.Tensor],
-        save_directory: str,
-        is_main_process: bool,
-        weight_name: str,
-        save_function: Callable,
-        safe_serialization: bool,
-    ):
-        if os.path.isfile(save_directory):
-            logger.error(f"Provided path ({save_directory}) should be a directory, not a file")
-            return
-
-        if save_function is None:
-            if safe_serialization:
-
-                def save_function(weights, filename):
-                    return safetensors.torch.save_file(weights, filename, metadata={"format": "pt"})
-
-            else:
-                save_function = torch.save
-
-        os.makedirs(save_directory, exist_ok=True)
-
-        if weight_name is None:
-            if safe_serialization:
-                weight_name = LORA_WEIGHT_NAME_SAFE
-            else:
-                weight_name = LORA_WEIGHT_NAME
-
-        save_path = Path(save_directory, weight_name).as_posix()
-        save_function(state_dict, save_path)
-        logger.info(f"Model weights saved in {save_path}")
-
-    @property
-    def lora_scale(self) -> float:
-        # property function that returns the lora scale which can be set at run time by the pipeline.
-        # if _lora_scale has not been set, return 1
-        return self._lora_scale if hasattr(self, "_lora_scale") else 1.0
-
-    def enable_lora_hotswap(self, **kwargs) -> None:
-        """Enables the possibility to hotswap LoRA adapters.
-
-        Calling this method is only required when hotswapping adapters and if the model is compiled or if the ranks of
-        the loaded adapters differ.
-
-        Args:
-            target_rank (`int`):
-                The highest rank among all the adapters that will be loaded.
-            check_compiled (`str`, *optional*, defaults to `"error"`):
-                How to handle the case when the model is already compiled, which should generally be avoided. The
-                options are:
-                  - "error" (default): raise an error
-                  - "warn": issue a warning
-                  - "ignore": do nothing
-        """
-        for key, component in self.components.items():
-            if hasattr(component, "enable_lora_hotswap") and (key in self._lora_loadable_modules):
-                component.enable_lora_hotswap(**kwargs)
+class LoraBaseMixin(LoraBaseMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `LoraBaseMixin` from diffusers.loaders.lora_base has been deprecated. Please use `from diffusers.loaders.lora.lora_base import LoraBaseMixin` instead."
+        deprecate("diffusers.loaders.lora_base.LoraBaseMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)
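
The shim above keeps the old import path working by subclassing the relocated class and warning when it is instantiated. A minimal, self-contained sketch of that pattern follows. It assumes the standard-library `warnings` module in place of the diffusers `deprecate` helper, and `RelocatedMixin` and `DeprecatedAliasMixin` are hypothetical names used only for illustration.

```python
import warnings


class RelocatedMixin:
    """Hypothetical stand-in for the class at its new import location."""

    def __init__(self, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs


class DeprecatedAliasMixin(RelocatedMixin):
    """Hypothetical old-location alias: same behavior, plus a warning on instantiation."""

    def __init__(self, *args, **kwargs):
        # Assumption: a plain FutureWarning replaces the project's own deprecation helper.
        warnings.warn(
            "Importing from the old module path is deprecated; use the new path instead.",
            FutureWarning,
            stacklevel=2,  # attribute the warning to the caller, not to this shim
        )
        super().__init__(*args, **kwargs)
```

In this sketch the warning fires only when `__init__` runs, so importing the alias or calling a classmethod on it stays silent.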
@@ -35,8 +35,8 @@ from ..utils import (
    set_adapter_layers,
    set_weights_and_activate_adapters,
 )
-from .lora_base import _fetch_state_dict, _func_optionally_disable_offloading
-from .unet_loader_utils import _maybe_expand_lora_scales
+from .lora.lora_base import _fetch_state_dict, _func_optionally_disable_offloading
+from .unet.unet_loader_utils import _maybe_expand_lora_scales


 logger = logging.get_logger(__name__)
@@ -56,6 +56,7 @@ _SET_ADAPTER_SCALE_FN_MAPPING = {
    "Lumina2Transformer2DModel": lambda model_cls, weights: weights,
    "WanTransformer3DModel": lambda model_cls, weights: weights,
    "CogView4Transformer2DModel": lambda model_cls, weights: weights,
+    "HiDreamImageTransformer2DModel": lambda model_cls, weights: weights,
 }


@@ -98,7 +99,7 @@ class PeftAdapterMixin:
    _prepare_lora_hotswap_kwargs: Optional[dict] = None

    @classmethod
-    # Copied from diffusers.loaders.lora_base.LoraBaseMixin._optionally_disable_offloading
+    # Copied from diffusers.loaders.lora.lora_base.LoraBaseMixin._optionally_disable_offloading
    def _optionally_disable_offloading(cls, _pipeline):
        """
        Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU.
@@ -11,42 +11,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import importlib
-import inspect
-import os
-
-import torch
-from huggingface_hub import snapshot_download
-from huggingface_hub.utils import LocalEntryNotFoundError, validate_hf_hub_args
-from packaging import version
-from typing_extensions import Self
-
-from ..utils import deprecate, is_transformers_available, logging
-from .single_file_utils import (
-    SingleFileComponentError,
-    _is_legacy_scheduler_kwargs,
-    _is_model_weights_in_cached_folder,
-    _legacy_load_clip_tokenizer,
-    _legacy_load_safety_checker,
-    _legacy_load_scheduler,
-    create_diffusers_clip_model_from_ldm,
-    create_diffusers_t5_model_from_checkpoint,
-    fetch_diffusers_config,
-    fetch_original_config,
-    is_clip_model_in_single_file,
-    is_t5_in_single_file,
-    load_single_file_checkpoint,
-)
-
-
-logger = logging.get_logger(__name__)
-
-# Legacy behaviour. `from_single_file` does not load the safety checker unless explicitly provided
-SINGLE_FILE_OPTIONAL_COMPONENTS = ["safety_checker"]
-
-if is_transformers_available():
-    import transformers
-    from transformers import PreTrainedModel, PreTrainedTokenizer
+from ..utils import deprecate
+from .single_file.single_file import FromSingleFileMixin


 def load_single_file_sub_model(
@@ -64,502 +30,30 @@ def load_single_file_sub_model(
    disable_mmap=False,
    **kwargs,
 ):
-    if is_pipeline_module:
-        pipeline_module = getattr(pipelines, library_name)
-        class_obj = getattr(pipeline_module, class_name)
-    else:
-        # else we just import it from the library.
-        library = importlib.import_module(library_name)
-        class_obj = getattr(library, class_name)
+    from .single_file.single_file import load_single_file_sub_model

-    if is_transformers_available():
-        transformers_version = version.parse(version.parse(transformers.__version__).base_version)
-    else:
-        transformers_version = "N/A"
+    deprecation_message = "Importing `load_single_file_sub_model()` from diffusers.loaders.single_file has been deprecated. Please use `from diffusers.loaders.single_file.single_file import load_single_file_sub_model` instead."
+    deprecate("diffusers.loaders.single_file.load_single_file_sub_model", "0.36", deprecation_message)

-    is_transformers_model = (
-        is_transformers_available()
-        and issubclass(class_obj, PreTrainedModel)
-        and transformers_version >= version.parse("4.20.0")
-    )
-    is_tokenizer = (
-        is_transformers_available()
-        and issubclass(class_obj, PreTrainedTokenizer)
-        and transformers_version >= version.parse("4.20.0")
+    return load_single_file_sub_model(
+        library_name,
+        class_name,
+        name,
+        checkpoint,
+        pipelines,
+        is_pipeline_module,
+        cached_model_config_path,
+        original_config,
+        local_files_only,
+        torch_dtype,
+        is_legacy_loading,
+        disable_mmap,
+        **kwargs,
    )

-    diffusers_module = importlib.import_module(__name__.split(".")[0])
-    is_diffusers_single_file_model = issubclass(class_obj, diffusers_module.FromOriginalModelMixin)
-    is_diffusers_model = issubclass(class_obj, diffusers_module.ModelMixin)
-    is_diffusers_scheduler = issubclass(class_obj, diffusers_module.SchedulerMixin)

-    if is_diffusers_single_file_model:
-        load_method = getattr(class_obj, "from_single_file")
-
-        # We cannot provide two different config options to the `from_single_file` method
-        # Here we have to ignore loading the config from `cached_model_config_path` if `original_config` is provided
-        if original_config:
-            cached_model_config_path = None
-
-        loaded_sub_model = load_method(
-            pretrained_model_link_or_path_or_dict=checkpoint,
-            original_config=original_config,
-            config=cached_model_config_path,
-            subfolder=name,
-            torch_dtype=torch_dtype,
-            local_files_only=local_files_only,
-            disable_mmap=disable_mmap,
-            **kwargs,
-        )
-
-    elif is_transformers_model and is_clip_model_in_single_file(class_obj, checkpoint):
-        loaded_sub_model = create_diffusers_clip_model_from_ldm(
-            class_obj,
-            checkpoint=checkpoint,
-            config=cached_model_config_path,
-            subfolder=name,
-            torch_dtype=torch_dtype,
-            local_files_only=local_files_only,
-            is_legacy_loading=is_legacy_loading,
-        )
-
-    elif is_transformers_model and is_t5_in_single_file(checkpoint):
-        loaded_sub_model = create_diffusers_t5_model_from_checkpoint(
-            class_obj,
-            checkpoint=checkpoint,
-            config=cached_model_config_path,
-            subfolder=name,
-            torch_dtype=torch_dtype,
-            local_files_only=local_files_only,
-        )
-
-    elif is_tokenizer and is_legacy_loading:
-        loaded_sub_model = _legacy_load_clip_tokenizer(
-            class_obj, checkpoint=checkpoint, config=cached_model_config_path, local_files_only=local_files_only
-        )
-
-    elif is_diffusers_scheduler and (is_legacy_loading or _is_legacy_scheduler_kwargs(kwargs)):
-        loaded_sub_model = _legacy_load_scheduler(
-            class_obj, checkpoint=checkpoint, component_name=name, original_config=original_config, **kwargs
-        )
-
-    else:
-        if not hasattr(class_obj, "from_pretrained"):
-            raise ValueError(
-                (
-                    f"The component {class_obj.__name__} cannot be loaded as it does not seem to have"
-                    " a supported loading method."
-                )
-            )
-
-        loading_kwargs = {}
-        loading_kwargs.update(
-            {
-                "pretrained_model_name_or_path": cached_model_config_path,
-                "subfolder": name,
-                "local_files_only": local_files_only,
-            }
-        )
-
-        # Schedulers and Tokenizers don't make use of torch_dtype
-        # Skip passing it to those objects
-        if issubclass(class_obj, torch.nn.Module):
-            loading_kwargs.update({"torch_dtype": torch_dtype})
-
-        if is_diffusers_model or is_transformers_model:
-            if not _is_model_weights_in_cached_folder(cached_model_config_path, name):
-                raise SingleFileComponentError(
-                    f"Failed to load {class_name}. Weights for this component appear to be missing in the checkpoint."
-                )
-
-        load_method = getattr(class_obj, "from_pretrained")
-        loaded_sub_model = load_method(**loading_kwargs)
-
-    return loaded_sub_model
-
-
-def _map_component_types_to_config_dict(component_types):
-    diffusers_module = importlib.import_module(__name__.split(".")[0])
-    config_dict = {}
-    component_types.pop("self", None)
-
-    if is_transformers_available():
-        transformers_version = version.parse(version.parse(transformers.__version__).base_version)
-    else:
-        transformers_version = "N/A"
-
-    for component_name, component_value in component_types.items():
-        is_diffusers_model = issubclass(component_value[0], diffusers_module.ModelMixin)
-        is_scheduler_enum = component_value[0].__name__ == "KarrasDiffusionSchedulers"
-        is_scheduler = issubclass(component_value[0], diffusers_module.SchedulerMixin)
-
-        is_transformers_model = (
-            is_transformers_available()
-            and issubclass(component_value[0], PreTrainedModel)
-            and transformers_version >= version.parse("4.20.0")
-        )
-        is_transformers_tokenizer = (
-            is_transformers_available()
-            and issubclass(component_value[0], PreTrainedTokenizer)
-            and transformers_version >= version.parse("4.20.0")
-        )
-
-        if is_diffusers_model and component_name not in SINGLE_FILE_OPTIONAL_COMPONENTS:
-            config_dict[component_name] = ["diffusers", component_value[0].__name__]
-
-        elif is_scheduler_enum or is_scheduler:
-            if is_scheduler_enum:
-                # Since we cannot fetch a scheduler config from the hub, we default to DDIMScheduler
-                # if the type hint is a KarrassDiffusionSchedulers enum
-                config_dict[component_name] = ["diffusers", "DDIMScheduler"]
-
-            elif is_scheduler:
-                config_dict[component_name] = ["diffusers", component_value[0].__name__]
-
-        elif (
-            is_transformers_model or is_transformers_tokenizer
-        ) and component_name not in SINGLE_FILE_OPTIONAL_COMPONENTS:
-            config_dict[component_name] = ["transformers", component_value[0].__name__]
-
-        else:
-            config_dict[component_name] = [None, None]
-
-    return config_dict
-
-
-def _infer_pipeline_config_dict(pipeline_class):
-    parameters = inspect.signature(pipeline_class.__init__).parameters
-    required_parameters = {k: v for k, v in parameters.items() if v.default == inspect._empty}
-    component_types = pipeline_class._get_signature_types()
-
-    # Ignore parameters that are not required for the pipeline
-    component_types = {k: v for k, v in component_types.items() if k in required_parameters}
-    config_dict = _map_component_types_to_config_dict(component_types)
-
-    return config_dict
-
-
-def _download_diffusers_model_config_from_hub(
-    pretrained_model_name_or_path,
-    cache_dir,
-    revision,
-    proxies,
-    force_download=None,
-    local_files_only=None,
-    token=None,
-):
-    allow_patterns = ["**/*.json", "*.json", "*.txt", "**/*.txt", "**/*.model"]
-    cached_model_path = snapshot_download(
-        pretrained_model_name_or_path,
-        cache_dir=cache_dir,
-        revision=revision,
-        proxies=proxies,
-        force_download=force_download,
-        local_files_only=local_files_only,
-        token=token,
-        allow_patterns=allow_patterns,
-    )
-
-    return cached_model_path
-
-
-class FromSingleFileMixin:
-    """
-    Load model weights saved in the `.ckpt` format into a [`DiffusionPipeline`].
-    """
-
-    @classmethod
-    @validate_hf_hub_args
-    def from_single_file(cls, pretrained_model_link_or_path, **kwargs) -> Self:
-        r"""
-        Instantiate a [`DiffusionPipeline`] from pretrained pipeline weights saved in the `.ckpt` or `.safetensors`
-        format. The pipeline is set in evaluation mode (`model.eval()`) by default.
-
-        Parameters:
-            pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*):
-                Can be either:
-                    - A link to the `.ckpt` file (for example
-                      `"https://huggingface.co/<repo_id>/blob/main/<path_to_file>.ckpt"`) on the Hub.
-                    - A path to a *file* containing all pipeline weights.
-            torch_dtype (`str` or `torch.dtype`, *optional*):
-                Override the default `torch.dtype` and load the model with another dtype.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to `True`, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            original_config_file (`str`, *optional*):
-                The path to the original config file that was used to train the model. If not provided, the config file
-                will be inferred from the checkpoint file.
-            config (`str`, *optional*):
-                Can be either:
-                    - A string, the *repo id* (for example `CompVis/ldm-text2im-large-256`) of a pretrained pipeline
-                      hosted on the Hub.
-                    - A path to a *directory* (for example `./my_pipeline_directory/`) containing the pipeline
-                      component configs in Diffusers format.
-            disable_mmap ('bool', *optional*, defaults to 'False'):
-                Whether to disable mmap when loading a Safetensors model. This option can perform better when the model
-                is on a network mount or hard drive.
-            kwargs (remaining dictionary of keyword arguments, *optional*):
-                Can be used to overwrite load and saveable variables (the pipeline components of the specific pipeline
-                class). The overwritten components are passed directly to the pipelines `__init__` method. See example
-                below for more information.
-
-        Examples:
-
-        ```py
-        >>> from diffusers import StableDiffusionPipeline
-
-        >>> # Download pipeline from huggingface.co and cache.
-        >>> pipeline = StableDiffusionPipeline.from_single_file(
-        ...     "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors"
-        ... )
-
-        >>> # Download pipeline from local file
-        >>> # file is downloaded under ./v1-5-pruned-emaonly.ckpt
-        >>> pipeline = StableDiffusionPipeline.from_single_file("./v1-5-pruned-emaonly.ckpt")
-
-        >>> # Enable float16 and move to GPU
-        >>> pipeline = StableDiffusionPipeline.from_single_file(
-        ...     "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt",
-        ...     torch_dtype=torch.float16,
-        ... )
-        >>> pipeline.to("cuda")
-        ```
-
-        """
-        original_config_file = kwargs.pop("original_config_file", None)
-        config = kwargs.pop("config", None)
-        original_config = kwargs.pop("original_config", None)
-
-        if original_config_file is not None:
-            deprecation_message = (
-                "`original_config_file` argument is deprecated and will be removed in future versions."
-                "please use the `original_config` argument instead."
-            )
-            deprecate("original_config_file", "1.0.0", deprecation_message)
-            original_config = original_config_file
-
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        token = kwargs.pop("token", None)
-        cache_dir = kwargs.pop("cache_dir", None)
-        local_files_only = kwargs.pop("local_files_only", False)
-        revision = kwargs.pop("revision", None)
-        torch_dtype = kwargs.pop("torch_dtype", None)
-        disable_mmap = kwargs.pop("disable_mmap", False)
-
-        is_legacy_loading = False
-
-        if torch_dtype is not None and not isinstance(torch_dtype, torch.dtype):
-            torch_dtype = torch.float32
-            logger.warning(
-                f"Passed `torch_dtype` {torch_dtype} is not a `torch.dtype`. Defaulting to `torch.float32`."
-            )
-
-        # We shouldn't allow configuring individual models components through a Pipeline creation method
-        # These model kwargs should be deprecated
-        scaling_factor = kwargs.get("scaling_factor", None)
-        if scaling_factor is not None:
-            deprecation_message = (
-                "Passing the `scaling_factor` argument to `from_single_file is deprecated "
-                "and will be ignored in future versions."
-            )
-            deprecate("scaling_factor", "1.0.0", deprecation_message)
-
-        if original_config is not None:
-            original_config = fetch_original_config(original_config, local_files_only=local_files_only)
-
-        from ..pipelines.pipeline_utils import _get_pipeline_class
-
-        pipeline_class = _get_pipeline_class(cls, config=None)
-
-        checkpoint = load_single_file_checkpoint(
-            pretrained_model_link_or_path,
-            force_download=force_download,
-            proxies=proxies,
-            token=token,
-            cache_dir=cache_dir,
-            local_files_only=local_files_only,
-            revision=revision,
-            disable_mmap=disable_mmap,
-        )
-
-        if config is None:
-            config = fetch_diffusers_config(checkpoint)
-            default_pretrained_model_config_name = config["pretrained_model_name_or_path"]
-        else:
-            default_pretrained_model_config_name = config
-
-        if not os.path.isdir(default_pretrained_model_config_name):
-            # Provided config is a repo_id
-            if default_pretrained_model_config_name.count("/") > 1:
-                raise ValueError(
-                    f'The provided config "{config}"'
-                    " is neither a valid local path nor a valid repo id. Please check the parameter."
-                )
-            try:
-                # Attempt to download the config files for the pipeline
-                cached_model_config_path = _download_diffusers_model_config_from_hub(
-                    default_pretrained_model_config_name,
-                    cache_dir=cache_dir,
-                    revision=revision,
-                    proxies=proxies,
-                    force_download=force_download,
-                    local_files_only=local_files_only,
-                    token=token,
-                )
-                config_dict = pipeline_class.load_config(cached_model_config_path)
-
-            except LocalEntryNotFoundError:
-                # `local_files_only=True` but a local diffusers format model config is not available in the cache
-                # If `original_config` is not provided, we need override `local_files_only` to False
-                # to fetch the config files from the hub so that we have a way
-                # to configure the pipeline components.
-
-                if original_config is None:
-                    logger.warning(
-                        "`local_files_only` is True but no local configs were found for this checkpoint.\n"
-                        "Attempting to download the necessary config files for this pipeline.\n"
-                    )
-                    cached_model_config_path = _download_diffusers_model_config_from_hub(
-                        default_pretrained_model_config_name,
-                        cache_dir=cache_dir,
-                        revision=revision,
-                        proxies=proxies,
-                        force_download=force_download,
-                        local_files_only=False,
-                        token=token,
-                    )
-                    config_dict = pipeline_class.load_config(cached_model_config_path)
-
-                else:
-                    # For backwards compatibility
-                    # If `original_config` is provided, then we need to assume we are using legacy loading for pipeline components
-                    logger.warning(
-                        "Detected legacy `from_single_file` loading behavior. Attempting to create the pipeline based on inferred components.\n"
-                        "This may lead to errors if the model components are not correctly inferred. \n"
-                        "To avoid this warning, please explicity pass the `config` argument to `from_single_file` with a path to a local diffusers model repo \n"
-                        "e.g. `from_single_file(<my model checkpoint path>, config=<path to local diffusers model repo>) \n"
-                        "or run `from_single_file` with `local_files_only=False` first to update the local cache directory with "
-                        "the necessary config files.\n"
-                    )
-                    is_legacy_loading = True
-                    cached_model_config_path = None
-
-                    config_dict = _infer_pipeline_config_dict(pipeline_class)
-                    config_dict["_class_name"] = pipeline_class.__name__
-
-        else:
-            # Provided config is a path to a local directory attempt to load directly.
-            cached_model_config_path = default_pretrained_model_config_name
-            config_dict = pipeline_class.load_config(cached_model_config_path)
-
-        #   pop out "_ignore_files" as it is only needed for download
-        config_dict.pop("_ignore_files", None)
-
-        expected_modules, optional_kwargs = pipeline_class._get_signature_keys(cls)
-        passed_class_obj = {k: kwargs.pop(k) for k in expected_modules if k in kwargs}
-        passed_pipe_kwargs = {k: kwargs.pop(k) for k in optional_kwargs if k in kwargs}
-
-        init_dict, unused_kwargs, _ = pipeline_class.extract_init_dict(config_dict, **kwargs)
-        init_kwargs = {k: init_dict.pop(k) for k in optional_kwargs if k in init_dict}
-        init_kwargs = {**init_kwargs, **passed_pipe_kwargs}
-
-        from diffusers import pipelines
-
-        # remove `null` components
-        def load_module(name, value):
-            if value[0] is None:
-                return False
-            if name in passed_class_obj and passed_class_obj[name] is None:
-                return False
-            if name in SINGLE_FILE_OPTIONAL_COMPONENTS:
-                return False
-
-            return True
-
-        init_dict = {k: v for k, v in init_dict.items() if load_module(k, v)}
-
-        for name, (library_name, class_name) in logging.tqdm(
-            sorted(init_dict.items()), desc="Loading pipeline components..."
-        ):
-            loaded_sub_model = None
-            is_pipeline_module = hasattr(pipelines, library_name)
-
-            if name in passed_class_obj:
-                loaded_sub_model = passed_class_obj[name]
-
-            else:
-                try:
-                    loaded_sub_model = load_single_file_sub_model(
-                        library_name=library_name,
-                        class_name=class_name,
-                        name=name,
-                        checkpoint=checkpoint,
-                        is_pipeline_module=is_pipeline_module,
-                        cached_model_config_path=cached_model_config_path,
-                        pipelines=pipelines,
-                        torch_dtype=torch_dtype,
-                        original_config=original_config,
-                        local_files_only=local_files_only,
-                        is_legacy_loading=is_legacy_loading,
-                        disable_mmap=disable_mmap,
-                        **kwargs,
-                    )
-                except SingleFileComponentError as e:
-                    raise SingleFileComponentError(
-                        (
-                            f"{e.message}\n"
-                            f"Please load the component before passing it in as an argument to `from_single_file`.\n"
-                            f"\n"
-                            f"{name} = {class_name}.from_pretrained('...')\n"
-                            f"pipe = {pipeline_class.__name__}.from_single_file(<checkpoint path>, {name}={name})\n"
-                            f"\n"
-                        )
-                    )
-
-            init_kwargs[name] = loaded_sub_model
-
-        missing_modules = set(expected_modules) - set(init_kwargs.keys())
-        passed_modules = list(passed_class_obj.keys())
-        optional_modules = pipeline_class._optional_components
-
-        if len(missing_modules) > 0 and missing_modules <= set(passed_modules + optional_modules):
-            for module in missing_modules:
-                init_kwargs[module] = passed_class_obj.get(module, None)
-        elif len(missing_modules) > 0:
-            passed_modules = set(list(init_kwargs.keys()) + list(passed_class_obj.keys())) - optional_kwargs
-            raise ValueError(
-                f"Pipeline {pipeline_class} expected {expected_modules}, but only {passed_modules} were passed."
-            )
-
-        # deprecated kwargs
-        load_safety_checker = kwargs.pop("load_safety_checker", None)
-        if load_safety_checker is not None:
-            deprecation_message = (
-                "Please pass instances of `StableDiffusionSafetyChecker` and `AutoImageProcessor`"
-                "using the `safety_checker` and `feature_extractor` arguments in `from_single_file`"
-            )
-            deprecate("load_safety_checker", "1.0.0", deprecation_message)
-
-            safety_checker_components = _legacy_load_safety_checker(local_files_only, torch_dtype)
-            init_kwargs.update(safety_checker_components)
-
-        pipe = pipeline_class(**init_kwargs)
-
-        return pipe
+class FromSingleFileMixin(FromSingleFileMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `FromSingleFileMixin` from diffusers.loaders.single_file has been deprecated. Please use `from diffusers.loaders.single_file.single_file import FromSingleFileMixin` instead."
+        deprecate("diffusers.loaders.single_file.FromSingleFileMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)
@@ -0,0 +1,8 @@
+from ...utils import is_torch_available, is_transformers_available
+
+
+if is_torch_available():
+    from .single_file_model import FromOriginalModelMixin
+
+    if is_transformers_available():
+        from .single_file import FromSingleFileMixin
@@ -0,0 +1,565 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import importlib
+import inspect
+import os
+
+import torch
+from huggingface_hub import snapshot_download
+from huggingface_hub.utils import LocalEntryNotFoundError, validate_hf_hub_args
+from packaging import version
+from typing_extensions import Self
+
+from ...utils import deprecate, is_transformers_available, logging
+from .single_file_utils import (
+    SingleFileComponentError,
+    _is_legacy_scheduler_kwargs,
+    _is_model_weights_in_cached_folder,
+    _legacy_load_clip_tokenizer,
+    _legacy_load_safety_checker,
+    _legacy_load_scheduler,
+    create_diffusers_clip_model_from_ldm,
+    create_diffusers_t5_model_from_checkpoint,
+    fetch_diffusers_config,
+    fetch_original_config,
+    is_clip_model_in_single_file,
+    is_t5_in_single_file,
+    load_single_file_checkpoint,
+)
+
+
+logger = logging.get_logger(__name__)
+
+# Legacy behaviour. `from_single_file` does not load the safety checker unless explicitly provided
+SINGLE_FILE_OPTIONAL_COMPONENTS = ["safety_checker"]
+
+if is_transformers_available():
+    import transformers
+    from transformers import PreTrainedModel, PreTrainedTokenizer
+
+
+def load_single_file_sub_model(
+    library_name,
+    class_name,
+    name,
+    checkpoint,
+    pipelines,
+    is_pipeline_module,
+    cached_model_config_path,
+    original_config=None,
+    local_files_only=False,
+    torch_dtype=None,
+    is_legacy_loading=False,
+    disable_mmap=False,
+    **kwargs,
+):
+    if is_pipeline_module:
+        pipeline_module = getattr(pipelines, library_name)
+        class_obj = getattr(pipeline_module, class_name)
+    else:
+        # else we just import it from the library.
+        library = importlib.import_module(library_name)
+        class_obj = getattr(library, class_name)
+
+    if is_transformers_available():
+        transformers_version = version.parse(version.parse(transformers.__version__).base_version)
+    else:
+        transformers_version = "N/A"
+
+    is_transformers_model = (
+        is_transformers_available()
+        and issubclass(class_obj, PreTrainedModel)
+        and transformers_version >= version.parse("4.20.0")
+    )
+    is_tokenizer = (
+        is_transformers_available()
+        and issubclass(class_obj, PreTrainedTokenizer)
+        and transformers_version >= version.parse("4.20.0")
+    )
+
+    diffusers_module = importlib.import_module(__name__.split(".")[0])
+    is_diffusers_single_file_model = issubclass(class_obj, diffusers_module.FromOriginalModelMixin)
+    is_diffusers_model = issubclass(class_obj, diffusers_module.ModelMixin)
+    is_diffusers_scheduler = issubclass(class_obj, diffusers_module.SchedulerMixin)
+
+    if is_diffusers_single_file_model:
+        load_method = getattr(class_obj, "from_single_file")
+
+        # We cannot provide two different config options to the `from_single_file` method
+        # Here we have to ignore loading the config from `cached_model_config_path` if `original_config` is provided
+        if original_config:
+            cached_model_config_path = None
+
+        loaded_sub_model = load_method(
+            pretrained_model_link_or_path_or_dict=checkpoint,
+            original_config=original_config,
+            config=cached_model_config_path,
+            subfolder=name,
+            torch_dtype=torch_dtype,
+            local_files_only=local_files_only,
+            disable_mmap=disable_mmap,
+            **kwargs,
+        )
+
+    elif is_transformers_model and is_clip_model_in_single_file(class_obj, checkpoint):
+        loaded_sub_model = create_diffusers_clip_model_from_ldm(
+            class_obj,
+            checkpoint=checkpoint,
+            config=cached_model_config_path,
+            subfolder=name,
+            torch_dtype=torch_dtype,
+            local_files_only=local_files_only,
+            is_legacy_loading=is_legacy_loading,
+        )
+
+    elif is_transformers_model and is_t5_in_single_file(checkpoint):
+        loaded_sub_model = create_diffusers_t5_model_from_checkpoint(
+            class_obj,
+            checkpoint=checkpoint,
+            config=cached_model_config_path,
+            subfolder=name,
+            torch_dtype=torch_dtype,
+            local_files_only=local_files_only,
+        )
+
+    elif is_tokenizer and is_legacy_loading:
+        loaded_sub_model = _legacy_load_clip_tokenizer(
+            class_obj, checkpoint=checkpoint, config=cached_model_config_path, local_files_only=local_files_only
+        )
+
+    elif is_diffusers_scheduler and (is_legacy_loading or _is_legacy_scheduler_kwargs(kwargs)):
+        loaded_sub_model = _legacy_load_scheduler(
+            class_obj, checkpoint=checkpoint, component_name=name, original_config=original_config, **kwargs
+        )
+
+    else:
+        if not hasattr(class_obj, "from_pretrained"):
+            raise ValueError(
+                (
+                    f"The component {class_obj.__name__} cannot be loaded as it does not seem to have"
+                    " a supported loading method."
+                )
+            )
+
+        loading_kwargs = {}
+        loading_kwargs.update(
+            {
+                "pretrained_model_name_or_path": cached_model_config_path,
+                "subfolder": name,
+                "local_files_only": local_files_only,
+            }
+        )
+
+        # Schedulers and Tokenizers don't make use of torch_dtype
+        # Skip passing it to those objects
+        if issubclass(class_obj, torch.nn.Module):
+            loading_kwargs.update({"torch_dtype": torch_dtype})
+
+        if is_diffusers_model or is_transformers_model:
+            if not _is_model_weights_in_cached_folder(cached_model_config_path, name):
+                raise SingleFileComponentError(
+                    f"Failed to load {class_name}. Weights for this component appear to be missing in the checkpoint."
+                )
+
+        load_method = getattr(class_obj, "from_pretrained")
+        loaded_sub_model = load_method(**loading_kwargs)
+
+    return loaded_sub_model
+
+
+def _map_component_types_to_config_dict(component_types):
+    diffusers_module = importlib.import_module(__name__.split(".")[0])
+    config_dict = {}
+    component_types.pop("self", None)
+
+    if is_transformers_available():
+        transformers_version = version.parse(version.parse(transformers.__version__).base_version)
+    else:
+        transformers_version = "N/A"
+
+    for component_name, component_value in component_types.items():
+        is_diffusers_model = issubclass(component_value[0], diffusers_module.ModelMixin)
+        is_scheduler_enum = component_value[0].__name__ == "KarrasDiffusionSchedulers"
+        is_scheduler = issubclass(component_value[0], diffusers_module.SchedulerMixin)
+
+        is_transformers_model = (
+            is_transformers_available()
+            and issubclass(component_value[0], PreTrainedModel)
+            and transformers_version >= version.parse("4.20.0")
+        )
+        is_transformers_tokenizer = (
+            is_transformers_available()
+            and issubclass(component_value[0], PreTrainedTokenizer)
+            and transformers_version >= version.parse("4.20.0")
+        )
+
+        if is_diffusers_model and component_name not in SINGLE_FILE_OPTIONAL_COMPONENTS:
+            config_dict[component_name] = ["diffusers", component_value[0].__name__]
+
+        elif is_scheduler_enum or is_scheduler:
+            if is_scheduler_enum:
+                # Since we cannot fetch a scheduler config from the hub, we default to DDIMScheduler
+                # if the type hint is a KarrasDiffusionSchedulers enum
+                config_dict[component_name] = ["diffusers", "DDIMScheduler"]
+
+            elif is_scheduler:
+                config_dict[component_name] = ["diffusers", component_value[0].__name__]
+
+        elif (
+            is_transformers_model or is_transformers_tokenizer
+        ) and component_name not in SINGLE_FILE_OPTIONAL_COMPONENTS:
+            config_dict[component_name] = ["transformers", component_value[0].__name__]
+
+        else:
+            config_dict[component_name] = [None, None]
+
+    return config_dict
+
+
+def _infer_pipeline_config_dict(pipeline_class):
+    parameters = inspect.signature(pipeline_class.__init__).parameters
+    required_parameters = {k: v for k, v in parameters.items() if v.default == inspect._empty}
+    component_types = pipeline_class._get_signature_types()
+
+    # Ignore parameters that are not required for the pipeline
+    component_types = {k: v for k, v in component_types.items() if k in required_parameters}
+    config_dict = _map_component_types_to_config_dict(component_types)
+
+    return config_dict
+
+
+def _download_diffusers_model_config_from_hub(
+    pretrained_model_name_or_path,
+    cache_dir,
+    revision,
+    proxies,
+    force_download=None,
+    local_files_only=None,
+    token=None,
+):
+    allow_patterns = ["**/*.json", "*.json", "*.txt", "**/*.txt", "**/*.model"]
+    cached_model_path = snapshot_download(
+        pretrained_model_name_or_path,
+        cache_dir=cache_dir,
+        revision=revision,
+        proxies=proxies,
+        force_download=force_download,
+        local_files_only=local_files_only,
+        token=token,
+        allow_patterns=allow_patterns,
+    )
+
+    return cached_model_path
+
+
+class FromSingleFileMixin:
+    """
+    Load model weights saved in the `.ckpt` or `.safetensors` format into a [`DiffusionPipeline`].
+    """
+
+    @classmethod
+    @validate_hf_hub_args
+    def from_single_file(cls, pretrained_model_link_or_path, **kwargs) -> Self:
+        r"""
+        Instantiate a [`DiffusionPipeline`] from pretrained pipeline weights saved in the `.ckpt` or `.safetensors`
+        format. The pipeline is set in evaluation mode (`model.eval()`) by default.
+
+        Parameters:
+            pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*):
+                Can be either:
+                    - A link to the `.ckpt` file (for example
+                      `"https://huggingface.co/<repo_id>/blob/main/<path_to_file>.ckpt"`) on the Hub.
+                    - A path to a *file* containing all pipeline weights.
+            torch_dtype (`str` or `torch.dtype`, *optional*):
+                Override the default `torch.dtype` and load the model with another dtype.
+            force_download (`bool`, *optional*, defaults to `False`):
+                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
+                cached versions if they exist.
+            cache_dir (`Union[str, os.PathLike]`, *optional*):
+                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
+                is not used.
+
+            proxies (`Dict[str, str]`, *optional*):
+                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
+            local_files_only (`bool`, *optional*, defaults to `False`):
+                Whether to only load local model weights and configuration files or not. If set to `True`, the model
+                won't be downloaded from the Hub.
+            token (`str` or `bool`, *optional*):
+                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
+                `diffusers-cli login` (stored in `~/.huggingface`) is used.
+            revision (`str`, *optional*, defaults to `"main"`):
+                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
+                allowed by Git.
+            original_config_file (`str`, *optional*):
+                The path to the original config file that was used to train the model. If not provided, the config file
+                will be inferred from the checkpoint file.
+            config (`str`, *optional*):
+                Can be either:
+                    - A string, the *repo id* (for example `CompVis/ldm-text2im-large-256`) of a pretrained pipeline
+                      hosted on the Hub.
+                    - A path to a *directory* (for example `./my_pipeline_directory/`) containing the pipeline
+                      component configs in Diffusers format.
+            disable_mmap (`bool`, *optional*, defaults to `False`):
+                Whether to disable mmap when loading a Safetensors model. This option can perform better when the model
+                is on a network mount or hard drive.
+            kwargs (remaining dictionary of keyword arguments, *optional*):
+                Can be used to overwrite load and saveable variables (the pipeline components of the specific pipeline
+                class). The overwritten components are passed directly to the pipeline's `__init__` method. See example
+                below for more information.
+
+        Examples:
+
+        ```py
+        >>> from diffusers import StableDiffusionPipeline
+
+        >>> # Download pipeline from huggingface.co and cache.
+        >>> pipeline = StableDiffusionPipeline.from_single_file(
+        ...     "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors"
+        ... )
+
+        >>> # Load pipeline from a local file
+        >>> # the file was previously downloaded to ./v1-5-pruned-emaonly.ckpt
+        >>> pipeline = StableDiffusionPipeline.from_single_file("./v1-5-pruned-emaonly.ckpt")
+
+        >>> # Enable float16 and move to GPU
+        >>> import torch
+        >>> pipeline = StableDiffusionPipeline.from_single_file(
+        ...     "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt",
+        ...     torch_dtype=torch.float16,
+        ... )
+        >>> pipeline.to("cuda")
+        ```
+
+        """
+        original_config_file = kwargs.pop("original_config_file", None)
+        config = kwargs.pop("config", None)
+        original_config = kwargs.pop("original_config", None)
+
+        if original_config_file is not None:
+            deprecation_message = (
+                "`original_config_file` argument is deprecated and will be removed in future versions. "
+                "Please use the `original_config` argument instead."
+            )
+            deprecate("original_config_file", "1.0.0", deprecation_message)
+            original_config = original_config_file
+
+        force_download = kwargs.pop("force_download", False)
+        proxies = kwargs.pop("proxies", None)
+        token = kwargs.pop("token", None)
+        cache_dir = kwargs.pop("cache_dir", None)
+        local_files_only = kwargs.pop("local_files_only", False)
+        revision = kwargs.pop("revision", None)
+        torch_dtype = kwargs.pop("torch_dtype", None)
+        disable_mmap = kwargs.pop("disable_mmap", False)
+
+        is_legacy_loading = False
+
+        if torch_dtype is not None and not isinstance(torch_dtype, torch.dtype):
+            # Warn before overwriting so the message reports the value that was actually passed.
+            logger.warning(
+                f"Passed `torch_dtype` {torch_dtype} is not a `torch.dtype`. Defaulting to `torch.float32`."
+            )
+            torch_dtype = torch.float32
+
+        # We shouldn't allow configuring individual models components through a Pipeline creation method
+        # These model kwargs should be deprecated
+        scaling_factor = kwargs.get("scaling_factor", None)
+        if scaling_factor is not None:
+            deprecation_message = (
+                "Passing the `scaling_factor` argument to `from_single_file` is deprecated "
+                "and will be ignored in future versions."
+            )
+            deprecate("scaling_factor", "1.0.0", deprecation_message)
+
+        if original_config is not None:
+            original_config = fetch_original_config(original_config, local_files_only=local_files_only)
+
+        from ..pipelines.pipeline_utils import _get_pipeline_class
+
+        pipeline_class = _get_pipeline_class(cls, config=None)
+
+        checkpoint = load_single_file_checkpoint(
+            pretrained_model_link_or_path,
+            force_download=force_download,
+            proxies=proxies,
+            token=token,
+            cache_dir=cache_dir,
+            local_files_only=local_files_only,
+            revision=revision,
+            disable_mmap=disable_mmap,
+        )
+
+        if config is None:
+            config = fetch_diffusers_config(checkpoint)
+            default_pretrained_model_config_name = config["pretrained_model_name_or_path"]
+        else:
+            default_pretrained_model_config_name = config
+
+        if not os.path.isdir(default_pretrained_model_config_name):
+            # Provided config is a repo_id
+            if default_pretrained_model_config_name.count("/") > 1:
+                raise ValueError(
+                    f'The provided config "{config}"'
+                    " is neither a valid local path nor a valid repo id. Please check the parameter."
+                )
+            try:
+                # Attempt to download the config files for the pipeline
+                cached_model_config_path = _download_diffusers_model_config_from_hub(
+                    default_pretrained_model_config_name,
+                    cache_dir=cache_dir,
+                    revision=revision,
+                    proxies=proxies,
+                    force_download=force_download,
+                    local_files_only=local_files_only,
+                    token=token,
+                )
+                config_dict = pipeline_class.load_config(cached_model_config_path)
+
+            except LocalEntryNotFoundError:
+                # `local_files_only=True` but a local diffusers format model config is not available in the cache
+                # If `original_config` is not provided, we need to override `local_files_only` to False
+                # to fetch the config files from the hub so that we have a way
+                # to configure the pipeline components.
+
+                if original_config is None:
+                    logger.warning(
+                        "`local_files_only` is True but no local configs were found for this checkpoint.\n"
+                        "Attempting to download the necessary config files for this pipeline.\n"
+                    )
+                    cached_model_config_path = _download_diffusers_model_config_from_hub(
+                        default_pretrained_model_config_name,
+                        cache_dir=cache_dir,
+                        revision=revision,
+                        proxies=proxies,
+                        force_download=force_download,
+                        local_files_only=False,
+                        token=token,
+                    )
+                    config_dict = pipeline_class.load_config(cached_model_config_path)
+
+                else:
+                    # For backwards compatibility
+                    # If `original_config` is provided, then we need to assume we are using legacy loading for pipeline components
+                    logger.warning(
+                        "Detected legacy `from_single_file` loading behavior. Attempting to create the pipeline based on inferred components.\n"
+                        "This may lead to errors if the model components are not correctly inferred. \n"
+                        "To avoid this warning, please explicitly pass the `config` argument to `from_single_file` with a path to a local diffusers model repo \n"
+                        "e.g. `from_single_file(<my model checkpoint path>, config=<path to local diffusers model repo>)` \n"
+                        "or run `from_single_file` with `local_files_only=False` first to update the local cache directory with "
+                        "the necessary config files.\n"
+                    )
+                    is_legacy_loading = True
+                    cached_model_config_path = None
+
+                    config_dict = _infer_pipeline_config_dict(pipeline_class)
+                    config_dict["_class_name"] = pipeline_class.__name__
+
+        else:
+            # Provided config is a path to a local directory; attempt to load it directly.
+            cached_model_config_path = default_pretrained_model_config_name
+            config_dict = pipeline_class.load_config(cached_model_config_path)
+
+        #   pop out "_ignore_files" as it is only needed for download
+        config_dict.pop("_ignore_files", None)
+
+        expected_modules, optional_kwargs = pipeline_class._get_signature_keys(cls)
+        passed_class_obj = {k: kwargs.pop(k) for k in expected_modules if k in kwargs}
+        passed_pipe_kwargs = {k: kwargs.pop(k) for k in optional_kwargs if k in kwargs}
+
+        init_dict, unused_kwargs, _ = pipeline_class.extract_init_dict(config_dict, **kwargs)
+        init_kwargs = {k: init_dict.pop(k) for k in optional_kwargs if k in init_dict}
+        init_kwargs = {**init_kwargs, **passed_pipe_kwargs}
+
+        from diffusers import pipelines
+
+        # remove `null` components
+        def load_module(name, value):
+            if value[0] is None:
+                return False
+            if name in passed_class_obj and passed_class_obj[name] is None:
+                return False
+            if name in SINGLE_FILE_OPTIONAL_COMPONENTS:
+                return False
+
+            return True
+
+        init_dict = {k: v for k, v in init_dict.items() if load_module(k, v)}
+
+        for name, (library_name, class_name) in logging.tqdm(
+            sorted(init_dict.items()), desc="Loading pipeline components..."
+        ):
+            loaded_sub_model = None
+            is_pipeline_module = hasattr(pipelines, library_name)
+
+            if name in passed_class_obj:
+                loaded_sub_model = passed_class_obj[name]
+
+            else:
+                try:
+                    loaded_sub_model = load_single_file_sub_model(
+                        library_name=library_name,
+                        class_name=class_name,
+                        name=name,
+                        checkpoint=checkpoint,
+                        is_pipeline_module=is_pipeline_module,
+                        cached_model_config_path=cached_model_config_path,
+                        pipelines=pipelines,
+                        torch_dtype=torch_dtype,
+                        original_config=original_config,
+                        local_files_only=local_files_only,
+                        is_legacy_loading=is_legacy_loading,
+                        disable_mmap=disable_mmap,
+                        **kwargs,
+                    )
+                except SingleFileComponentError as e:
+                    raise SingleFileComponentError(
+                        (
+                            f"{e.message}\n"
+                            f"Please load the component before passing it in as an argument to `from_single_file`.\n"
+                            f"\n"
+                            f"{name} = {class_name}.from_pretrained('...')\n"
+                            f"pipe = {pipeline_class.__name__}.from_single_file(<checkpoint path>, {name}={name})\n"
+                            f"\n"
+                        )
+                    )
+
+            init_kwargs[name] = loaded_sub_model
+
+        missing_modules = set(expected_modules) - set(init_kwargs.keys())
+        passed_modules = list(passed_class_obj.keys())
+        optional_modules = pipeline_class._optional_components
+
+        if len(missing_modules) > 0 and missing_modules <= set(passed_modules + optional_modules):
+            for module in missing_modules:
+                init_kwargs[module] = passed_class_obj.get(module, None)
+        elif len(missing_modules) > 0:
+            passed_modules = set(list(init_kwargs.keys()) + list(passed_class_obj.keys())) - optional_kwargs
+            raise ValueError(
+                f"Pipeline {pipeline_class} expected {expected_modules}, but only {passed_modules} were passed."
+            )
+
+        # deprecated kwargs
+        load_safety_checker = kwargs.pop("load_safety_checker", None)
+        if load_safety_checker is not None:
+            deprecation_message = (
+                "Please pass instances of `StableDiffusionSafetyChecker` and `AutoImageProcessor` "
+                "using the `safety_checker` and `feature_extractor` arguments in `from_single_file`."
+            )
+            deprecate("load_safety_checker", "1.0.0", deprecation_message)
+
+            safety_checker_components = _legacy_load_safety_checker(local_files_only, torch_dtype)
+            init_kwargs.update(safety_checker_components)
+
+        pipe = pipeline_class(**init_kwargs)
+
+        return pipe
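The closing `missing_modules` check in the pipeline-level `from_single_file` above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's code: `resolve_missing_modules` is a hypothetical helper name, and plain strings stand in for real model objects.

```python
def resolve_missing_modules(expected_modules, init_kwargs, passed_class_obj, optional_modules):
    """Fill tolerated gaps with the user-passed value (or None), else raise ValueError.

    Hypothetical helper that mirrors the set logic of the check above.
    """
    missing = set(expected_modules) - set(init_kwargs)
    tolerated = set(passed_class_obj) | set(optional_modules)

    if missing and missing <= tolerated:
        # Every gap was either passed by the caller (possibly as None) or is optional.
        filled = dict(init_kwargs)
        for name in missing:
            filled[name] = passed_class_obj.get(name)
        return filled

    if missing:
        raise ValueError(f"expected {sorted(expected_modules)}, missing {sorted(missing)}")

    return dict(init_kwargs)


# Placeholder strings stand in for loaded components.
filled = resolve_missing_modules(
    expected_modules={"unet", "vae", "safety_checker"},
    init_kwargs={"unet": "unet-obj", "vae": "vae-obj"},
    passed_class_obj={},
    optional_modules=["safety_checker"],
)
```

In this sketch an optional component that was never loaded ends up as `None`, while a required component that is missing and was not passed by the caller raises an error.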
@@ -0,0 +1,440 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import importlib
+import inspect
+import re
+from contextlib import nullcontext
+from typing import Optional
+
+import torch
+from huggingface_hub.utils import validate_hf_hub_args
+from typing_extensions import Self
+
+from ... import __version__
+from ...quantizers import DiffusersAutoQuantizer
+from ...utils import deprecate, is_accelerate_available, logging
+from .single_file_utils import (
+    SingleFileComponentError,
+    convert_animatediff_checkpoint_to_diffusers,
+    convert_auraflow_transformer_checkpoint_to_diffusers,
+    convert_autoencoder_dc_checkpoint_to_diffusers,
+    convert_controlnet_checkpoint,
+    convert_flux_transformer_checkpoint_to_diffusers,
+    convert_hunyuan_video_transformer_to_diffusers,
+    convert_ldm_unet_checkpoint,
+    convert_ldm_vae_checkpoint,
+    convert_ltx_transformer_checkpoint_to_diffusers,
+    convert_ltx_vae_checkpoint_to_diffusers,
+    convert_lumina2_to_diffusers,
+    convert_mochi_transformer_checkpoint_to_diffusers,
+    convert_sana_transformer_to_diffusers,
+    convert_sd3_transformer_checkpoint_to_diffusers,
+    convert_stable_cascade_unet_single_file_to_diffusers,
+    convert_wan_transformer_to_diffusers,
+    convert_wan_vae_to_diffusers,
+    create_controlnet_diffusers_config_from_ldm,
+    create_unet_diffusers_config_from_ldm,
+    create_vae_diffusers_config_from_ldm,
+    fetch_diffusers_config,
+    fetch_original_config,
+    load_single_file_checkpoint,
+)
+
+
+logger = logging.get_logger(__name__)
+
+
+if is_accelerate_available():
+    from accelerate import dispatch_model, init_empty_weights
+
+    from ...models.modeling_utils import load_model_dict_into_meta
+
+
+SINGLE_FILE_LOADABLE_CLASSES = {
+    "StableCascadeUNet": {
+        "checkpoint_mapping_fn": convert_stable_cascade_unet_single_file_to_diffusers,
+    },
+    "UNet2DConditionModel": {
+        "checkpoint_mapping_fn": convert_ldm_unet_checkpoint,
+        "config_mapping_fn": create_unet_diffusers_config_from_ldm,
+        "default_subfolder": "unet",
+        "legacy_kwargs": {
+            "num_in_channels": "in_channels",  # Legacy kwargs supported by `from_single_file` mapped to new args
+        },
+    },
+    "AutoencoderKL": {
+        "checkpoint_mapping_fn": convert_ldm_vae_checkpoint,
+        "config_mapping_fn": create_vae_diffusers_config_from_ldm,
+        "default_subfolder": "vae",
+    },
+    "ControlNetModel": {
+        "checkpoint_mapping_fn": convert_controlnet_checkpoint,
+        "config_mapping_fn": create_controlnet_diffusers_config_from_ldm,
+    },
+    "SD3Transformer2DModel": {
+        "checkpoint_mapping_fn": convert_sd3_transformer_checkpoint_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "MotionAdapter": {
+        "checkpoint_mapping_fn": convert_animatediff_checkpoint_to_diffusers,
+    },
+    "SparseControlNetModel": {
+        "checkpoint_mapping_fn": convert_animatediff_checkpoint_to_diffusers,
+    },
+    "FluxTransformer2DModel": {
+        "checkpoint_mapping_fn": convert_flux_transformer_checkpoint_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "LTXVideoTransformer3DModel": {
+        "checkpoint_mapping_fn": convert_ltx_transformer_checkpoint_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "AutoencoderKLLTXVideo": {
+        "checkpoint_mapping_fn": convert_ltx_vae_checkpoint_to_diffusers,
+        "default_subfolder": "vae",
+    },
+    "AutoencoderDC": {"checkpoint_mapping_fn": convert_autoencoder_dc_checkpoint_to_diffusers},
+    "MochiTransformer3DModel": {
+        "checkpoint_mapping_fn": convert_mochi_transformer_checkpoint_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "HunyuanVideoTransformer3DModel": {
+        "checkpoint_mapping_fn": convert_hunyuan_video_transformer_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "AuraFlowTransformer2DModel": {
+        "checkpoint_mapping_fn": convert_auraflow_transformer_checkpoint_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "Lumina2Transformer2DModel": {
+        "checkpoint_mapping_fn": convert_lumina2_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "SanaTransformer2DModel": {
+        "checkpoint_mapping_fn": convert_sana_transformer_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "WanTransformer3DModel": {
+        "checkpoint_mapping_fn": convert_wan_transformer_to_diffusers,
+        "default_subfolder": "transformer",
+    },
+    "AutoencoderKLWan": {
+        "checkpoint_mapping_fn": convert_wan_vae_to_diffusers,
+        "default_subfolder": "vae",
+    },
+}
+
+
+def _get_single_file_loadable_mapping_class(cls):
+    diffusers_module = importlib.import_module(__name__.split(".")[0])
+    for loadable_class_str in SINGLE_FILE_LOADABLE_CLASSES:
+        loadable_class = getattr(diffusers_module, loadable_class_str)
+
+        if issubclass(cls, loadable_class):
+            return loadable_class_str
+
+    return None
+
+
+def _get_mapping_function_kwargs(mapping_fn, **kwargs):
+    parameters = inspect.signature(mapping_fn).parameters
+
+    mapping_kwargs = {}
+    for parameter in parameters:
+        if parameter in kwargs:
+            mapping_kwargs[parameter] = kwargs[parameter]
+
+    return mapping_kwargs
+
+
+class FromOriginalModelMixin:
+    """
+    Load pretrained weights saved in the `.ckpt` or `.safetensors` format into a model.
+    """
+
+    @classmethod
+    @validate_hf_hub_args
+    def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] = None, **kwargs) -> Self:
+        r"""
+        Instantiate a model from pretrained weights saved in the original `.ckpt` or `.safetensors` format. The model
+        is set in evaluation mode (`model.eval()`) by default.
+
+        Parameters:
+            pretrained_model_link_or_path_or_dict (`str` or `dict`, *optional*):
+                Can be either:
+                    - A link to the `.safetensors` or `.ckpt` file (for example
+                      `"https://huggingface.co/<repo_id>/blob/main/<path_to_file>.safetensors"`) on the Hub.
+                    - A path to a local *file* containing the weights of the component model.
+                    - A state dict containing the component model weights.
+            config (`str`, *optional*):
+                - A string, the *repo id* (for example `CompVis/ldm-text2im-large-256`) of a pretrained pipeline hosted
+                  on the Hub.
+                - A path to a *directory* (for example `./my_pipeline_directory/`) containing the pipeline component
+                  configs in Diffusers format.
+            subfolder (`str`, *optional*, defaults to `""`):
+                The subfolder location of a model file within a larger model repository on the Hub or locally.
+            original_config (`str`, *optional*):
+                Dict or path to a yaml file containing the configuration for the model in its original format. If a
+                dict is provided, it will be used to initialize the model configuration.
+            torch_dtype (`str` or `torch.dtype`, *optional*):
+                Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the
+                dtype is automatically derived from the model's weights.
+            force_download (`bool`, *optional*, defaults to `False`):
+                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
+                cached versions if they exist.
+            cache_dir (`Union[str, os.PathLike]`, *optional*):
+                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
+                is not used.
+
+            proxies (`Dict[str, str]`, *optional*):
+                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
+                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
+            local_files_only (`bool`, *optional*, defaults to `False`):
+                Whether to only load local model weights and configuration files or not. If set to `True`, the model
+                won't be downloaded from the Hub.
+            token (`str` or *bool*, *optional*):
+                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
+                `diffusers-cli login` (stored in `~/.huggingface`) is used.
+            revision (`str`, *optional*, defaults to `"main"`):
+                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
+                allowed by Git.
+            disable_mmap (`bool`, *optional*, defaults to `False`):
+                Whether to disable mmap when loading a Safetensors model. This option can perform better when the model
+                is on a network mount or hard drive, which may not handle the seeky-ness of mmap very well.
+            kwargs (remaining dictionary of keyword arguments, *optional*):
+                Can be used to overwrite load and saveable variables (for example the pipeline components of the
+                specific pipeline class). The overwritten components are directly passed to the pipeline's `__init__`
+                method. See example below for more information.
+
+        ```py
+        >>> from diffusers import StableCascadeUNet
+
+        >>> ckpt_path = "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_lite.safetensors"
+        >>> model = StableCascadeUNet.from_single_file(ckpt_path)
+        ```
+        """
+
+        mapping_class_name = _get_single_file_loadable_mapping_class(cls)
+        # if class_name not in SINGLE_FILE_LOADABLE_CLASSES:
+        if mapping_class_name is None:
+            raise ValueError(
+                f"FromOriginalModelMixin is currently only compatible with {', '.join(SINGLE_FILE_LOADABLE_CLASSES.keys())}"
+            )
+
+        pretrained_model_link_or_path = kwargs.get("pretrained_model_link_or_path", None)
+        if pretrained_model_link_or_path is not None:
+            deprecation_message = (
+                "Please use `pretrained_model_link_or_path_or_dict` argument instead for model classes"
+            )
+            deprecate("pretrained_model_link_or_path", "1.0.0", deprecation_message)
+            pretrained_model_link_or_path_or_dict = pretrained_model_link_or_path
+
+        config = kwargs.pop("config", None)
+        original_config = kwargs.pop("original_config", None)
+
+        if config is not None and original_config is not None:
+            raise ValueError(
+                "`from_single_file` cannot accept both `config` and `original_config` arguments. Please provide only one of these arguments"
+            )
+
+        force_download = kwargs.pop("force_download", False)
+        proxies = kwargs.pop("proxies", None)
+        token = kwargs.pop("token", None)
+        cache_dir = kwargs.pop("cache_dir", None)
+        local_files_only = kwargs.pop("local_files_only", None)
+        subfolder = kwargs.pop("subfolder", None)
+        revision = kwargs.pop("revision", None)
+        config_revision = kwargs.pop("config_revision", None)
+        torch_dtype = kwargs.pop("torch_dtype", None)
+        quantization_config = kwargs.pop("quantization_config", None)
+        device = kwargs.pop("device", None)
+        disable_mmap = kwargs.pop("disable_mmap", False)
+
+        user_agent = {"diffusers": __version__, "file_type": "single_file", "framework": "pytorch"}
+        # In order to ensure popular quantization methods are supported. Can be disabled with `disable_telemetry`
+        if quantization_config is not None:
+            user_agent["quant"] = quantization_config.quant_method.value
+
+        if torch_dtype is not None and not isinstance(torch_dtype, torch.dtype):
+            # Warn before overwriting so the message reports the value that was actually passed.
+            logger.warning(
+                f"Passed `torch_dtype` {torch_dtype} is not a `torch.dtype`. Defaulting to `torch.float32`."
+            )
+            torch_dtype = torch.float32
+
+        if isinstance(pretrained_model_link_or_path_or_dict, dict):
+            checkpoint = pretrained_model_link_or_path_or_dict
+        else:
+            checkpoint = load_single_file_checkpoint(
+                pretrained_model_link_or_path_or_dict,
+                force_download=force_download,
+                proxies=proxies,
+                token=token,
+                cache_dir=cache_dir,
+                local_files_only=local_files_only,
+                revision=revision,
+                disable_mmap=disable_mmap,
+                user_agent=user_agent,
+            )
+        if quantization_config is not None:
+            hf_quantizer = DiffusersAutoQuantizer.from_config(quantization_config)
+            hf_quantizer.validate_environment()
+            torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
+
+        else:
+            hf_quantizer = None
+
+        mapping_functions = SINGLE_FILE_LOADABLE_CLASSES[mapping_class_name]
+
+        checkpoint_mapping_fn = mapping_functions["checkpoint_mapping_fn"]
+        if original_config is not None:
+            if "config_mapping_fn" in mapping_functions:
+                config_mapping_fn = mapping_functions["config_mapping_fn"]
+            else:
+                config_mapping_fn = None
+
+            if config_mapping_fn is None:
+                raise ValueError(
+                    (
+                        f"`original_config` has been provided for {mapping_class_name} but no mapping function "
+                        "was found to convert the original config to a Diffusers config in "
+                        "`diffusers.loaders.single_file_utils`"
+                    )
+                )
+
+            if isinstance(original_config, str):
+                # If original_config is a URL or filepath fetch the original_config dict
+                original_config = fetch_original_config(original_config, local_files_only=local_files_only)
+
+            config_mapping_kwargs = _get_mapping_function_kwargs(config_mapping_fn, **kwargs)
+            diffusers_model_config = config_mapping_fn(
+                original_config=original_config, checkpoint=checkpoint, **config_mapping_kwargs
+            )
+        else:
+            if config is not None:
+                if isinstance(config, str):
+                    default_pretrained_model_config_name = config
+                else:
+                    raise ValueError(
+                        (
+                            "Invalid `config` argument. Please provide a string representing a repo id "
+                            "or path to a local Diffusers model repo."
+                        )
+                    )
+
+            else:
+                config = fetch_diffusers_config(checkpoint)
+                default_pretrained_model_config_name = config["pretrained_model_name_or_path"]
+
+                if "default_subfolder" in mapping_functions:
+                    subfolder = mapping_functions["default_subfolder"]
+
+                subfolder = subfolder or config.pop(
+                    "subfolder", None
+                )  # some configs contain a subfolder key, e.g. StableCascadeUNet
+
+            diffusers_model_config = cls.load_config(
+                pretrained_model_name_or_path=default_pretrained_model_config_name,
+                subfolder=subfolder,
+                local_files_only=local_files_only,
+                token=token,
+                revision=config_revision,
+            )
+            expected_kwargs, optional_kwargs = cls._get_signature_keys(cls)
+
+            # Map legacy kwargs to new kwargs
+            if "legacy_kwargs" in mapping_functions:
+                legacy_kwargs = mapping_functions["legacy_kwargs"]
+                for legacy_key, new_key in legacy_kwargs.items():
+                    if legacy_key in kwargs:
+                        kwargs[new_key] = kwargs.pop(legacy_key)
+
+            model_kwargs = {k: kwargs.get(k) for k in kwargs if k in expected_kwargs or k in optional_kwargs}
+            diffusers_model_config.update(model_kwargs)
+
+        checkpoint_mapping_kwargs = _get_mapping_function_kwargs(checkpoint_mapping_fn, **kwargs)
+        diffusers_format_checkpoint = checkpoint_mapping_fn(
+            config=diffusers_model_config, checkpoint=checkpoint, **checkpoint_mapping_kwargs
+        )
+        if not diffusers_format_checkpoint:
+            raise SingleFileComponentError(
+                f"Failed to load {mapping_class_name}. Weights for this component appear to be missing in the checkpoint."
+            )
+
+        ctx = init_empty_weights if is_accelerate_available() else nullcontext
+        with ctx():
+            model = cls.from_config(diffusers_model_config)
+
+        # Check if `_keep_in_fp32_modules` is not None
+        use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and (
+            (torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules")
+        )
+        if use_keep_in_fp32_modules:
+            keep_in_fp32_modules = cls._keep_in_fp32_modules
+            if not isinstance(keep_in_fp32_modules, list):
+                keep_in_fp32_modules = [keep_in_fp32_modules]
+
+        else:
+            keep_in_fp32_modules = []
+
+        if hf_quantizer is not None:
+            hf_quantizer.preprocess_model(
+                model=model,
+                device_map=None,
+                state_dict=diffusers_format_checkpoint,
+                keep_in_fp32_modules=keep_in_fp32_modules,
+            )
+
+        device_map = None
+        if is_accelerate_available():
+            param_device = torch.device(device) if device else torch.device("cpu")
+            empty_state_dict = model.state_dict()
+            unexpected_keys = [
+                param_name for param_name in diffusers_format_checkpoint if param_name not in empty_state_dict
+            ]
+            device_map = {"": param_device}
+            load_model_dict_into_meta(
+                model,
+                diffusers_format_checkpoint,
+                dtype=torch_dtype,
+                device_map=device_map,
+                hf_quantizer=hf_quantizer,
+                keep_in_fp32_modules=keep_in_fp32_modules,
+                unexpected_keys=unexpected_keys,
+            )
+        else:
+            _, unexpected_keys = model.load_state_dict(diffusers_format_checkpoint, strict=False)
+
+        if model._keys_to_ignore_on_load_unexpected is not None:
+            for pat in model._keys_to_ignore_on_load_unexpected:
+                unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
+
+        if len(unexpected_keys) > 0:
+            logger.warning(
+                f"Some weights of the model checkpoint were not used when initializing {cls.__name__}: \n {', '.join(unexpected_keys)}"
+            )
+
+        if hf_quantizer is not None:
+            hf_quantizer.postprocess_model(model)
+            model.hf_quantizer = hf_quantizer
+
+        if torch_dtype is not None and hf_quantizer is None:
+            model.to(torch_dtype)
+
+        model.eval()
+
+        if device_map is not None:
+            device_map_kwargs = {"device_map": device_map}
+            dispatch_model(model, **device_map_kwargs)
+
+        return model
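Review note: `from_single_file` passes the user's remaining `kwargs` through `_get_mapping_function_kwargs` before calling the checkpoint mapping function, so only keyword arguments that the mapping function declares are forwarded. The sketch below illustrates that filtering in isolation. `convert_example_checkpoint` and its `prefix` parameter are hypothetical stand-ins, not functions from `single_file_utils`.

```python
import inspect


def get_mapping_function_kwargs(mapping_fn, **kwargs):
    # Forward only the keyword arguments that appear in the mapping function's signature.
    parameters = inspect.signature(mapping_fn).parameters
    return {name: kwargs[name] for name in parameters if name in kwargs}


def convert_example_checkpoint(checkpoint, config=None, prefix=""):
    # Hypothetical converter: renames every key by prepending `prefix`.
    return {prefix + key: value for key, value in checkpoint.items()}


# `unrelated_option` is not a parameter of the converter, so it is dropped.
user_kwargs = {"prefix": "model.", "unrelated_option": True}
forwarded = get_mapping_function_kwargs(convert_example_checkpoint, **user_kwargs)
converted = convert_example_checkpoint({"w": 1}, config={}, **forwarded)
```

Filtering by signature lets one `from_single_file` entry point serve converters with different optional arguments without each converter having to accept `**kwargs`.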
@@ -11,430 +11,17 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import importlib
-import inspect
-import re
-from contextlib import nullcontext
-from typing import Optional

-import torch
-from huggingface_hub.utils import validate_hf_hub_args
-from typing_extensions import Self

-from .. import __version__
-from ..quantizers import DiffusersAutoQuantizer
-from ..utils import deprecate, is_accelerate_available, logging
-from .single_file_utils import (
-    SingleFileComponentError,
-    convert_animatediff_checkpoint_to_diffusers,
-    convert_auraflow_transformer_checkpoint_to_diffusers,
-    convert_autoencoder_dc_checkpoint_to_diffusers,
-    convert_controlnet_checkpoint,
-    convert_flux_transformer_checkpoint_to_diffusers,
-    convert_hunyuan_video_transformer_to_diffusers,
-    convert_ldm_unet_checkpoint,
-    convert_ldm_vae_checkpoint,
-    convert_ltx_transformer_checkpoint_to_diffusers,
-    convert_ltx_vae_checkpoint_to_diffusers,
-    convert_lumina2_to_diffusers,
-    convert_mochi_transformer_checkpoint_to_diffusers,
-    convert_sana_transformer_to_diffusers,
-    convert_sd3_transformer_checkpoint_to_diffusers,
-    convert_stable_cascade_unet_single_file_to_diffusers,
-    convert_wan_transformer_to_diffusers,
-    convert_wan_vae_to_diffusers,
-    create_controlnet_diffusers_config_from_ldm,
-    create_unet_diffusers_config_from_ldm,
-    create_vae_diffusers_config_from_ldm,
-    fetch_diffusers_config,
-    fetch_original_config,
-    load_single_file_checkpoint,
+from ..utils import deprecate
+from .single_file.single_file_model import (
+    SINGLE_FILE_LOADABLE_CLASSES,  # noqa: F401
+    FromOriginalModelMixin,
 )


-logger = logging.get_logger(__name__)
-
-
-if is_accelerate_available():
-    from accelerate import dispatch_model, init_empty_weights
-
-    from ..models.modeling_utils import load_model_dict_into_meta
-
-
-SINGLE_FILE_LOADABLE_CLASSES = {
-    "StableCascadeUNet": {
-        "checkpoint_mapping_fn": convert_stable_cascade_unet_single_file_to_diffusers,
-    },
-    "UNet2DConditionModel": {
-        "checkpoint_mapping_fn": convert_ldm_unet_checkpoint,
-        "config_mapping_fn": create_unet_diffusers_config_from_ldm,
-        "default_subfolder": "unet",
-        "legacy_kwargs": {
-            "num_in_channels": "in_channels",  # Legacy kwargs supported by `from_single_file` mapped to new args
-        },
-    },
-    "AutoencoderKL": {
-        "checkpoint_mapping_fn": convert_ldm_vae_checkpoint,
-        "config_mapping_fn": create_vae_diffusers_config_from_ldm,
-        "default_subfolder": "vae",
-    },
-    "ControlNetModel": {
-        "checkpoint_mapping_fn": convert_controlnet_checkpoint,
-        "config_mapping_fn": create_controlnet_diffusers_config_from_ldm,
-    },
-    "SD3Transformer2DModel": {
-        "checkpoint_mapping_fn": convert_sd3_transformer_checkpoint_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "MotionAdapter": {
-        "checkpoint_mapping_fn": convert_animatediff_checkpoint_to_diffusers,
-    },
-    "SparseControlNetModel": {
-        "checkpoint_mapping_fn": convert_animatediff_checkpoint_to_diffusers,
-    },
-    "FluxTransformer2DModel": {
-        "checkpoint_mapping_fn": convert_flux_transformer_checkpoint_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "LTXVideoTransformer3DModel": {
-        "checkpoint_mapping_fn": convert_ltx_transformer_checkpoint_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "AutoencoderKLLTXVideo": {
-        "checkpoint_mapping_fn": convert_ltx_vae_checkpoint_to_diffusers,
-        "default_subfolder": "vae",
-    },
-    "AutoencoderDC": {"checkpoint_mapping_fn": convert_autoencoder_dc_checkpoint_to_diffusers},
-    "MochiTransformer3DModel": {
-        "checkpoint_mapping_fn": convert_mochi_transformer_checkpoint_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "HunyuanVideoTransformer3DModel": {
-        "checkpoint_mapping_fn": convert_hunyuan_video_transformer_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "AuraFlowTransformer2DModel": {
-        "checkpoint_mapping_fn": convert_auraflow_transformer_checkpoint_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "Lumina2Transformer2DModel": {
-        "checkpoint_mapping_fn": convert_lumina2_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "SanaTransformer2DModel": {
-        "checkpoint_mapping_fn": convert_sana_transformer_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "WanTransformer3DModel": {
-        "checkpoint_mapping_fn": convert_wan_transformer_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "AutoencoderKLWan": {
-        "checkpoint_mapping_fn": convert_wan_vae_to_diffusers,
-        "default_subfolder": "vae",
-    },
-}
-
-
-def _get_single_file_loadable_mapping_class(cls):
-    diffusers_module = importlib.import_module(__name__.split(".")[0])
-    for loadable_class_str in SINGLE_FILE_LOADABLE_CLASSES:
-        loadable_class = getattr(diffusers_module, loadable_class_str)
-
-        if issubclass(cls, loadable_class):
-            return loadable_class_str
-
-    return None
-
-
-def _get_mapping_function_kwargs(mapping_fn, **kwargs):
-    parameters = inspect.signature(mapping_fn).parameters
-
-    mapping_kwargs = {}
-    for parameter in parameters:
-        if parameter in kwargs:
-            mapping_kwargs[parameter] = kwargs[parameter]
-
-    return mapping_kwargs
-
-
-class FromOriginalModelMixin:
-    """
-    Load pretrained weights saved in the `.ckpt` or `.safetensors` format into a model.
-    """
-
-    @classmethod
-    @validate_hf_hub_args
-    def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] = None, **kwargs) -> Self:
-        r"""
-        Instantiate a model from pretrained weights saved in the original `.ckpt` or `.safetensors` format. The model
-        is set in evaluation mode (`model.eval()`) by default.
-
-        Parameters:
-            pretrained_model_link_or_path_or_dict (`str`, *optional*):
-                Can be either:
-                    - A link to the `.safetensors` or `.ckpt` file (for example
-                      `"https://huggingface.co/<repo_id>/blob/main/<path_to_file>.safetensors"`) on the Hub.
-                    - A path to a local *file* containing the weights of the component model.
-                    - A state dict containing the component model weights.
-            config (`str`, *optional*):
-                - A string, the *repo id* (for example `CompVis/ldm-text2im-large-256`) of a pretrained pipeline hosted
-                  on the Hub.
-                - A path to a *directory* (for example `./my_pipeline_directory/`) containing the pipeline component
-                  configs in Diffusers format.
-            subfolder (`str`, *optional*, defaults to `""`):
-                The subfolder location of a model file within a larger model repository on the Hub or locally.
-            original_config (`str`, *optional*):
-                Dict or path to a yaml file containing the configuration for the model in its original format.
-                    If a dict is provided, it will be used to initialize the model configuration.
-            torch_dtype (`str` or `torch.dtype`, *optional*):
-                Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the
-                dtype is automatically derived from the model's weights.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to True, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            disable_mmap ('bool', *optional*, defaults to 'False'):
-                Whether to disable mmap when loading a Safetensors model. This option can perform better when the model
-                is on a network mount or hard drive, which may not handle the seeky-ness of mmap very well.
-            kwargs (remaining dictionary of keyword arguments, *optional*):
-                Can be used to overwrite load and saveable variables (for example the pipeline components of the
-                specific pipeline class). The overwritten components are directly passed to the pipelines `__init__`
-                method. See example below for more information.
-
-        ```py
-        >>> from diffusers import StableCascadeUNet
-
-        >>> ckpt_path = "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_lite.safetensors"
-        >>> model = StableCascadeUNet.from_single_file(ckpt_path)
-        ```
-        """
-
-        mapping_class_name = _get_single_file_loadable_mapping_class(cls)
-        # if class_name not in SINGLE_FILE_LOADABLE_CLASSES:
-        if mapping_class_name is None:
-            raise ValueError(
-                f"FromOriginalModelMixin is currently only compatible with {', '.join(SINGLE_FILE_LOADABLE_CLASSES.keys())}"
-            )
-
-        pretrained_model_link_or_path = kwargs.get("pretrained_model_link_or_path", None)
-        if pretrained_model_link_or_path is not None:
-            deprecation_message = (
-                "Please use `pretrained_model_link_or_path_or_dict` argument instead for model classes"
-            )
-            deprecate("pretrained_model_link_or_path", "1.0.0", deprecation_message)
-            pretrained_model_link_or_path_or_dict = pretrained_model_link_or_path
-
-        config = kwargs.pop("config", None)
-        original_config = kwargs.pop("original_config", None)
-
-        if config is not None and original_config is not None:
-            raise ValueError(
-                "`from_single_file` cannot accept both `config` and `original_config` arguments. Please provide only one of these arguments"
-            )
-
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        token = kwargs.pop("token", None)
-        cache_dir = kwargs.pop("cache_dir", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        subfolder = kwargs.pop("subfolder", None)
-        revision = kwargs.pop("revision", None)
-        config_revision = kwargs.pop("config_revision", None)
-        torch_dtype = kwargs.pop("torch_dtype", None)
-        quantization_config = kwargs.pop("quantization_config", None)
-        device = kwargs.pop("device", None)
-        disable_mmap = kwargs.pop("disable_mmap", False)
-
-        user_agent = {"diffusers": __version__, "file_type": "single_file", "framework": "pytorch"}
-        # In order to ensure popular quantization methods are supported. Can be disable with `disable_telemetry`
-        if quantization_config is not None:
-            user_agent["quant"] = quantization_config.quant_method.value
-
-        if torch_dtype is not None and not isinstance(torch_dtype, torch.dtype):
-            torch_dtype = torch.float32
-            logger.warning(
-                f"Passed `torch_dtype` {torch_dtype} is not a `torch.dtype`. Defaulting to `torch.float32`."
-            )
-
-        if isinstance(pretrained_model_link_or_path_or_dict, dict):
-            checkpoint = pretrained_model_link_or_path_or_dict
-        else:
-            checkpoint = load_single_file_checkpoint(
-                pretrained_model_link_or_path_or_dict,
-                force_download=force_download,
-                proxies=proxies,
-                token=token,
-                cache_dir=cache_dir,
-                local_files_only=local_files_only,
-                revision=revision,
-                disable_mmap=disable_mmap,
-                user_agent=user_agent,
-            )
-        if quantization_config is not None:
-            hf_quantizer = DiffusersAutoQuantizer.from_config(quantization_config)
-            hf_quantizer.validate_environment()
-            torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
-
-        else:
-            hf_quantizer = None
-
-        mapping_functions = SINGLE_FILE_LOADABLE_CLASSES[mapping_class_name]
-
-        checkpoint_mapping_fn = mapping_functions["checkpoint_mapping_fn"]
-        if original_config is not None:
-            if "config_mapping_fn" in mapping_functions:
-                config_mapping_fn = mapping_functions["config_mapping_fn"]
-            else:
-                config_mapping_fn = None
-
-            if config_mapping_fn is None:
-                raise ValueError(
-                    (
-                        f"`original_config` has been provided for {mapping_class_name} but no mapping function"
-                        "was found to convert the original config to a Diffusers config in"
-                        "`diffusers.loaders.single_file_utils`"
-                    )
-                )
-
-            if isinstance(original_config, str):
-                # If original_config is a URL or filepath fetch the original_config dict
-                original_config = fetch_original_config(original_config, local_files_only=local_files_only)
-
-            config_mapping_kwargs = _get_mapping_function_kwargs(config_mapping_fn, **kwargs)
-            diffusers_model_config = config_mapping_fn(
-                original_config=original_config, checkpoint=checkpoint, **config_mapping_kwargs
-            )
-        else:
-            if config is not None:
-                if isinstance(config, str):
-                    default_pretrained_model_config_name = config
-                else:
-                    raise ValueError(
-                        (
-                            "Invalid `config` argument. Please provide a string representing a repo id"
-                            "or path to a local Diffusers model repo."
-                        )
-                    )
-
-            else:
-                config = fetch_diffusers_config(checkpoint)
-                default_pretrained_model_config_name = config["pretrained_model_name_or_path"]
-
-                if "default_subfolder" in mapping_functions:
-                    subfolder = mapping_functions["default_subfolder"]
-
-                subfolder = subfolder or config.pop(
-                    "subfolder", None
-                )  # some configs contain a subfolder key, e.g. StableCascadeUNet
-
-            diffusers_model_config = cls.load_config(
-                pretrained_model_name_or_path=default_pretrained_model_config_name,
-                subfolder=subfolder,
-                local_files_only=local_files_only,
-                token=token,
-                revision=config_revision,
-            )
-            expected_kwargs, optional_kwargs = cls._get_signature_keys(cls)
-
-            # Map legacy kwargs to new kwargs
-            if "legacy_kwargs" in mapping_functions:
-                legacy_kwargs = mapping_functions["legacy_kwargs"]
-                for legacy_key, new_key in legacy_kwargs.items():
-                    if legacy_key in kwargs:
-                        kwargs[new_key] = kwargs.pop(legacy_key)
-
-            model_kwargs = {k: kwargs.get(k) for k in kwargs if k in expected_kwargs or k in optional_kwargs}
-            diffusers_model_config.update(model_kwargs)
-
-        checkpoint_mapping_kwargs = _get_mapping_function_kwargs(checkpoint_mapping_fn, **kwargs)
-        diffusers_format_checkpoint = checkpoint_mapping_fn(
-            config=diffusers_model_config, checkpoint=checkpoint, **checkpoint_mapping_kwargs
-        )
-        if not diffusers_format_checkpoint:
-            raise SingleFileComponentError(
-                f"Failed to load {mapping_class_name}. Weights for this component appear to be missing in the checkpoint."
-            )
-
-        ctx = init_empty_weights if is_accelerate_available() else nullcontext
-        with ctx():
-            model = cls.from_config(diffusers_model_config)
-
-        # Check if `_keep_in_fp32_modules` is not None
-        use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and (
-            (torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules")
-        )
-        if use_keep_in_fp32_modules:
-            keep_in_fp32_modules = cls._keep_in_fp32_modules
-            if not isinstance(keep_in_fp32_modules, list):
-                keep_in_fp32_modules = [keep_in_fp32_modules]
-
-        else:
-            keep_in_fp32_modules = []
-
-        if hf_quantizer is not None:
-            hf_quantizer.preprocess_model(
-                model=model,
-                device_map=None,
-                state_dict=diffusers_format_checkpoint,
-                keep_in_fp32_modules=keep_in_fp32_modules,
-            )
-
-        device_map = None
-        if is_accelerate_available():
-            param_device = torch.device(device) if device else torch.device("cpu")
-            empty_state_dict = model.state_dict()
-            unexpected_keys = [
-                param_name for param_name in diffusers_format_checkpoint if param_name not in empty_state_dict
-            ]
-            device_map = {"": param_device}
-            load_model_dict_into_meta(
-                model,
-                diffusers_format_checkpoint,
-                dtype=torch_dtype,
-                device_map=device_map,
-                hf_quantizer=hf_quantizer,
-                keep_in_fp32_modules=keep_in_fp32_modules,
-                unexpected_keys=unexpected_keys,
-            )
-        else:
-            _, unexpected_keys = model.load_state_dict(diffusers_format_checkpoint, strict=False)
-
-        if model._keys_to_ignore_on_load_unexpected is not None:
-            for pat in model._keys_to_ignore_on_load_unexpected:
-                unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
-
-        if len(unexpected_keys) > 0:
-            logger.warning(
-                f"Some weights of the model checkpoint were not used when initializing {cls.__name__}: \n {[', '.join(unexpected_keys)]}"
-            )
-
-        if hf_quantizer is not None:
-            hf_quantizer.postprocess_model(model)
-            model.hf_quantizer = hf_quantizer
-
-        if torch_dtype is not None and hf_quantizer is None:
-            model.to(torch_dtype)
-
-        model.eval()
-
-        if device_map is not None:
-            device_map_kwargs = {"device_map": device_map}
-            dispatch_model(model, **device_map_kwargs)
-
-        return model
+class FromOriginalModelMixin(FromOriginalModelMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `FromOriginalModelMixin` from diffusers.loaders.single_file_model has been deprecated. Please use `from diffusers.loaders.single_file.single_file_model import FromOriginalModelMixin` instead."
+        deprecate("diffusers.loaders.single_file_model.FromOriginalModelMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)
@@ -11,170 +11,13 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from contextlib import nullcontext

-from ..models.embeddings import (
-    ImageProjection,
-    MultiIPAdapterImageProjection,
-)
-from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
-from ..utils import (
-    is_accelerate_available,
-    is_torch_version,
-    logging,
-)
+from ..utils import deprecate
+from .ip_adapter.transformer_flux import FluxTransformer2DLoadersMixin


-if is_accelerate_available():
-    pass
-
-logger = logging.get_logger(__name__)
-
-
-class FluxTransformer2DLoadersMixin:
-    """
-    Load layers into a [`FluxTransformer2DModel`].
-    """
-
-    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
-        if low_cpu_mem_usage:
-            if is_accelerate_available():
-                from accelerate import init_empty_weights
-
-            else:
-                low_cpu_mem_usage = False
-                logger.warning(
-                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                    " install accelerate\n```\n."
-                )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        updated_state_dict = {}
-        image_projection = None
-        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
-
-        if "proj.weight" in state_dict:
-            # IP-Adapter
-            num_image_text_embeds = 4
-            if state_dict["proj.weight"].shape[0] == 65536:
-                num_image_text_embeds = 16
-            clip_embeddings_dim = state_dict["proj.weight"].shape[-1]
-            cross_attention_dim = state_dict["proj.weight"].shape[0] // num_image_text_embeds
-
-            with init_context():
-                image_projection = ImageProjection(
-                    cross_attention_dim=cross_attention_dim,
-                    image_embed_dim=clip_embeddings_dim,
-                    num_image_text_embeds=num_image_text_embeds,
-                )
-
-            for key, value in state_dict.items():
-                diffusers_name = key.replace("proj", "image_embeds")
-                updated_state_dict[diffusers_name] = value
-
-        if not low_cpu_mem_usage:
-            image_projection.load_state_dict(updated_state_dict, strict=True)
-        else:
-            device_map = {"": self.device}
-            load_model_dict_into_meta(image_projection, updated_state_dict, device_map=device_map, dtype=self.dtype)
-
-        return image_projection
-
-    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
-        from ..models.attention_processor import (
-            FluxIPAdapterJointAttnProcessor2_0,
-        )
-
-        if low_cpu_mem_usage:
-            if is_accelerate_available():
-                from accelerate import init_empty_weights
-
-            else:
-                low_cpu_mem_usage = False
-                logger.warning(
-                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                    " install accelerate\n```\n."
-                )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        # set ip-adapter cross-attention processors & load state_dict
-        attn_procs = {}
-        key_id = 0
-        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
-        for name in self.attn_processors.keys():
-            if name.startswith("single_transformer_blocks"):
-                attn_processor_class = self.attn_processors[name].__class__
-                attn_procs[name] = attn_processor_class()
-            else:
-                cross_attention_dim = self.config.joint_attention_dim
-                hidden_size = self.inner_dim
-                attn_processor_class = FluxIPAdapterJointAttnProcessor2_0
-                num_image_text_embeds = []
-                for state_dict in state_dicts:
-                    if "proj.weight" in state_dict["image_proj"]:
-                        num_image_text_embed = 4
-                        if state_dict["image_proj"]["proj.weight"].shape[0] == 65536:
-                            num_image_text_embed = 16
-                        # IP-Adapter
-                        num_image_text_embeds += [num_image_text_embed]
-
-                with init_context():
-                    attn_procs[name] = attn_processor_class(
-                        hidden_size=hidden_size,
-                        cross_attention_dim=cross_attention_dim,
-                        scale=1.0,
-                        num_tokens=num_image_text_embeds,
-                        dtype=self.dtype,
-                        device=self.device,
-                    )
-
-                value_dict = {}
-                for i, state_dict in enumerate(state_dicts):
-                    value_dict.update({f"to_k_ip.{i}.weight": state_dict["ip_adapter"][f"{key_id}.to_k_ip.weight"]})
-                    value_dict.update({f"to_v_ip.{i}.weight": state_dict["ip_adapter"][f"{key_id}.to_v_ip.weight"]})
-                    value_dict.update({f"to_k_ip.{i}.bias": state_dict["ip_adapter"][f"{key_id}.to_k_ip.bias"]})
-                    value_dict.update({f"to_v_ip.{i}.bias": state_dict["ip_adapter"][f"{key_id}.to_v_ip.bias"]})
-
-                if not low_cpu_mem_usage:
-                    attn_procs[name].load_state_dict(value_dict)
-                else:
-                    device_map = {"": self.device}
-                    dtype = self.dtype
-                    load_model_dict_into_meta(attn_procs[name], value_dict, device_map=device_map, dtype=dtype)
-
-                key_id += 1
-
-        return attn_procs
-
-    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
-        if not isinstance(state_dicts, list):
-            state_dicts = [state_dicts]
-
-        self.encoder_hid_proj = None
-
-        attn_procs = self._convert_ip_adapter_attn_to_diffusers(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
-        self.set_attn_processor(attn_procs)
-
-        image_projection_layers = []
-        for state_dict in state_dicts:
-            image_projection_layer = self._convert_ip_adapter_image_proj_to_diffusers(
-                state_dict["image_proj"], low_cpu_mem_usage=low_cpu_mem_usage
-            )
-            image_projection_layers.append(image_projection_layer)
-
-        self.encoder_hid_proj = MultiIPAdapterImageProjection(image_projection_layers)
-        self.config.encoder_hid_dim_type = "ip_image_proj"
+class FluxTransformer2DLoadersMixin(FluxTransformer2DLoadersMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `FluxTransformer2DLoadersMixin` from diffusers.loaders.transformer_flux has been deprecated. Please use `from diffusers.loaders.ip_adapter.transformer_flux import FluxTransformer2DLoadersMixin` instead."
+        deprecate("diffusers.loaders.transformer_flux.FluxTransformer2DLoadersMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)
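Review note: the relocated `_convert_ip_adapter_image_proj_to_diffusers` infers the image projection dimensions from the shape of `proj.weight` alone: 16 image tokens when the first dimension is 65536, otherwise 4, with `cross_attention_dim` obtained by integer division. The sketch below reproduces only that arithmetic on plain shape tuples so it runs without `torch`. The helper name and the example shapes are assumptions made for illustration.

```python
def infer_image_projection_dims(proj_weight_shape):
    # proj.weight has shape (num_image_text_embeds * cross_attention_dim, clip_embeddings_dim).
    out_features = proj_weight_shape[0]
    clip_embeddings_dim = proj_weight_shape[-1]

    # Same rule as the moved code: default to 4 tokens, 16 when the output width is 65536.
    num_image_text_embeds = 16 if out_features == 65536 else 4
    cross_attention_dim = out_features // num_image_text_embeds
    return num_image_text_embeds, cross_attention_dim, clip_embeddings_dim


# Illustrative shapes only; real checkpoints may differ.
large = infer_image_projection_dims((65536, 1280))
small = infer_image_projection_dims((16384, 768))
```

The rule keys off a single magic width, so a checkpoint with a different token count would silently fall back to 4 tokens.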
@@ -11,160 +11,12 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from contextlib import nullcontext
-from typing import Dict
-
-from ..models.attention_processor import SD3IPAdapterJointAttnProcessor2_0
-from ..models.embeddings import IPAdapterTimeImageProjection
-from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
-from ..utils import is_accelerate_available, is_torch_version, logging
+from ..utils import deprecate
+from .ip_adapter.transformer_sd3 import SD3Transformer2DLoadersMixin


-logger = logging.get_logger(__name__)
-
-
-class SD3Transformer2DLoadersMixin:
-    """Load IP-Adapters and LoRA layers into a `[SD3Transformer2DModel]`."""
-
-    def _convert_ip_adapter_attn_to_diffusers(
-        self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT
-    ) -> Dict:
-        if low_cpu_mem_usage:
-            if is_accelerate_available():
-                from accelerate import init_empty_weights
-
-            else:
-                low_cpu_mem_usage = False
-                logger.warning(
-                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                    " install accelerate\n```\n."
-                )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        # IP-Adapter cross attention parameters
-        hidden_size = self.config.attention_head_dim * self.config.num_attention_heads
-        ip_hidden_states_dim = self.config.attention_head_dim * self.config.num_attention_heads
-        timesteps_emb_dim = state_dict["0.norm_ip.linear.weight"].shape[1]
-
-        # Dict where key is transformer layer index, value is attention processor's state dict
-        # ip_adapter state dict keys example: "0.norm_ip.linear.weight"
-        layer_state_dict = {idx: {} for idx in range(len(self.attn_processors))}
-        for key, weights in state_dict.items():
-            idx, name = key.split(".", maxsplit=1)
-            layer_state_dict[int(idx)][name] = weights
-
-        # Create IP-Adapter attention processor & load state_dict
-        attn_procs = {}
-        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
-        for idx, name in enumerate(self.attn_processors.keys()):
-            with init_context():
-                attn_procs[name] = SD3IPAdapterJointAttnProcessor2_0(
-                    hidden_size=hidden_size,
-                    ip_hidden_states_dim=ip_hidden_states_dim,
-                    head_dim=self.config.attention_head_dim,
-                    timesteps_emb_dim=timesteps_emb_dim,
-                )
-
-            if not low_cpu_mem_usage:
-                attn_procs[name].load_state_dict(layer_state_dict[idx], strict=True)
-            else:
-                device_map = {"": self.device}
-                load_model_dict_into_meta(
-                    attn_procs[name], layer_state_dict[idx], device_map=device_map, dtype=self.dtype
-                )
-
-        return attn_procs
-
-    def _convert_ip_adapter_image_proj_to_diffusers(
-        self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT
-    ) -> IPAdapterTimeImageProjection:
-        if low_cpu_mem_usage:
-            if is_accelerate_available():
-                from accelerate import init_empty_weights
-
-            else:
-                low_cpu_mem_usage = False
-                logger.warning(
-                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                    " install accelerate\n```\n."
-                )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
-
-        # Convert to diffusers
-        updated_state_dict = {}
-        for key, value in state_dict.items():
-            # InstantX/SD3.5-Large-IP-Adapter
-            if key.startswith("layers."):
-                idx = key.split(".")[1]
-                key = key.replace(f"layers.{idx}.0.norm1", f"layers.{idx}.ln0")
-                key = key.replace(f"layers.{idx}.0.norm2", f"layers.{idx}.ln1")
-                key = key.replace(f"layers.{idx}.0.to_q", f"layers.{idx}.attn.to_q")
-                key = key.replace(f"layers.{idx}.0.to_kv", f"layers.{idx}.attn.to_kv")
-                key = key.replace(f"layers.{idx}.0.to_out", f"layers.{idx}.attn.to_out.0")
-                key = key.replace(f"layers.{idx}.1.0", f"layers.{idx}.adaln_norm")
-                key = key.replace(f"layers.{idx}.1.1", f"layers.{idx}.ff.net.0.proj")
-                key = key.replace(f"layers.{idx}.1.3", f"layers.{idx}.ff.net.2")
-                key = key.replace(f"layers.{idx}.2.1", f"layers.{idx}.adaln_proj")
-            updated_state_dict[key] = value
-
-        # Image projetion parameters
-        embed_dim = updated_state_dict["proj_in.weight"].shape[1]
-        output_dim = updated_state_dict["proj_out.weight"].shape[0]
-        hidden_dim = updated_state_dict["proj_in.weight"].shape[0]
-        heads = updated_state_dict["layers.0.attn.to_q.weight"].shape[0] // 64
-        num_queries = updated_state_dict["latents"].shape[1]
-        timestep_in_dim = updated_state_dict["time_embedding.linear_1.weight"].shape[1]
-
-        # Image projection
-        with init_context():
-            image_proj = IPAdapterTimeImageProjection(
-                embed_dim=embed_dim,
-                output_dim=output_dim,
-                hidden_dim=hidden_dim,
-                heads=heads,
-                num_queries=num_queries,
-                timestep_in_dim=timestep_in_dim,
-            )
-
-        if not low_cpu_mem_usage:
-            image_proj.load_state_dict(updated_state_dict, strict=True)
-        else:
-            device_map = {"": self.device}
-            load_model_dict_into_meta(image_proj, updated_state_dict, device_map=device_map, dtype=self.dtype)
-
-        return image_proj
-
-    def _load_ip_adapter_weights(self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT) -> None:
-        """Sets IP-Adapter attention processors, image projection, and loads state_dict.
-
-        Args:
-            state_dict (`Dict`):
-                State dict with keys "ip_adapter", which contains parameters for attention processors, and
-                "image_proj", which contains parameters for image projection net.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-
-        attn_procs = self._convert_ip_adapter_attn_to_diffusers(state_dict["ip_adapter"], low_cpu_mem_usage)
-        self.set_attn_processor(attn_procs)
-
-        self.image_proj = self._convert_ip_adapter_image_proj_to_diffusers(state_dict["image_proj"], low_cpu_mem_usage)
+class SD3Transformer2DLoadersMixin(SD3Transformer2DLoadersMixin):
+    def __init__(self, *args, **kwargs):
+        deprecation_message = "Importing `SD3Transformer2DLoadersMixin` from diffusers.loaders.ip_adapter has been deprecated. Please use `from diffusers.loaders.ip_adapter.transformer_sd3 import SD3Transformer2DLoadersMixin` instead."
+        deprecate("diffusers.loaders.ip_adapter.SD3Transformer2DLoadersMixin", "0.36", deprecation_message)
+        super().__init__(*args, **kwargs)
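
The shim above keeps the old import path working while warning callers to move to the new one. A minimal standalone sketch of that pattern follows. It uses the standard-library `warnings` module in place of the `deprecate` helper from the diff, and the names `LoaderMixin` and `DeprecatedLoaderMixin` are hypothetical. Unlike the diff, it gives the subclass a different name from its base for readability.

```python
import warnings


class LoaderMixin:
    """Stand-in for the class at its new location (hypothetical name)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


class DeprecatedLoaderMixin(LoaderMixin):
    """Backward-compatible subclass kept at the old import path."""

    def __init__(self, *args, **kwargs):
        # Warn on every instantiation and point the caller at the new import path.
        warnings.warn(
            "Importing `LoaderMixin` from the old module is deprecated; "
            "import it from the new module instead.",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Because the shim only overrides `__init__`, behaviour inherited from the relocated class is unchanged; the cost is one warning per instantiation through the old path.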
@@ -0,0 +1,5 @@
+from ...utils import is_torch_available
+
+
+if is_torch_available():
+    from .unet import UNet2DConditionLoadersMixin
@@ -22,7 +22,7 @@ import torch
 import torch.nn.functional as F
 from huggingface_hub.utils import validate_hf_hub_args

-from ..models.embeddings import (
+from ...models.embeddings import (
    ImageProjection,
    IPAdapterFaceIDImageProjection,
    IPAdapterFaceIDPlusImageProjection,
@@ -30,8 +30,8 @@ from ..models.embeddings import (
    IPAdapterPlusImageProjection,
    MultiIPAdapterImageProjection,
 )
-from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta, load_state_dict
-from ..utils import (
+from ...models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta, load_state_dict
+from ...utils import (
    USE_PEFT_BACKEND,
    _get_model_file,
    convert_unet_state_dict_to_peft,
@@ -43,9 +43,9 @@ from ..utils import (
    is_torch_version,
    logging,
 )
-from .lora_base import _func_optionally_disable_offloading
-from .lora_pipeline import LORA_WEIGHT_NAME, LORA_WEIGHT_NAME_SAFE, TEXT_ENCODER_NAME, UNET_NAME
-from .utils import AttnProcsLayers
+from ..lora.lora_base import _func_optionally_disable_offloading
+from ..lora.lora_pipeline import LORA_WEIGHT_NAME, LORA_WEIGHT_NAME_SAFE, TEXT_ENCODER_NAME, UNET_NAME
+from ..utils import AttnProcsLayers


 logger = logging.get_logger(__name__)
@@ -247,7 +247,7 @@ class UNet2DConditionLoadersMixin:
        # Unsafe code />

    def _process_custom_diffusion(self, state_dict):
-        from ..models.attention_processor import CustomDiffusionAttnProcessor
+        from ...models.attention_processor import CustomDiffusionAttnProcessor

        attn_processors = {}
        custom_diffusion_grouped_dict = defaultdict(dict)
@@ -395,7 +395,7 @@ class UNet2DConditionLoadersMixin:
        return is_model_cpu_offload, is_sequential_cpu_offload

    @classmethod
-    # Copied from diffusers.loaders.lora_base.LoraBaseMixin._optionally_disable_offloading
+    # Copied from diffusers.loaders.lora.lora_base.LoraBaseMixin._optionally_disable_offloading
    def _optionally_disable_offloading(cls, _pipeline):
        """
        Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU.
@@ -451,7 +451,7 @@ class UNet2DConditionLoadersMixin:
        pipeline.unet.save_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
        ```
        """
-        from ..models.attention_processor import (
+        from ...models.attention_processor import (
            CustomDiffusionAttnProcessor,
            CustomDiffusionAttnProcessor2_0,
            CustomDiffusionXFormersAttnProcessor,
@@ -513,7 +513,7 @@ class UNet2DConditionLoadersMixin:
        logger.info(f"Model weights saved in {save_path}")

    def _get_custom_diffusion_state_dict(self):
-        from ..models.attention_processor import (
+        from ...models.attention_processor import (
            CustomDiffusionAttnProcessor,
            CustomDiffusionAttnProcessor2_0,
            CustomDiffusionXFormersAttnProcessor,
@@ -759,7 +759,7 @@ class UNet2DConditionLoadersMixin:
        return image_projection

    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
-        from ..models.attention_processor import (
+        from ...models.attention_processor import (
            IPAdapterAttnProcessor,
            IPAdapterAttnProcessor2_0,
            IPAdapterXFormersAttnProcessor,
@@ -14,12 +14,12 @@
 import copy
 from typing import TYPE_CHECKING, Dict, List, Union

-from ..utils import logging
+from ...utils import logging


 if TYPE_CHECKING:
    # import here to avoid circular imports
-    from ..models import UNet2DConditionModel
+    from ...models import UNet2DConditionModel

 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name

@@ -17,8 +17,7 @@ import torch
 import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import PeftAdapterMixin
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
 from ...utils import deprecate
 from ...utils.accelerate_utils import apply_forward_hook
 from ..attention_processor import (
@@ -21,7 +21,7 @@ import torch.nn as nn
 import torch.nn.functional as F

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin
 from ...utils import logging
 from ...utils.accelerate_utils import apply_forward_hook
 from ..activations import get_activation
@@ -19,7 +19,7 @@ from torch import nn
 from torch.nn import functional as F

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin
 from ...utils import BaseOutput, logging
 from ..attention_processor import (
    ADDED_KV_ATTENTION_PROCESSORS,
@@ -20,12 +20,12 @@ import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
 from ...loaders import PeftAdapterMixin
-from ...models.attention_processor import AttentionProcessor
-from ...models.modeling_utils import ModelMixin
 from ...utils import USE_PEFT_BACKEND, BaseOutput, logging, scale_lora_layers, unscale_lora_layers
+from ..attention_processor import AttentionProcessor
 from ..controlnets.controlnet import ControlNetConditioningEmbedding, zero_module
 from ..embeddings import CombinedTimestepGuidanceTextProjEmbeddings, CombinedTimestepTextProjEmbeddings, FluxPosEmbed
 from ..modeling_outputs import Transformer2DModelOutput
+from ..modeling_utils import ModelMixin
 from ..transformers.transformer_flux import FluxSingleTransformerBlock, FluxTransformerBlock


@@ -17,7 +17,7 @@ import torch
 from torch import nn

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin
 from ...utils import logging
 from ..attention_processor import (
    ADDED_KV_ATTENTION_PROCESSORS,
@@ -4,9 +4,9 @@ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
 import torch
 from torch import nn

-from ...models.controlnets.controlnet import ControlNetModel, ControlNetOutput
-from ...models.modeling_utils import ModelMixin
 from ...utils import logging
+from ..controlnets.controlnet import ControlNetModel, ControlNetOutput
+from ..modeling_utils import ModelMixin


 logger = logging.get_logger(__name__)
@@ -4,10 +4,10 @@ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
 import torch
 from torch import nn

-from ...models.controlnets.controlnet import ControlNetOutput
-from ...models.controlnets.controlnet_union import ControlNetUnionModel
-from ...models.modeling_utils import ModelMixin
 from ...utils import logging
+from ..controlnets.controlnet import ControlNetOutput
+from ..controlnets.controlnet_union import ControlNetUnionModel
+from ..modeling_utils import ModelMixin


 logger = logging.get_logger(__name__)
@@ -286,7 +286,7 @@ class KDownsample2D(nn.Module):


 class CogVideoXDownsample3D(nn.Module):
-    # Todo: Wait for paper relase.
+    # Todo: Wait for paper release.
    r"""
    A 3D Downsampling layer using in [CogVideoX]() by Tsinghua University & ZhipuAI

@@ -18,10 +18,9 @@ import torch
 from torch import nn

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...models.embeddings import PixArtAlphaTextProjection, get_1d_sincos_pos_embed_from_grid
 from ..attention import BasicTransformerBlock
 from ..cache_utils import CacheMixin
-from ..embeddings import PatchEmbed
+from ..embeddings import PatchEmbed, PixArtAlphaTextProjection, get_1d_sincos_pos_embed_from_grid
 from ..modeling_outputs import Transformer2DModelOutput
 from ..modeling_utils import ModelMixin
 from ..normalization import AdaLayerNormSingle
@@ -21,16 +21,12 @@ import torch.nn as nn
 import torch.utils.checkpoint

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...models.attention import FeedForward
-from ...models.attention_processor import (
-    Attention,
-    AttentionProcessor,
-    StableAudioAttnProcessor2_0,
-)
-from ...models.modeling_utils import ModelMixin
-from ...models.transformers.transformer_2d import Transformer2DModelOutput
 from ...utils import logging
 from ...utils.torch_utils import maybe_allow_in_graph
+from ..attention import FeedForward
+from ..attention_processor import Attention, AttentionProcessor, StableAudioAttnProcessor2_0
+from ..modeling_utils import ModelMixin
+from ..transformers.transformer_2d import Transformer2DModelOutput


 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -19,18 +19,13 @@ import torch
 import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...models.attention import FeedForward
-from ...models.attention_processor import (
-    Attention,
-    AttentionProcessor,
-    CogVideoXAttnProcessor2_0,
-)
-from ...models.modeling_utils import ModelMixin
-from ...models.normalization import AdaLayerNormContinuous
 from ...utils import logging
+from ..attention import FeedForward
+from ..attention_processor import Attention, AttentionProcessor, CogVideoXAttnProcessor2_0
 from ..embeddings import CogView3CombinedTimestepSizeEmbeddings, CogView3PlusPatchEmbed
 from ..modeling_outputs import Transformer2DModelOutput
-from ..normalization import CogView3PlusAdaLayerNormZeroTextImage
+from ..modeling_utils import ModelMixin
+from ..normalization import AdaLayerNormContinuous, CogView3PlusAdaLayerNormZeroTextImage


 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from typing import Any, Dict, Optional, Tuple, Union
+from typing import Any, Dict, List, Optional, Tuple, Union

 import torch
 import torch.nn as nn
@@ -73,8 +73,9 @@ class CogView4AdaLayerNormZero(nn.Module):
    def forward(
        self, hidden_states: torch.Tensor, encoder_hidden_states: torch.Tensor, temb: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        norm_hidden_states = self.norm(hidden_states)
-        norm_encoder_hidden_states = self.norm_context(encoder_hidden_states)
+        dtype = hidden_states.dtype
+        norm_hidden_states = self.norm(hidden_states).to(dtype=dtype)
+        norm_encoder_hidden_states = self.norm_context(encoder_hidden_states).to(dtype=dtype)

        emb = self.linear(temb)
        (
@@ -111,8 +112,11 @@ class CogView4AdaLayerNormZero(nn.Module):

 class CogView4AttnProcessor:
    """
-    Processor for implementing scaled dot-product attention for the CogVideoX model. It applies a rotary embedding on
+    Processor for implementing scaled dot-product attention for the CogView4 model. It applies a rotary embedding on
    query and key vectors, but does not include spatial normalization.
+
+    The processor supports passing an attention mask for text tokens. The attention mask should have shape (batch_size,
+    text_seq_length) where 1 indicates a non-padded token and 0 indicates a padded token.
    """

    def __init__(self):
@@ -125,8 +129,10 @@ class CogView4AttnProcessor:
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
-        image_rotary_emb: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
+        image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        dtype = encoder_hidden_states.dtype
+
        batch_size, text_seq_length, embed_dim = encoder_hidden_states.shape
        batch_size, image_seq_length, embed_dim = hidden_states.shape
        hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
@@ -142,9 +148,9 @@ class CogView4AttnProcessor:

        # 2. QK normalization
        if attn.norm_q is not None:
-            query = attn.norm_q(query)
+            query = attn.norm_q(query).to(dtype=dtype)
        if attn.norm_k is not None:
-            key = attn.norm_k(key)
+            key = attn.norm_k(key).to(dtype=dtype)

        # 3. Rotational positional embeddings applied to latent stream
        if image_rotary_emb is not None:
@@ -159,13 +165,14 @@ class CogView4AttnProcessor:

        # 4. Attention
        if attention_mask is not None:
-            text_attention_mask = attention_mask.float().to(query.device)
-            actual_text_seq_length = text_attention_mask.size(1)
-            new_attention_mask = torch.zeros((batch_size, text_seq_length + image_seq_length), device=query.device)
-            new_attention_mask[:, :actual_text_seq_length] = text_attention_mask
-            new_attention_mask = new_attention_mask.unsqueeze(2)
-            attention_mask_matrix = new_attention_mask @ new_attention_mask.transpose(1, 2)
-            attention_mask = (attention_mask_matrix > 0).unsqueeze(1).to(query.dtype)
+            text_attn_mask = attention_mask
+            assert text_attn_mask.dim() == 2, "the shape of text_attn_mask should be (batch_size, text_seq_length)"
+            text_attn_mask = text_attn_mask.float().to(query.device)
+            mix_attn_mask = torch.ones((batch_size, text_seq_length + image_seq_length), device=query.device)
+            mix_attn_mask[:, :text_seq_length] = text_attn_mask
+            mix_attn_mask = mix_attn_mask.unsqueeze(2)
+            attn_mask_matrix = mix_attn_mask @ mix_attn_mask.transpose(1, 2)
+            attention_mask = (attn_mask_matrix > 0).unsqueeze(1).to(query.dtype)

        hidden_states = F.scaled_dot_product_attention(
            query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
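
The hunk above builds the joint text-plus-image mask as an outer product of a per-token validity vector, starting from all ones so that image tokens stay attendable. A dependency-free sketch of the same idea for a single sample (no batch or head dimension) is shown here. `build_joint_mask` is a hypothetical helper, not part of the library, and it uses plain lists rather than tensors.

```python
from typing import List


def build_joint_mask(text_mask: List[int], image_seq_length: int) -> List[List[bool]]:
    """Return a (T+I) x (T+I) boolean matrix; True means the pair may attend."""
    # Image tokens are never padded here, so they are always marked valid.
    mixed = list(text_mask) + [1] * image_seq_length
    # Outer product of the validity vector with itself: a pair is allowed
    # only when both the query token and the key token are valid.
    return [[bool(q and k) for k in mixed] for q in mixed]


# Three text tokens (the last one is padding) followed by two image tokens.
mask = build_joint_mask([1, 1, 0], image_seq_length=2)
```

In this sketch the padded text token can neither attend nor be attended to, while every image token can attend to every valid token.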
@@ -183,9 +190,276 @@ class CogView4AttnProcessor:
        return hidden_states, encoder_hidden_states


+class CogView4TrainingAttnProcessor:
+    """
+    Training Processor for implementing scaled dot-product attention for the CogView4 model. It applies a rotary
+    embedding on query and key vectors, but does not include spatial normalization.
+
+    This processor differs from CogView4AttnProcessor in two ways:
+    1. It supports attention masking with variable sequence lengths for multi-resolution training
+    2. It unpacks and repacks sequences for efficient training with variable sequence lengths when batch_flag is
+       provided
+    """
+
+    def __init__(self):
+        if not hasattr(F, "scaled_dot_product_attention"):
+            raise ImportError("CogView4TrainingAttnProcessor requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0.")
+
+    def __call__(
+        self,
+        attn: Attention,
+        hidden_states: torch.Tensor,
+        encoder_hidden_states: torch.Tensor,
+        latent_attn_mask: Optional[torch.Tensor] = None,
+        text_attn_mask: Optional[torch.Tensor] = None,
+        batch_flag: Optional[torch.Tensor] = None,
+        image_rotary_emb: Optional[
+            Union[Tuple[torch.Tensor, torch.Tensor], List[Tuple[torch.Tensor, torch.Tensor]]]
+        ] = None,
+        **kwargs,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Args:
+            attn (`Attention`):
+                The attention module.
+            hidden_states (`torch.Tensor`):
+                The input hidden states.
+            encoder_hidden_states (`torch.Tensor`):
+                The text (encoder) hidden states, concatenated with `hidden_states` for joint attention.
+            latent_attn_mask (`torch.Tensor`, *optional*):
+                Mask for latent tokens where 0 indicates pad token and 1 indicates non-pad token. If None, full
+                attention is used for all latent tokens. Note: the shape of latent_attn_mask is (batch_size,
+                num_latent_tokens).
+            text_attn_mask (`torch.Tensor`, *optional*):
+                Mask for text tokens where 0 indicates pad token and 1 indicates non-pad token. If None, full attention
+                is used for all text tokens.
+            batch_flag (`torch.Tensor`, *optional*):
+                Values from 0 to n-1 indicating which samples belong to the same batch. Samples with the same
+                batch_flag are packed together. Example: [0, 1, 1, 2, 2] means sample 0 forms batch0, samples 1-2 form
+                batch1, and samples 3-4 form batch2. If None, no packing is used.
+            image_rotary_emb (`Tuple[torch.Tensor, torch.Tensor]` or `List[Tuple[torch.Tensor, torch.Tensor]]`, *optional*):
+                The rotary embedding for the image part of the input.
+        Returns:
+            `Tuple[torch.Tensor, torch.Tensor]`: The processed hidden states for both image and text streams.
+        """
+
+        # Get dimensions and device info
+        batch_size, text_seq_length, embed_dim = encoder_hidden_states.shape
+        batch_size, image_seq_length, embed_dim = hidden_states.shape
+        dtype = encoder_hidden_states.dtype
+        device = encoder_hidden_states.device
+        latent_hidden_states = hidden_states
+        # Combine text and image streams for joint processing
+        mixed_hidden_states = torch.cat([encoder_hidden_states, latent_hidden_states], dim=1)
+
+        # 1. Construct the attention mask and, if batch_flag is given, pack the input
+        # Create default masks if not provided
+        if text_attn_mask is None:
+            text_attn_mask = torch.ones((batch_size, text_seq_length), dtype=torch.int32, device=device)
+        if latent_attn_mask is None:
+            latent_attn_mask = torch.ones((batch_size, image_seq_length), dtype=torch.int32, device=device)
+
+        # Validate mask shapes and types
+        assert text_attn_mask.dim() == 2, "the shape of text_attn_mask should be (batch_size, text_seq_length)"
+        assert text_attn_mask.dtype == torch.int32, "the dtype of text_attn_mask should be torch.int32"
+        assert latent_attn_mask.dim() == 2, "the shape of latent_attn_mask should be (batch_size, num_latent_tokens)"
+        assert latent_attn_mask.dtype == torch.int32, "the dtype of latent_attn_mask should be torch.int32"
+
+        # Create combined mask for text and image tokens
+        mixed_attn_mask = torch.ones(
+            (batch_size, text_seq_length + image_seq_length), dtype=torch.int32, device=device
+        )
+        mixed_attn_mask[:, :text_seq_length] = text_attn_mask
+        mixed_attn_mask[:, text_seq_length:] = latent_attn_mask
+
+        # Convert mask to attention matrix format (where 1 means attend, 0 means don't attend)
+        mixed_attn_mask_input = mixed_attn_mask.unsqueeze(2).to(dtype=dtype)
+        attn_mask_matrix = mixed_attn_mask_input @ mixed_attn_mask_input.transpose(1, 2)
+
+        # Handle batch packing if enabled
+        if batch_flag is not None:
+            assert batch_flag.dim() == 1
+            # Determine packed batch size based on batch_flag
+            packing_batch_size = torch.max(batch_flag).item() + 1
+
+            # Calculate actual sequence lengths for each sample based on masks
+            text_seq_length = torch.sum(text_attn_mask, dim=1)
+            latent_seq_length = torch.sum(latent_attn_mask, dim=1)
+            mixed_seq_length = text_seq_length + latent_seq_length
+
+            # Calculate packed sequence lengths for each packed batch
+            mixed_seq_length_packed = [
+                torch.sum(mixed_attn_mask[batch_flag == batch_idx]).item() for batch_idx in range(packing_batch_size)
+            ]
+
+            assert len(mixed_seq_length_packed) == packing_batch_size
+
+            # Pack sequences by removing padding tokens
+            mixed_attn_mask_flatten = mixed_attn_mask.flatten(0, 1)
+            mixed_hidden_states_flatten = mixed_hidden_states.flatten(0, 1)
+            mixed_hidden_states_unpad = mixed_hidden_states_flatten[mixed_attn_mask_flatten == 1]
+            assert torch.sum(mixed_seq_length) == mixed_hidden_states_unpad.shape[0]
+
+            # Split the unpadded sequence into packed batches
+            mixed_hidden_states_packed = torch.split(mixed_hidden_states_unpad, mixed_seq_length_packed)
+
+            # Re-pad to create packed batches with right-side padding
+            mixed_hidden_states_packed_padded = torch.nn.utils.rnn.pad_sequence(
+                mixed_hidden_states_packed,
+                batch_first=True,
+                padding_value=0.0,
+                padding_side="right",
+            )
+
+            # Create attention mask for packed batches
+            packed_seq_length = mixed_hidden_states_packed_padded.shape[1]
+            attn_mask_matrix = torch.zeros(
+                (packing_batch_size, packed_seq_length, packed_seq_length),
+                dtype=dtype,
+                device=device,
+            )
+
+            # Fill attention mask with block diagonal matrices
+            # This ensures that tokens can only attend to other tokens within the same original sample
+            for idx, mask in enumerate(attn_mask_matrix):
+                seq_lengths = mixed_seq_length[batch_flag == idx]
+                offset = 0
+                for length in seq_lengths:
+                    # Create a block of 1s for each sample in the packed batch
+                    mask[offset : offset + length, offset : offset + length] = 1
+                    offset += length
+
+        attn_mask_matrix = attn_mask_matrix.to(dtype=torch.bool)
+        attn_mask_matrix = attn_mask_matrix.unsqueeze(1)  # Add attention head dim
+        attention_mask = attn_mask_matrix
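
For the packed case, the loop above fills one square block of ones per original sample along the diagonal, so tokens attend only within their own sample. A dependency-free sketch of that construction for one packed batch is shown here. `block_diagonal_mask` is a hypothetical helper, and it uses plain lists instead of tensors.

```python
from typing import List


def block_diagonal_mask(seq_lengths: List[int], padded_length: int) -> List[List[int]]:
    """Build a padded_length x padded_length mask with one block of 1s per sample."""
    if sum(seq_lengths) > padded_length:
        raise ValueError("packed samples do not fit in padded_length")
    mask = [[0] * padded_length for _ in range(padded_length)]
    offset = 0
    for length in seq_lengths:
        # Tokens of one sample may attend only to tokens of the same sample.
        for row in range(offset, offset + length):
            for col in range(offset, offset + length):
                mask[row][col] = 1
        offset += length
    return mask


# Two samples of lengths 2 and 3 packed into a row padded to length 6.
mask = block_diagonal_mask([2, 3], padded_length=6)
```

Positions beyond the packed samples (here the final row and column) stay zero, matching the right-side padding added after packing.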
+
+        # Prepare hidden states for attention computation
+        if batch_flag is None:
+            # If no packing, just combine text and image tokens
+            hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
+        else:
+            # If packing, use the packed sequence
+            hidden_states = mixed_hidden_states_packed_padded
+
+        # 2. QKV projections - convert hidden states to query, key, value
+        query = attn.to_q(hidden_states)
+        key = attn.to_k(hidden_states)
+        value = attn.to_v(hidden_states)
+
+        # Reshape for multi-head attention: [batch, seq_len, heads*dim] -> [batch, heads, seq_len, dim]
+        query = query.unflatten(2, (attn.heads, -1)).transpose(1, 2)
+        key = key.unflatten(2, (attn.heads, -1)).transpose(1, 2)
+        value = value.unflatten(2, (attn.heads, -1)).transpose(1, 2)
+
+        # 3. QK normalization - apply layer norm to queries and keys if configured
+        if attn.norm_q is not None:
+            query = attn.norm_q(query).to(dtype=dtype)
+        if attn.norm_k is not None:
+            key = attn.norm_k(key).to(dtype=dtype)
+
+        # 4. Apply rotary positional embeddings to image tokens only
+        if image_rotary_emb is not None:
+            from ..embeddings import apply_rotary_emb
+
+            if batch_flag is None:
+                # Apply RoPE only to image tokens (after text tokens)
+                query[:, :, text_seq_length:, :] = apply_rotary_emb(
+                    query[:, :, text_seq_length:, :], image_rotary_emb, use_real_unbind_dim=-2
+                )
+                key[:, :, text_seq_length:, :] = apply_rotary_emb(
+                    key[:, :, text_seq_length:, :], image_rotary_emb, use_real_unbind_dim=-2
+                )
+            else:
+                # For packed batches, need to carefully apply RoPE to appropriate tokens
+                assert query.shape[0] == packing_batch_size
+                assert key.shape[0] == packing_batch_size
+                assert len(image_rotary_emb) == batch_size
+
+                rope_idx = 0
+                for idx in range(packing_batch_size):
+                    offset = 0
+                    # Get text and image sequence lengths for samples in this packed batch
+                    text_seq_length_bi = text_seq_length[batch_flag == idx]
+                    latent_seq_length_bi = latent_seq_length[batch_flag == idx]
+
+                    # Apply RoPE to each image segment in the packed sequence
+                    for tlen, llen in zip(text_seq_length_bi, latent_seq_length_bi):
+                        mlen = tlen + llen
+                        # Apply RoPE only to image tokens (after text tokens)
+                        query[idx, :, offset + tlen : offset + mlen, :] = apply_rotary_emb(
+                            query[idx, :, offset + tlen : offset + mlen, :],
+                            image_rotary_emb[rope_idx],
+                            use_real_unbind_dim=-2,
+                        )
+                        key[idx, :, offset + tlen : offset + mlen, :] = apply_rotary_emb(
+                            key[idx, :, offset + tlen : offset + mlen, :],
+                            image_rotary_emb[rope_idx],
+                            use_real_unbind_dim=-2,
+                        )
+                        offset += mlen
+                        rope_idx += 1
+
+        hidden_states = F.scaled_dot_product_attention(
+            query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
+        )
+
+        # Reshape back: [batch, heads, seq_len, dim] -> [batch, seq_len, heads*dim]
+        hidden_states = hidden_states.transpose(1, 2).flatten(2, 3)
+        hidden_states = hidden_states.type_as(query)
+
+        # 5. Output projection - project attention output to model dimension
+        hidden_states = attn.to_out[0](hidden_states)
+        hidden_states = attn.to_out[1](hidden_states)
+
+        # Split the output back into text and image streams
+        if batch_flag is None:
+            # Simple split for non-packed case
+            encoder_hidden_states, hidden_states = hidden_states.split(
+                [text_seq_length, hidden_states.size(1) - text_seq_length], dim=1
+            )
+        else:
+            # For packed case: need to unpack, split text/image, then restore to original shapes
+            # First, unpad the sequence based on the packed sequence lengths
+            hidden_states_unpad = torch.nn.utils.rnn.unpad_sequence(
+                hidden_states,
+                lengths=torch.tensor(mixed_seq_length_packed),
+                batch_first=True,
+            )
+            # Concatenate all unpadded sequences
+            hidden_states_flatten = torch.cat(hidden_states_unpad, dim=0)
+            # Split by original sample sequence lengths
+            hidden_states_unpack = torch.split(hidden_states_flatten, mixed_seq_length.tolist())
+            assert len(hidden_states_unpack) == batch_size
+
+            # Further split each sample's sequence into text and image parts
+            hidden_states_unpack = [
+                torch.split(h, [tlen, llen])
+                for h, tlen, llen in zip(hidden_states_unpack, text_seq_length, latent_seq_length)
+            ]
+            # Separate text and image sequences
+            encoder_hidden_states_unpad = [h[0] for h in hidden_states_unpack]
+            hidden_states_unpad = [h[1] for h in hidden_states_unpack]
+
+            # Update the original tensors with the processed values, respecting the attention masks
+            for idx in range(batch_size):
+                # Place unpacked text tokens back in the encoder_hidden_states tensor
+                encoder_hidden_states[idx][text_attn_mask[idx] == 1] = encoder_hidden_states_unpad[idx]
+                # Place unpacked image tokens back in the latent_hidden_states tensor
+                latent_hidden_states[idx][latent_attn_mask[idx] == 1] = hidden_states_unpad[idx]
+
+            # Update the output hidden states
+            hidden_states = latent_hidden_states
+
+        return hidden_states, encoder_hidden_states
+
+
 class CogView4TransformerBlock(nn.Module):
    def __init__(
-        self, dim: int = 2560, num_attention_heads: int = 64, attention_head_dim: int = 40, time_embed_dim: int = 512
+        self,
+        dim: int = 2560,
+        num_attention_heads: int = 64,
+        attention_head_dim: int = 40,
+        time_embed_dim: int = 512,
    ) -> None:
        super().__init__()

@@ -213,9 +487,11 @@ class CogView4TransformerBlock(nn.Module):
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        temb: Optional[torch.Tensor] = None,
-        image_rotary_emb: Optional[torch.Tensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        **kwargs,
+        image_rotary_emb: Optional[
+            Union[Tuple[torch.Tensor, torch.Tensor], List[Tuple[torch.Tensor, torch.Tensor]]]
+        ] = None,
+        attention_mask: Optional[Dict[str, torch.Tensor]] = None,
+        attention_kwargs: Optional[Dict[str, Any]] = None,
    ) -> torch.Tensor:
        # 1. Timestep conditioning
        (
@@ -232,12 +508,14 @@ class CogView4TransformerBlock(nn.Module):
        ) = self.norm1(hidden_states, encoder_hidden_states, temb)

        # 2. Attention
+        if attention_kwargs is None:
+            attention_kwargs = {}
        attn_hidden_states, attn_encoder_hidden_states = self.attn1(
            hidden_states=norm_hidden_states,
            encoder_hidden_states=norm_encoder_hidden_states,
            image_rotary_emb=image_rotary_emb,
            attention_mask=attention_mask,
-            **kwargs,
+            **attention_kwargs,
        )
        hidden_states = hidden_states + attn_hidden_states * gate_msa.unsqueeze(1)
        encoder_hidden_states = encoder_hidden_states + attn_encoder_hidden_states * c_gate_msa.unsqueeze(1)
@@ -402,7 +680,9 @@ class CogView4Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
        attention_kwargs: Optional[Dict[str, Any]] = None,
        return_dict: bool = True,
        attention_mask: Optional[torch.Tensor] = None,
-        **kwargs,
+        image_rotary_emb: Optional[
+            Union[Tuple[torch.Tensor, torch.Tensor], List[Tuple[torch.Tensor, torch.Tensor]]]
+        ] = None,
    ) -> Union[torch.Tensor, Transformer2DModelOutput]:
        if attention_kwargs is not None:
            attention_kwargs = attention_kwargs.copy()
@@ -422,7 +702,8 @@ class CogView4Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
        batch_size, num_channels, height, width = hidden_states.shape

        # 1. RoPE
-        image_rotary_emb = self.rope(hidden_states)
+        if image_rotary_emb is None:
+            image_rotary_emb = self.rope(hidden_states)

        # 2. Patch & Timestep embeddings
        p = self.config.patch_size
@@ -438,11 +719,22 @@ class CogView4Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
        for block in self.transformer_blocks:
            if torch.is_grad_enabled() and self.gradient_checkpointing:
                hidden_states, encoder_hidden_states = self._gradient_checkpointing_func(
-                    block, hidden_states, encoder_hidden_states, temb, image_rotary_emb, attention_mask, **kwargs
+                    block,
+                    hidden_states,
+                    encoder_hidden_states,
+                    temb,
+                    image_rotary_emb,
+                    attention_mask,
+                    attention_kwargs,
                )
            else:
                hidden_states, encoder_hidden_states = block(
-                    hidden_states, encoder_hidden_states, temb, image_rotary_emb, attention_mask, **kwargs
+                    hidden_states,
+                    encoder_hidden_states,
+                    temb,
+                    image_rotary_emb,
+                    attention_mask,
+                    attention_kwargs,
                )

        # 4. Output norm & projection
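
Note on the packed-batch RoPE branch in the CogView4 attention processor above: each packed row stores several samples back to back, text tokens first and image tokens after them, and only the image tokens receive rotary embeddings. The sketch below reproduces just the offset arithmetic with plain Python lists. The function name `image_token_slices` and the sample lengths are hypothetical and not part of the diff.

```python
from typing import List, Tuple


def image_token_slices(text_lens: List[int], latent_lens: List[int]) -> List[Tuple[int, int]]:
    """Return (start, stop) ranges of image tokens inside one packed row.

    Each sample occupies `tlen + llen` consecutive positions: text first,
    then image tokens. Mirrors the `offset` / `tlen` / `mlen` bookkeeping
    of the packed branch; this is an illustration, not the library code.
    """
    slices = []
    offset = 0
    for tlen, llen in zip(text_lens, latent_lens):
        mlen = tlen + llen
        # Image tokens start after this sample's text tokens.
        slices.append((offset + tlen, offset + mlen))
        # The next sample starts where this one ends.
        offset += mlen
    return slices


# Hypothetical packed row holding two samples:
# sample 0 has 3 text + 4 image tokens, sample 1 has 2 text + 6 image tokens.
packed_slices = image_token_slices([3, 2], [4, 6])
print(packed_slices)
```

In the diff, each such range is the slice of `query` and `key` passed to `apply_rotary_emb`, and `rope_idx` advances once per sample so that every sample uses its own entry of `image_rotary_emb`.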
@@ -21,22 +21,22 @@ import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
 from ...loaders import FluxTransformer2DLoadersMixin, FromOriginalModelMixin, PeftAdapterMixin
-from ...models.attention import FeedForward
-from ...models.attention_processor import (
+from ...utils import USE_PEFT_BACKEND, deprecate, logging, scale_lora_layers, unscale_lora_layers
+from ...utils.import_utils import is_torch_npu_available
+from ...utils.torch_utils import maybe_allow_in_graph
+from ..attention import FeedForward
+from ..attention_processor import (
    Attention,
    AttentionProcessor,
    FluxAttnProcessor2_0,
    FluxAttnProcessor2_0_NPU,
    FusedFluxAttnProcessor2_0,
 )
-from ...models.modeling_utils import ModelMixin
-from ...models.normalization import AdaLayerNormContinuous, AdaLayerNormZero, AdaLayerNormZeroSingle
-from ...utils import USE_PEFT_BACKEND, deprecate, logging, scale_lora_layers, unscale_lora_layers
-from ...utils.import_utils import is_torch_npu_available
-from ...utils.torch_utils import maybe_allow_in_graph
 from ..cache_utils import CacheMixin
 from ..embeddings import CombinedTimestepGuidanceTextProjEmbeddings, CombinedTimestepTextProjEmbeddings, FluxPosEmbed
 from ..modeling_outputs import Transformer2DModelOutput
+from ..modeling_utils import ModelMixin
+from ..normalization import AdaLayerNormContinuous, AdaLayerNormZero, AdaLayerNormZeroSingle


 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -275,7 +275,14 @@ class HiDreamAttnProcessor:

 # Modified from https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py
 class MoEGate(nn.Module):
-    def __init__(self, embed_dim, num_routed_experts=4, num_activated_experts=2, aux_loss_alpha=0.01):
+    def __init__(
+        self,
+        embed_dim,
+        num_routed_experts=4,
+        num_activated_experts=2,
+        aux_loss_alpha=0.01,
+        _force_inference_output=False,
+    ):
        super().__init__()
        self.top_k = num_activated_experts
        self.n_routed_experts = num_routed_experts
@@ -289,9 +296,10 @@ class MoEGate(nn.Module):
        self.gating_dim = embed_dim
        self.weight = nn.Parameter(torch.randn(self.n_routed_experts, self.gating_dim) / embed_dim**0.5)

+        self._force_inference_output = _force_inference_output
+
    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
-        # print(bsz, seq_len, h)
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(hidden_states, self.weight, None)
@@ -309,7 +317,7 @@ class MoEGate(nn.Module):
            topk_weight = topk_weight / denominator

        ### expert-level computation auxiliary loss
-        if self.training and self.alpha > 0.0:
+        if self.training and self.alpha > 0.0 and not self._force_inference_output:
            scores_for_aux = scores
            aux_topk = self.top_k
            # always compute aux loss based on the naive greedy topk method
@@ -341,14 +349,19 @@ class MOEFeedForwardSwiGLU(nn.Module):
        hidden_dim: int,
        num_routed_experts: int,
        num_activated_experts: int,
+        _force_inference_output: bool = False,
    ):
        super().__init__()
        self.shared_experts = HiDreamImageFeedForwardSwiGLU(dim, hidden_dim // 2)
        self.experts = nn.ModuleList(
            [HiDreamImageFeedForwardSwiGLU(dim, hidden_dim) for i in range(num_routed_experts)]
        )
+        self._force_inference_output = _force_inference_output
        self.gate = MoEGate(
-            embed_dim=dim, num_routed_experts=num_routed_experts, num_activated_experts=num_activated_experts
+            embed_dim=dim,
+            num_routed_experts=num_routed_experts,
+            num_activated_experts=num_activated_experts,
+            _force_inference_output=_force_inference_output,
        )
        self.num_activated_experts = num_activated_experts

@@ -359,7 +372,7 @@ class MOEFeedForwardSwiGLU(nn.Module):
        topk_idx, topk_weight, aux_loss = self.gate(x)
        x = x.view(-1, x.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
-        if self.training:
+        if self.training and not self._force_inference_output:
            x = x.repeat_interleave(self.num_activated_experts, dim=0)
            y = torch.empty_like(x, dtype=wtype)
            for i, expert in enumerate(self.experts):
@@ -413,6 +426,7 @@ class HiDreamImageSingleTransformerBlock(nn.Module):
        attention_head_dim: int,
        num_routed_experts: int = 4,
        num_activated_experts: int = 2,
+        _force_inference_output: bool = False,
    ):
        super().__init__()
        self.num_attention_heads = num_attention_heads
@@ -436,6 +450,7 @@ class HiDreamImageSingleTransformerBlock(nn.Module):
                hidden_dim=4 * dim,
                num_routed_experts=num_routed_experts,
                num_activated_experts=num_activated_experts,
+                _force_inference_output=_force_inference_output,
            )
        else:
            self.ff_i = HiDreamImageFeedForwardSwiGLU(dim=dim, hidden_dim=4 * dim)
@@ -480,6 +495,7 @@ class HiDreamImageTransformerBlock(nn.Module):
        attention_head_dim: int,
        num_routed_experts: int = 4,
        num_activated_experts: int = 2,
+        _force_inference_output: bool = False,
    ):
        super().__init__()
        self.num_attention_heads = num_attention_heads
@@ -504,6 +520,7 @@ class HiDreamImageTransformerBlock(nn.Module):
                hidden_dim=4 * dim,
                num_routed_experts=num_routed_experts,
                num_activated_experts=num_activated_experts,
+                _force_inference_output=_force_inference_output,
            )
        else:
            self.ff_i = HiDreamImageFeedForwardSwiGLU(dim=dim, hidden_dim=4 * dim)
@@ -606,6 +623,7 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
        axes_dims_rope: Tuple[int, int] = (32, 32),
        max_resolution: Tuple[int, int] = (128, 128),
        llama_layers: List[int] = None,
+        force_inference_output: bool = False,
    ):
        super().__init__()
        self.out_channels = out_channels or in_channels
@@ -629,6 +647,7 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
                        attention_head_dim=attention_head_dim,
                        num_routed_experts=num_routed_experts,
                        num_activated_experts=num_activated_experts,
+                        _force_inference_output=force_inference_output,
                    )
                )
                for _ in range(num_layers)
@@ -644,6 +663,7 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
                        attention_head_dim=attention_head_dim,
                        num_routed_experts=num_routed_experts,
                        num_activated_experts=num_activated_experts,
+                        _force_inference_output=force_inference_output,
                    )
                )
                for _ in range(num_single_layers)
@@ -662,7 +682,7 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
        self.gradient_checkpointing = False

    def unpatchify(self, x: torch.Tensor, img_sizes: List[Tuple[int, int]], is_training: bool) -> List[torch.Tensor]:
-        if is_training:
+        if is_training and not self.config.force_inference_output:
            B, S, F = x.shape
            C = F // (self.config.patch_size * self.config.patch_size)
            x = (
@@ -771,7 +791,7 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):

        if encoder_hidden_states is not None:
            deprecation_message = "The `encoder_hidden_states` argument is deprecated. Please use `encoder_hidden_states_t5` and `encoder_hidden_states_llama3` instead."
-            deprecate("encoder_hidden_states", "0.34.0", deprecation_message)
+            deprecate("encoder_hidden_states", "0.35.0", deprecation_message)
            encoder_hidden_states_t5 = encoder_hidden_states[0]
            encoder_hidden_states_llama3 = encoder_hidden_states[1]

@@ -779,7 +799,7 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
            deprecation_message = (
                "Passing `img_ids` and `img_sizes` with unpachified `hidden_states` is deprecated and will be ignored."
            )
-            deprecate("img_ids", "0.34.0", deprecation_message)
+            deprecate("img_ids", "0.35.0", deprecation_message)

        if hidden_states_masks is not None and (img_ids is None or img_sizes is None):
            raise ValueError("if `hidden_states_masks` is passed, `img_ids` and `img_sizes` must also be passed.")
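
Note on `_force_inference_output` in the HiDream hunks above: the flag is combined with `self.training` so that a module left in training mode can still take the inference code path. The toy class below isolates that boolean condition. `ToyGate` and `uses_training_path` are hypothetical names used only for illustration, and no tensors are involved.

```python
class ToyGate:
    """Minimal stand-in for the gating condition; not the real MoEGate."""

    def __init__(self, alpha: float = 0.01, force_inference_output: bool = False):
        self.training = True  # nn.Module instances start in training mode
        self.alpha = alpha
        self._force_inference_output = force_inference_output

    def uses_training_path(self) -> bool:
        # Same condition as the updated auxiliary-loss check in the diff.
        return self.training and self.alpha > 0.0 and not self._force_inference_output


default_gate = ToyGate()
forced_gate = ToyGate(force_inference_output=True)
print(default_gate.uses_training_path(), forced_gate.uses_training_path())
```

The same `not self._force_inference_output` guard appears in `MOEFeedForwardSwiGLU.forward` and, through `self.config.force_inference_output`, in `unpatchify`.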
@@ -20,8 +20,7 @@ import torch.nn as nn
 import torch.nn.functional as F

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import PeftAdapterMixin
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
 from ...utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers
 from ..attention import LuminaFeedForward
 from ..attention_processor import Attention
@@ -19,8 +19,7 @@ import torch
 import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import PeftAdapterMixin
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
 from ...utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers
 from ...utils.torch_utils import maybe_allow_in_graph
 from ..attention import FeedForward
@@ -18,19 +18,19 @@ import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
 from ...loaders import FromOriginalModelMixin, PeftAdapterMixin, SD3Transformer2DLoadersMixin
-from ...models.attention import FeedForward, JointTransformerBlock
-from ...models.attention_processor import (
+from ...utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers
+from ...utils.torch_utils import maybe_allow_in_graph
+from ..attention import FeedForward, JointTransformerBlock
+from ..attention_processor import (
    Attention,
    AttentionProcessor,
    FusedJointAttnProcessor2_0,
    JointAttnProcessor2_0,
 )
-from ...models.modeling_utils import ModelMixin
-from ...models.normalization import AdaLayerNormContinuous, AdaLayerNormZero
-from ...utils import USE_PEFT_BACKEND, logging, scale_lora_layers, unscale_lora_layers
-from ...utils.torch_utils import maybe_allow_in_graph
 from ..embeddings import CombinedTimestepTextProjEmbeddings, PatchEmbed
 from ..modeling_outputs import Transformer2DModelOutput
+from ..modeling_utils import ModelMixin
+from ..normalization import AdaLayerNormContinuous, AdaLayerNormZero


 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -19,8 +19,7 @@ import torch.nn as nn
 import torch.utils.checkpoint

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import PeftAdapterMixin, UNet2DConditionLoadersMixin
-from ...loaders.single_file_model import FromOriginalModelMixin
+from ...loaders import FromOriginalModelMixin, PeftAdapterMixin, UNet2DConditionLoadersMixin
 from ...utils import USE_PEFT_BACKEND, BaseOutput, deprecate, logging, scale_lora_layers, unscale_lora_layers
 from ..activations import get_activation
 from ..attention_processor import (
@@ -358,7 +358,7 @@ class KUpsample2D(nn.Module):

 class CogVideoXUpsample3D(nn.Module):
    r"""
-    A 3D Upsample layer using in CogVideoX by Tsinghua University & ZhipuAI # Todo: Wait for paper relase.
+    A 3D Upsample layer used in CogVideoX by Tsinghua University & ZhipuAI # Todo: Wait for paper release.

    Args:
        in_channels (`int`):
@@ -514,7 +514,7 @@ class AllegroPipeline(DiffusionPipeline):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -484,7 +484,7 @@ class IFPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixin):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -528,7 +528,7 @@ class IFImg2ImgPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixin):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -281,7 +281,7 @@ class IFImg2ImgSuperResolutionPipeline(DiffusionPipeline, StableDiffusionLoraLoa
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -568,7 +568,7 @@ class IFInpaintingPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixin):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -283,7 +283,7 @@ class IFInpaintingSuperResolutionPipeline(DiffusionPipeline, StableDiffusionLora
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -239,7 +239,7 @@ class IFSuperResolutionPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixi
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
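
Note on the caption-cleaning step touched by the typo fixes above: the pattern under the `# ip addresses:` comment replaces any four dot-separated groups of one to three digits with a single space. The sketch below runs that same pattern in isolation. The helper name `strip_ip_addresses` and the sample captions are made up for illustration.

```python
import re

# Same pattern as in the pipelines' caption cleaning.
IP_PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"


def strip_ip_addresses(caption: str) -> str:
    """Replace anything shaped like a dotted quad with one space."""
    return re.sub(IP_PATTERN, " ", caption)


cleaned = strip_ip_addresses("server at 192.168.0.1 rendered a cat")
print(repr(cleaned))

# The pattern does not validate octet ranges, so version-like strings match too.
print(repr(strip_ip_addresses("v1.2.3.4")))
```

Because the pattern checks only digit counts, it also removes strings such as `999.999.999.999` or four-part version numbers.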
@@ -574,7 +574,7 @@ class StableDiffusionModelEditingPipeline(
                idxs_replace.append(76)
            idxs_replaces.append(idxs_replace)

-        # prepare batch: for each pair of setences, old context and new values
+        # prepare batch: for each pair of sentences, old context and new values
        contexts, valuess = [], []
        for old_emb, new_emb, idxs_replace in zip(old_embs, new_embs, idxs_replaces):
            context = old_emb.detach()
@@ -490,14 +490,6 @@ class FluxPipeline(
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
        if prompt_embeds is not None and pooled_prompt_embeds is None:
            raise ValueError(
                "If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
@@ -821,7 +813,7 @@ class FluxPipeline(
            (
                negative_prompt_embeds,
                negative_pooled_prompt_embeds,
-                _,
+                negative_text_ids,
            ) = self.encode_prompt(
                prompt=negative_prompt,
                prompt_2=negative_prompt_2,
@@ -938,7 +930,7 @@ class FluxPipeline(
                        guidance=guidance,
                        pooled_projections=negative_pooled_prompt_embeds,
                        encoder_hidden_states=negative_prompt_embeds,
-                        txt_ids=text_ids,
+                        txt_ids=negative_text_ids,
                        img_ids=latent_image_ids,
                        joint_attention_kwargs=self.joint_attention_kwargs,
                        return_dict=False,
@@ -800,17 +800,20 @@ class FluxControlNetImg2ImgPipeline(DiffusionPipeline, FluxLoraLoaderMixin, From
            )
            height, width = control_image.shape[-2:]

-            control_image = retrieve_latents(self.vae.encode(control_image), generator=generator)
-            control_image = (control_image - self.vae.config.shift_factor) * self.vae.config.scaling_factor
+            # xlab controlnet has an input_hint_block and instantx controlnet does not
+            controlnet_blocks_repeat = self.controlnet.input_hint_block is not None
+            if self.controlnet.input_hint_block is None:
+                control_image = retrieve_latents(self.vae.encode(control_image), generator=generator)
+                control_image = (control_image - self.vae.config.shift_factor) * self.vae.config.scaling_factor

-            height_control_image, width_control_image = control_image.shape[2:]
-            control_image = self._pack_latents(
-                control_image,
-                batch_size * num_images_per_prompt,
-                num_channels_latents,
-                height_control_image,
-                width_control_image,
-            )
+                height_control_image, width_control_image = control_image.shape[2:]
+                control_image = self._pack_latents(
+                    control_image,
+                    batch_size * num_images_per_prompt,
+                    num_channels_latents,
+                    height_control_image,
+                    width_control_image,
+                )

            if control_mode is not None:
                control_mode = torch.tensor(control_mode).to(device, dtype=torch.long)
@@ -819,7 +822,9 @@ class FluxControlNetImg2ImgPipeline(DiffusionPipeline, FluxLoraLoaderMixin, From
        elif isinstance(self.controlnet, FluxMultiControlNetModel):
            control_images = []

-            for control_image_ in control_image:
+            # xlab controlnet has an input_hint_block and instantx controlnet does not
+            controlnet_blocks_repeat = self.controlnet.nets[0].input_hint_block is not None
+            for i, control_image_ in enumerate(control_image):
                control_image_ = self.prepare_image(
                    image=control_image_,
                    width=width,
@@ -831,17 +836,18 @@ class FluxControlNetImg2ImgPipeline(DiffusionPipeline, FluxLoraLoaderMixin, From
                )
                height, width = control_image_.shape[-2:]

-                control_image_ = retrieve_latents(self.vae.encode(control_image_), generator=generator)
-                control_image_ = (control_image_ - self.vae.config.shift_factor) * self.vae.config.scaling_factor
+                if self.controlnet.nets[0].input_hint_block is None:
+                    control_image_ = retrieve_latents(self.vae.encode(control_image_), generator=generator)
+                    control_image_ = (control_image_ - self.vae.config.shift_factor) * self.vae.config.scaling_factor

-                height_control_image, width_control_image = control_image_.shape[2:]
-                control_image_ = self._pack_latents(
-                    control_image_,
-                    batch_size * num_images_per_prompt,
-                    num_channels_latents,
-                    height_control_image,
-                    width_control_image,
-                )
+                    height_control_image, width_control_image = control_image_.shape[2:]
+                    control_image_ = self._pack_latents(
+                        control_image_,
+                        batch_size * num_images_per_prompt,
+                        num_channels_latents,
+                        height_control_image,
+                        width_control_image,
+                    )

                control_images.append(control_image_)

@@ -955,6 +961,7 @@ class FluxControlNetImg2ImgPipeline(DiffusionPipeline, FluxLoraLoaderMixin, From
                    img_ids=latent_image_ids,
                    joint_attention_kwargs=self.joint_attention_kwargs,
                    return_dict=False,
+                    controlnet_blocks_repeat=controlnet_blocks_repeat,
                )[0]

                latents_dtype = latents.dtype
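
Note on the `input_hint_block` check in the FluxControlNetImg2ImgPipeline hunks above: when the ControlNet has no hint block, the control image is VAE-encoded and packed, otherwise it is passed on as prepared and `controlnet_blocks_repeat` is set. The sketch below captures that decision with a stub object. `StubControlNet` and `plan_control_image` are hypothetical names, and the returned dictionary is only an illustration of the two outcomes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StubControlNet:
    """Stand-in exposing only the attribute the pipeline inspects."""

    input_hint_block: Optional[object] = None


def plan_control_image(controlnet: StubControlNet) -> Dict[str, bool]:
    has_hint_block = controlnet.input_hint_block is not None
    return {
        # Passed on to the transformer call in the diff.
        "controlnet_blocks_repeat": has_hint_block,
        # Encoding and packing happen only when there is no hint block.
        "encode_with_vae": not has_hint_block,
    }


print(plan_control_image(StubControlNet()))
print(plan_control_image(StubControlNet(input_hint_block=object())))
```

For `FluxMultiControlNetModel`, the diff applies the same check to `self.controlnet.nets[0]`, so the first ControlNet decides for the whole group.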
@@ -13,6 +13,7 @@ from transformers import (
 )

 from ...image_processor import VaeImageProcessor
+from ...loaders import HiDreamImageLoraLoaderMixin
 from ...models import AutoencoderKL, HiDreamImageTransformer2DModel
 from ...schedulers import FlowMatchEulerDiscreteScheduler, UniPCMultistepScheduler
 from ...utils import deprecate, is_torch_xla_available, logging, replace_example_docstring
@@ -142,7 +143,7 @@ def retrieve_timesteps(
    return timesteps, num_inference_steps


-class HiDreamImagePipeline(DiffusionPipeline):
+class HiDreamImagePipeline(DiffusionPipeline, HiDreamImageLoraLoaderMixin):
    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->text_encoder_4->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds_t5", "prompt_embeds_llama3", "pooled_prompt_embeds"]

@@ -822,13 +823,13 @@ class HiDreamImagePipeline(DiffusionPipeline):

        if prompt_embeds is not None:
            deprecation_message = "The `prompt_embeds` argument is deprecated. Please use `prompt_embeds_t5` and `prompt_embeds_llama3` instead."
-            deprecate("prompt_embeds", "0.34.0", deprecation_message)
+            deprecate("prompt_embeds", "0.35.0", deprecation_message)
            prompt_embeds_t5 = prompt_embeds[0]
            prompt_embeds_llama3 = prompt_embeds[1]

        if negative_prompt_embeds is not None:
            deprecation_message = "The `negative_prompt_embeds` argument is deprecated. Please use `negative_prompt_embeds_t5` and `negative_prompt_embeds_llama3` instead."
-            deprecate("negative_prompt_embeds", "0.34.0", deprecation_message)
+            deprecate("negative_prompt_embeds", "0.35.0", deprecation_message)
            negative_prompt_embeds_t5 = negative_prompt_embeds[0]
            negative_prompt_embeds_llama3 = negative_prompt_embeds[1]

@@ -14,14 +14,13 @@

 from typing import Callable, List, Optional, Union

-import numpy as np
 import PIL.Image
 import torch
-from PIL import Image
 from transformers import (
    XLMRobertaTokenizer,
 )

+from ...image_processor import VaeImageProcessor
 from ...models import UNet2DConditionModel, VQModel
 from ...schedulers import DDIMScheduler
 from ...utils import (
@@ -95,15 +94,6 @@ def get_new_h_w(h, w, scale_factor=8):
    return new_h * scale_factor, new_w * scale_factor


-def prepare_image(pil_image, w=512, h=512):
-    pil_image = pil_image.resize((w, h), resample=Image.BICUBIC, reducing_gap=1)
-    arr = np.array(pil_image.convert("RGB"))
-    arr = arr.astype(np.float32) / 127.5 - 1
-    arr = np.transpose(arr, [2, 0, 1])
-    image = torch.from_numpy(arr).unsqueeze(0)
-    return image
-
-
 class KandinskyImg2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for image-to-image generation using Kandinsky
@@ -143,7 +133,16 @@ class KandinskyImg2ImgPipeline(DiffusionPipeline):
            scheduler=scheduler,
            movq=movq,
        )
-        self.movq_scale_factor = 2 ** (len(self.movq.config.block_out_channels) - 1)
+        self.movq_scale_factor = (
+            2 ** (len(self.movq.config.block_out_channels) - 1) if getattr(self, "movq", None) else 8
+        )
+        movq_latent_channels = self.movq.config.latent_channels if getattr(self, "movq", None) else 4
+        self.image_processor = VaeImageProcessor(
+            vae_scale_factor=self.movq_scale_factor,
+            vae_latent_channels=movq_latent_channels,
+            resample="bicubic",
+            reducing_gap=1,
+        )

    def get_timesteps(self, num_inference_steps, strength, device):
        # get the original timestep using init_timestep
@@ -417,7 +416,7 @@ class KandinskyImg2ImgPipeline(DiffusionPipeline):
                f"Input is in incorrect format: {[type(i) for i in image]}. Currently, we only support  PIL image and pytorch tensor"
            )

-        image = torch.cat([prepare_image(i, width, height) for i in image], dim=0)
+        image = torch.cat([self.image_processor.preprocess(i, width, height) for i in image], dim=0)
        image = image.to(dtype=prompt_embeds.dtype, device=device)

        latents = self.movq.encode(image)["latents"]
@@ -498,13 +497,7 @@ class KandinskyImg2ImgPipeline(DiffusionPipeline):
        if output_type not in ["pt", "np", "pil"]:
            raise ValueError(f"Only the output types `pt`, `pil` and `np` are supported not output_type={output_type}")

-        if output_type in ["np", "pil"]:
-            image = image * 0.5 + 0.5
-            image = image.clamp(0, 1)
-            image = image.cpu().permute(0, 2, 3, 1).float().numpy()
-
-        if output_type == "pil":
-            image = self.numpy_to_pil(image)
+        image = self.image_processor.postprocess(image, output_type)

        if not return_dict:
            return (image,)
@@ -14,11 +14,10 @@

 from typing import Callable, List, Optional, Union

-import numpy as np
 import PIL.Image
 import torch
-from PIL import Image

+from ...image_processor import VaeImageProcessor
 from ...models import UNet2DConditionModel, VQModel
 from ...schedulers import DDPMScheduler
 from ...utils import (
@@ -105,27 +104,6 @@ EXAMPLE_DOC_STRING = """
 """


-# Copied from diffusers.pipelines.kandinsky2_2.pipeline_kandinsky2_2.downscale_height_and_width
-def downscale_height_and_width(height, width, scale_factor=8):
-    new_height = height // scale_factor**2
-    if height % scale_factor**2 != 0:
-        new_height += 1
-    new_width = width // scale_factor**2
-    if width % scale_factor**2 != 0:
-        new_width += 1
-    return new_height * scale_factor, new_width * scale_factor
-
-
-# Copied from diffusers.pipelines.kandinsky.pipeline_kandinsky_img2img.prepare_image
-def prepare_image(pil_image, w=512, h=512):
-    pil_image = pil_image.resize((w, h), resample=Image.BICUBIC, reducing_gap=1)
-    arr = np.array(pil_image.convert("RGB"))
-    arr = arr.astype(np.float32) / 127.5 - 1
-    arr = np.transpose(arr, [2, 0, 1])
-    image = torch.from_numpy(arr).unsqueeze(0)
-    return image
-
-
 class KandinskyV22ControlnetImg2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for image-to-image generation using Kandinsky
@@ -157,7 +135,14 @@ class KandinskyV22ControlnetImg2ImgPipeline(DiffusionPipeline):
            scheduler=scheduler,
            movq=movq,
        )
-        self.movq_scale_factor = 2 ** (len(self.movq.config.block_out_channels) - 1)
+        movq_scale_factor = 2 ** (len(self.movq.config.block_out_channels) - 1) if getattr(self, "movq", None) else 8
+        movq_latent_channels = self.movq.config.latent_channels if getattr(self, "movq", None) else 4
+        self.image_processor = VaeImageProcessor(
+            vae_scale_factor=movq_scale_factor,
+            vae_latent_channels=movq_latent_channels,
+            resample="bicubic",
+            reducing_gap=1,
+        )

    # Copied from diffusers.pipelines.kandinsky.pipeline_kandinsky_img2img.KandinskyImg2ImgPipeline.get_timesteps
    def get_timesteps(self, num_inference_steps, strength, device):
@@ -316,7 +301,7 @@ class KandinskyV22ControlnetImg2ImgPipeline(DiffusionPipeline):
                f"Input is in incorrect format: {[type(i) for i in image]}. Currently, we only support  PIL image and pytorch tensor"
            )

-        image = torch.cat([prepare_image(i, width, height) for i in image], dim=0)
+        image = torch.cat([self.image_processor.preprocess(i, width, height) for i in image], dim=0)
        image = image.to(dtype=image_embeds.dtype, device=device)

        latents = self.movq.encode(image)["latents"]
@@ -324,7 +309,6 @@ class KandinskyV22ControlnetImg2ImgPipeline(DiffusionPipeline):
        self.scheduler.set_timesteps(num_inference_steps, device=device)
        timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, strength, device)
        latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
-        height, width = downscale_height_and_width(height, width, self.movq_scale_factor)
        latents = self.prepare_latents(
            latents, latent_timestep, batch_size, num_images_per_prompt, image_embeds.dtype, device, generator
        )
@@ -379,13 +363,7 @@ class KandinskyV22ControlnetImg2ImgPipeline(DiffusionPipeline):
        if output_type not in ["pt", "np", "pil"]:
            raise ValueError(f"Only the output types `pt`, `pil` and `np` are supported not output_type={output_type}")

-        if output_type in ["np", "pil"]:
-            image = image * 0.5 + 0.5
-            image = image.clamp(0, 1)
-            image = image.cpu().permute(0, 2, 3, 1).float().numpy()
-
-        if output_type == "pil":
-            image = self.numpy_to_pil(image)
+        image = self.image_processor.postprocess(image, output_type)

        if not return_dict:
            return (image,)
@@ -14,11 +14,10 @@

 from typing import Callable, Dict, List, Optional, Union

-import numpy as np
 import PIL.Image
 import torch
-from PIL import Image

+from ...image_processor import VaeImageProcessor
 from ...models import UNet2DConditionModel, VQModel
 from ...schedulers import DDPMScheduler
 from ...utils import deprecate, is_torch_xla_available, logging
@@ -76,27 +75,6 @@ EXAMPLE_DOC_STRING = """
 """


-# Copied from diffusers.pipelines.kandinsky2_2.pipeline_kandinsky2_2.downscale_height_and_width
-def downscale_height_and_width(height, width, scale_factor=8):
-    new_height = height // scale_factor**2
-    if height % scale_factor**2 != 0:
-        new_height += 1
-    new_width = width // scale_factor**2
-    if width % scale_factor**2 != 0:
-        new_width += 1
-    return new_height * scale_factor, new_width * scale_factor
-
-
-# Copied from diffusers.pipelines.kandinsky.pipeline_kandinsky_img2img.prepare_image
-def prepare_image(pil_image, w=512, h=512):
-    pil_image = pil_image.resize((w, h), resample=Image.BICUBIC, reducing_gap=1)
-    arr = np.array(pil_image.convert("RGB"))
-    arr = arr.astype(np.float32) / 127.5 - 1
-    arr = np.transpose(arr, [2, 0, 1])
-    image = torch.from_numpy(arr).unsqueeze(0)
-    return image
-
-
 class KandinskyV22Img2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for image-to-image generation using Kandinsky
@@ -129,7 +107,14 @@ class KandinskyV22Img2ImgPipeline(DiffusionPipeline):
            scheduler=scheduler,
            movq=movq,
        )
-        self.movq_scale_factor = 2 ** (len(self.movq.config.block_out_channels) - 1)
+        movq_scale_factor = 2 ** (len(self.movq.config.block_out_channels) - 1) if getattr(self, "movq", None) else 8
+        movq_latent_channels = self.movq.config.latent_channels if getattr(self, "movq", None) else 4
+        self.image_processor = VaeImageProcessor(
+            vae_scale_factor=movq_scale_factor,
+            vae_latent_channels=movq_latent_channels,
+            resample="bicubic",
+            reducing_gap=1,
+        )

    # Copied from diffusers.pipelines.kandinsky.pipeline_kandinsky_img2img.KandinskyImg2ImgPipeline.get_timesteps
    def get_timesteps(self, num_inference_steps, strength, device):
@@ -319,7 +304,7 @@ class KandinskyV22Img2ImgPipeline(DiffusionPipeline):
                f"Input is in incorrect format: {[type(i) for i in image]}. Currently, we only support  PIL image and pytorch tensor"
            )

-        image = torch.cat([prepare_image(i, width, height) for i in image], dim=0)
+        image = torch.cat([self.image_processor.preprocess(i, width, height) for i in image], dim=0)
        image = image.to(dtype=image_embeds.dtype, device=device)

        latents = self.movq.encode(image)["latents"]
@@ -327,7 +312,6 @@ class KandinskyV22Img2ImgPipeline(DiffusionPipeline):
        self.scheduler.set_timesteps(num_inference_steps, device=device)
        timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, strength, device)
        latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
-        height, width = downscale_height_and_width(height, width, self.movq_scale_factor)
        latents = self.prepare_latents(
            latents, latent_timestep, batch_size, num_images_per_prompt, image_embeds.dtype, device, generator
        )
@@ -383,21 +367,9 @@ class KandinskyV22Img2ImgPipeline(DiffusionPipeline):
            if XLA_AVAILABLE:
                xm.mark_step()

-        if output_type not in ["pt", "np", "pil", "latent"]:
-            raise ValueError(
-                f"Only the output types `pt`, `pil` ,`np` and `latent` are supported not output_type={output_type}"
-            )
-
        if not output_type == "latent":
-            # post-processing
            image = self.movq.decode(latents, force_not_quantize=True)["sample"]
-            if output_type in ["np", "pil"]:
-                image = image * 0.5 + 0.5
-                image = image.clamp(0, 1)
-                image = image.cpu().permute(0, 2, 3, 1).float().numpy()
-
-            if output_type == "pil":
-                image = self.numpy_to_pil(image)
+            image = self.image_processor.postprocess(image, output_type)
        else:
            image = latents

@@ -1,12 +1,12 @@
 import inspect
 from typing import Callable, Dict, List, Optional, Union

-import numpy as np
 import PIL
 import PIL.Image
 import torch
 from transformers import T5EncoderModel, T5Tokenizer

+from ...image_processor import VaeImageProcessor
 from ...loaders import StableDiffusionLoraLoaderMixin
 from ...models import Kandinsky3UNet, VQModel
 from ...schedulers import DDPMScheduler
@@ -53,24 +53,6 @@ EXAMPLE_DOC_STRING = """
 """


-def downscale_height_and_width(height, width, scale_factor=8):
-    new_height = height // scale_factor**2
-    if height % scale_factor**2 != 0:
-        new_height += 1
-    new_width = width // scale_factor**2
-    if width % scale_factor**2 != 0:
-        new_width += 1
-    return new_height * scale_factor, new_width * scale_factor
-
-
-def prepare_image(pil_image):
-    arr = np.array(pil_image.convert("RGB"))
-    arr = arr.astype(np.float32) / 127.5 - 1
-    arr = np.transpose(arr, [2, 0, 1])
-    image = torch.from_numpy(arr).unsqueeze(0)
-    return image
-
-
 class Kandinsky3Img2ImgPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixin):
    model_cpu_offload_seq = "text_encoder->movq->unet->movq"
    _callback_tensor_inputs = [
@@ -94,6 +76,14 @@ class Kandinsky3Img2ImgPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixi
        self.register_modules(
            tokenizer=tokenizer, text_encoder=text_encoder, unet=unet, scheduler=scheduler, movq=movq
        )
+        movq_scale_factor = 2 ** (len(self.movq.config.block_out_channels) - 1) if getattr(self, "movq", None) else 8
+        movq_latent_channels = self.movq.config.latent_channels if getattr(self, "movq", None) else 4
+        self.image_processor = VaeImageProcessor(
+            vae_scale_factor=movq_scale_factor,
+            vae_latent_channels=movq_latent_channels,
+            resample="bicubic",
+            reducing_gap=1,
+        )

    def get_timesteps(self, num_inference_steps, strength, device):
        # get the original timestep using init_timestep
@@ -566,7 +556,7 @@ class Kandinsky3Img2ImgPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixi
                f"Input is in incorrect format: {[type(i) for i in image]}. Currently, we only support  PIL image and pytorch tensor"
            )

-        image = torch.cat([prepare_image(i) for i in image], dim=0)
+        image = torch.cat([self.image_processor.preprocess(i) for i in image], dim=0)
        image = image.to(dtype=prompt_embeds.dtype, device=device)
        # 4. Prepare timesteps
        self.scheduler.set_timesteps(num_inference_steps, device=device)
@@ -630,20 +620,9 @@ class Kandinsky3Img2ImgPipeline(DiffusionPipeline, StableDiffusionLoraLoaderMixi
                    xm.mark_step()

            # post-processing
-            if output_type not in ["pt", "np", "pil", "latent"]:
-                raise ValueError(
-                    f"Only the output types `pt`, `pil`, `np` and `latent` are supported not output_type={output_type}"
-                )
            if not output_type == "latent":
                image = self.movq.decode(latents, force_not_quantize=True)["sample"]
-
-                if output_type in ["np", "pil"]:
-                    image = image * 0.5 + 0.5
-                    image = image.clamp(0, 1)
-                    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
-
-                if output_type == "pil":
-                    image = self.numpy_to_pil(image)
+                image = self.image_processor.postprocess(image, output_type)
            else:
                image = latents
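
Review note on the Kandinsky hunks above: the removed module-level helpers did simple arithmetic, which this diff now routes through `VaeImageProcessor.preprocess` / `postprocess`. The sketch below restates that arithmetic in plain Python for reference. `downscale_height_and_width` is copied from the removed lines; `normalize_pixel` and `denormalize_pixel` are hypothetical per-value names for the `/ 127.5 - 1` step in `prepare_image` and the `* 0.5 + 0.5` plus clamp step in the old output path. Whether `VaeImageProcessor` reproduces these numbers exactly is not shown in the diff, so treat this as a description of the old behaviour only.

```python
def downscale_height_and_width(height, width, scale_factor=8):
    # Copied from the removed helper: round each side up to the next
    # multiple of scale_factor**2, then divide by scale_factor.
    new_height = height // scale_factor**2
    if height % scale_factor**2 != 0:
        new_height += 1
    new_width = width // scale_factor**2
    if width % scale_factor**2 != 0:
        new_width += 1
    return new_height * scale_factor, new_width * scale_factor


def normalize_pixel(value):
    # Hypothetical name. Per-value form of the removed prepare_image step:
    # maps a uint8 value in [0, 255] to a float in [-1, 1].
    return value / 127.5 - 1


def denormalize_pixel(value):
    # Hypothetical name. Per-value form of the removed output step:
    # maps [-1, 1] to [0, 1] and clamps anything outside that range.
    return min(max(value * 0.5 + 0.5, 0.0), 1.0)


if __name__ == "__main__":
    print(downscale_height_and_width(768, 768, 8))
    print(downscale_height_and_width(500, 500, 8))
    print(normalize_pixel(0), normalize_pixel(255))
    print(denormalize_pixel(-1.0), denormalize_pixel(3.0))
```

With `scale_factor=8`, a side that is not a multiple of 64 is rounded up before the division, so 500 maps to the same latent size as 512.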

@@ -501,7 +501,7 @@ class LattePipeline(DiffusionPipeline):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -534,7 +534,7 @@ class LuminaPipeline(DiffusionPipeline):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
@@ -488,7 +488,7 @@ class PixArtSigmaPAGPipeline(DiffusionPipeline, PAGMixin):
        # &amp
        caption = re.sub(r"&amp", "", caption)

-        # ip adresses:
+        # ip addresses:
        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)

        # article ids:
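
Review note on the three caption-cleaning hunks above: only the comment spelling changes, and the substitution itself is untouched. For reference, here is a minimal sketch of what that one line does in isolation. The wrapper name `strip_ip_addresses` and the constant `IP_PATTERN` are hypothetical; the pattern and the single-space replacement are taken from the context lines.

```python
import re

# Pattern copied from the context lines of the diff: four dot-separated
# groups of one to three digits. It does not check that each group is <= 255.
IP_PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"


def strip_ip_addresses(caption):
    # Hypothetical wrapper: replace every match with a single space,
    # as the pipelines' caption-cleaning code does.
    return re.sub(IP_PATTERN, " ", caption)


if __name__ == "__main__":
    print(repr(strip_ip_addresses("ping 192.168.0.1 now")))
    print(repr(strip_ip_addresses("release v1.2.3.4")))
```

Because the pattern only counts digits and dots, it also removes four-part version strings such as `1.2.3.4`, which may or may not be intended for captions.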
Author	SHA1	Message	Date
sayakpaul	8d8621ec72	Merge branch 'main' into folderize-loaders	2025-04-29 00:28:06 +08:00
sayakpaul	0c37895440	consistency	2025-04-29 00:26:28 +08:00
Linoy Tsaban	0ac1d5b482	[Hi-Dream LoRA] fix bug in validation (#11439 ) remove unnecessary pipeline moving to cpu in validation Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2025-04-28 06:22:32 -10:00
sayakpaul	9bebdf225d	fix repo consistency.	2025-04-29 00:11:52 +08:00
sayakpaul	c05114d5ec	resolve conflicts.	2025-04-29 00:04:19 +08:00
sayakpaul	a57a5ab4c0	resolve conflicts.	2025-04-29 00:02:38 +08:00
Yao Matrix	7567adfc45	enable 28 GGUF test cases on XPU (#11404 ) * enable gguf test cases on XPU Signed-off-by: YAO Matrix <matrix.yao@intel.com> * make SD35LargeGGUFSingleFileTests::test_pipeline_inference pas Signed-off-by: root <root@a4bf01945cfe.jf.intel.com> * make FluxControlLoRAGGUFTests::test_lora_loading pass Signed-off-by: Yao Matrix <matrix.yao@intel.com> * polish code Signed-off-by: Yao Matrix <matrix.yao@intel.com> * Apply style fixes --------- Signed-off-by: YAO Matrix <matrix.yao@intel.com> Signed-off-by: root <root@a4bf01945cfe.jf.intel.com> Signed-off-by: Yao Matrix <matrix.yao@intel.com> Co-authored-by: root <root@a4bf01945cfe.jf.intel.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-04-28 21:32:04 +05:30
tongyu	3da98e7ee3	[train_text_to_image_lora] Better image interpolation in training scripts follow up (#11427 ) * Update train_text_to_image_lora.py * update_train_text_to_image_lora	2025-04-28 11:23:24 -04:00
tongyu	b3b04fefde	[train_text_to_image] Better image interpolation in training scripts follow up (#11426 ) * Update train_text_to_image.py * update	2025-04-28 10:50:33 -04:00
Sayak Paul	0e3f2713c2	[tests] fix import. (#11434 ) fix import.	2025-04-28 13:32:28 +08:00
Yao Matrix	a7e9f85e21	enable test_layerwise_casting_memory cases on XPU (#11406 ) * enable test_layerwise_casting_memory cases on XPU Signed-off-by: Yao Matrix <matrix.yao@intel.com> * fix style Signed-off-by: Yao Matrix <matrix.yao@intel.com> --------- Signed-off-by: Yao Matrix <matrix.yao@intel.com>	2025-04-28 06:38:39 +05:30
Yao Matrix	9ce89e2efa	enable group_offload cases and quanto cases on XPU (#11405 ) * enable group_offload cases and quanto cases on XPU Signed-off-by: YAO Matrix <matrix.yao@intel.com> * use backend APIs Signed-off-by: Yao Matrix <matrix.yao@intel.com> * fix style Signed-off-by: Yao Matrix <matrix.yao@intel.com> --------- Signed-off-by: YAO Matrix <matrix.yao@intel.com> Signed-off-by: Yao Matrix <matrix.yao@intel.com>	2025-04-28 06:37:16 +05:30
Sayak Paul	aa5f5d41d6	[tests] add tests to check for graph breaks, recompilation, cuda syncs in pipelines during torch.compile() (#11085 ) * test for better torch.compile stuff. * fixes * recompilation and graph break. * clear compilation cache. * change to modeling level test. * allow running compilation tests during nightlies.	2025-04-28 08:36:33 +08:00
Mert Erbak	bd96a084d3	[train_dreambooth_lora.py] Set LANCZOS as default interpolation mode for resizing (#11421 ) * Set LANCZOS as default interpolation mode for resizing * [train_dreambooth_lora.py] Set LANCZOS as default interpolation mode for resizing	2025-04-26 01:58:41 -04:00
co63oc	f00a995753	Fix typos in strings and comments (#11407 )	2025-04-24 08:53:47 -10:00
Ishan Modi	e8312e7ca9	[BUG] fixed WAN docstring (#11226 ) update	2025-04-24 08:49:37 -10:00
Emiliano	7986834572	Fix Flux IP adapter argument in the pipeline example (#11402 ) Fix Flux IP adapter argument in the example IP-Adapter example had a wrong argument. Fix `true_cfg` -> `true_cfg_scale`	2025-04-24 08:41:12 -10:00
Linoy Tsaban	edd7880418	[HiDream LoRA] optimizations + small updates (#11381 ) * 1. add pre-computation of prompt embeddings when custom prompts are used as well 2. save model card even if model is not pushed to hub 3. remove scheduler initialization from code example - not necessary anymore (it's now if the base model's config) 4. add skip_final_inference - to allow to run with validation, but skip the final loading of the pipeline with the lora weights to reduce memory reqs * pre encode validation prompt as well * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * pre encode validation prompt as well * Apply style fixes * empty commit * change default trained modules * empty commit * address comments + change encoding of validation prompt (before it was only pre-encoded if custom prompts are provided, but should be pre-encoded either way) * Apply style fixes * empty commit * fix validation_embeddings definition * fix final inference condition * fix pipeline deletion in last inference * Apply style fixes * empty commit * layers * remove readme remarks on only pre-computing when instance prompt is provided and change example to 3d icons * smol fix * empty commit --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-04-24 07:48:19 +03:00
Teriks	b4be42282d	Kolors additional pipelines, community contrib (#11372 ) * Kolors additional pipelines, community contrib --------- Co-authored-by: Teriks <Teriks@users.noreply.github.com> Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>	2025-04-23 11:07:27 -10:00
Ishan Modi	a4f9c3cbc3	[Feature] Added Xlab Controlnet support (#11249 ) update	2025-04-23 10:43:50 -10:00
Ishan Dutta	4b60f4b602	[train_dreambooth_flux] Add LANCZOS as the default interpolation mode for image resizing (#11395 )	2025-04-23 10:47:05 -04:00
Aryan	6cef71de3a	Fix group offloading with block_level and use_stream=True (#11375 ) * fix * add tests * add message check	2025-04-23 18:17:53 +05:30
Ameer Azam	026507c06c	Update README_hidream.md (#11386 ) Small change requirements_sana.txt to requirements_hidream.txt	2025-04-22 20:08:26 -04:00
YiYi Xu	448c72a230	[HiDream] move deprecation to 0.35.0 (#11384 ) up	2025-04-22 08:08:08 -10:00
Aryan	f108ad8888	Update modeling imports (#11129 ) update	2025-04-22 06:59:25 -10:00
Linoy Tsaban	e30d3bf544	[LoRA] add LoRA support to HiDream and fine-tuning script (#11281 ) * initial commit * initial commit * initial commit * initial commit * initial commit * initial commit * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: Bagheera <59658056+bghira@users.noreply.github.com> * move prompt embeds, pooled embeds outside * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: hlky <hlky@hlky.ac> * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: hlky <hlky@hlky.ac> * fix import * fix import and tokenizer 4, text encoder 4 loading * te * prompt embeds * fix naming * shapes * initial commit to add HiDreamImageLoraLoaderMixin * fix init * add tests * loader * fix model input * add code example to readme * fix default max length of text encoders * prints * nullify training cond in unpatchify for temp fix to incompatible shaping of transformer output during training * smol fix * unpatchify * unpatchify * fix validation * flip pred and loss * fix shift!!! 
* revert unpatchify changes (for now) * smol fix * Apply style fixes * workaround moe training * workaround moe training * remove prints * to reduce some memory, keep vae in `weight_dtype` same as we have for flux (as it's the same vae) https://github.com/huggingface/diffusers/blob/bbd0c161b55ba2234304f1e6325832dd69c60565/examples/dreambooth/train_dreambooth_lora_flux.py#L1207 * refactor to align with HiDream refactor * refactor to align with HiDream refactor * refactor to align with HiDream refactor * add support for cpu offloading of text encoders * Apply style fixes * adjust lr and rank for train example * fix copies * Apply style fixes * update README * update README * update README * fix license * keep prompt2,3,4 as None in validation * remove reverse ode comment * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * Update examples/dreambooth/train_dreambooth_lora_hidream.py Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> * vae offload change * fix text encoder offloading * Apply style fixes * cleaner to_kwargs * fix module name in copied from * add requirements * fix offloading * fix offloading * fix offloading * update transformers version in reqs * try AutoTokenizer * try AutoTokenizer * Apply style fixes * empty commit * Delete tests/lora/test_lora_layers_hidream.py * change tokenizer_4 to load with AutoTokenizer as well * make text_encoder_four and tokenizer_four configurable * save model card * save model card * revert T5 * fix test * remove non diffusers lumina2 conversion --------- Co-authored-by: Bagheera <59658056+bghira@users.noreply.github.com> Co-authored-by: hlky <hlky@hlky.ac> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-04-22 11:44:02 +03:00
apolinário	6ab62c7431	Add stochastic sampling to FlowMatchEulerDiscreteScheduler (#11369 ) * Add stochastic sampling to FlowMatchEulerDiscreteScheduler This PR adds stochastic sampling to FlowMatchEulerDiscreteScheduler based on https://github.com/Lightricks/LTX-Video/commit/b1aeddd7ccac85e6d1b0d97762610ddb53c1b408 ltx_video/schedulers/rf.py * Apply style fixes * Use config value directly * Apply style fixes * Swap order * Update src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py Co-authored-by: YiYi Xu <yixu310@gmail.com> * Update src/diffusers/schedulers/scheduling_flow_match_euler_discrete.py Co-authored-by: YiYi Xu <yixu310@gmail.com> --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: YiYi Xu <yixu310@gmail.com>	2025-04-21 17:18:30 -10:00
Ishan Modi	f59df3bb8b	[Refactor] Minor Improvement for import utils (#11161 ) * update * update * addressed PR comments * update --------- Co-authored-by: YiYi Xu <yixu310@gmail.com>	2025-04-21 09:56:55 -10:00
josephrocca	a00c73a5e1	Support different-length pos/neg prompts for FLUX.1-schnell variants like Chroma (#11120 ) Co-authored-by: YiYi Xu <yixu310@gmail.com>	2025-04-21 09:28:19 -10:00
OleehyO	0434db9a99	[cogview4][feat] Support attention mechanism with variable-length support and batch packing (#11349 ) * [cogview4] Enhance attention mechanism with variable-length support and batch packing --------- Co-authored-by: YiYi Xu <yixu310@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-04-21 09:27:55 -10:00
Aamir Nazir	aff574fb29	Add Serialized Type Name kwarg in Model Output (#10502 ) * Update outputs.py	2025-04-21 08:45:28 -10:00
Ishan Modi	79ea8eb258	[BUG] fixes in kadinsky pipeline (#11080 ) * bug fix kadinsky pipeline	2025-04-21 08:41:09 -10:00
Aryan	e7f3a73786	Fix Wan I2V prepare_latents dtype (#11371 ) update	2025-04-21 08:18:50 -10:00
PromeAI	7a4a126db8	fix issue that training flux controlnet was unstable and validation r… (#11373 ) * fix issue that training flux controlnet was unstable and validation results were unstable * del unused code pieces, fix grammar --------- Co-authored-by: Your Name <you@example.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2025-04-21 08:16:05 -10:00
sayakpaul	4b1c7dc81a	resolve conflicts.	2025-04-17 09:27:54 +05:30
Sayak Paul	1590325a60	Merge branch 'main' into folderize-loaders	2025-04-16 18:41:54 +05:30
sayakpaul	e4dd7c5333	updates	2025-04-16 18:26:27 +05:30
sayakpaul	d6430c79a3	updates	2025-04-16 18:11:39 +05:30
sayakpaul	1597ae6ac9	updates	2025-04-16 17:37:17 +05:30
sayakpaul	11a23d11fe	updates	2025-04-16 17:29:26 +05:30
sayakpaul	6b8b225aca	single file utils.	2025-04-16 17:26:30 +05:30
sayakpaul	27d2401e59	partially complete single_file_utils	2025-04-16 16:52:54 +05:30
sayakpaul	1ddfe14220	single_file	2025-04-16 16:30:39 +05:30
sayakpaul	0e8d1d25eb	ip_adapter	2025-04-16 15:59:01 +05:30
sayakpaul	546446ae21	ip_adapter.	2025-04-16 15:52:57 +05:30
sayakpaul	ea3f0b8d68	update	2025-04-16 15:49:14 +05:30
sayakpaul	f0ea9ff2e2	deprecate lora loader from loaders easily.	2025-04-16 15:47:40 +05:30
sayakpaul	1b7c286974	fix	2025-04-16 15:09:23 +05:30
sayakpaul	6138cc1720	updates	2025-04-16 13:01:48 +05:30
sayakpaul	ea0ce4bfab	fixes	2025-04-16 12:50:09 +05:30
sayakpaul	f2aa2f91dc	fix	2025-04-16 12:46:04 +05:30
sayakpaul	4faac73219	update	2025-04-16 12:38:58 +05:30
sayakpaul	d870e3c9a6	update	2025-04-16 12:35:09 +05:30
sayakpaul	178b884673	updates	2025-04-16 12:29:16 +05:30
sayakpaul	2da3cb4a8c	fixes	2025-04-16 12:27:37 +05:30
sayakpaul	ea3ba4f431	fies	2025-04-16 12:26:30 +05:30
sayakpaul	21b2566933	fixes	2025-04-16 12:23:16 +05:30
sayakpaul	a71334b861	fixes	2025-04-16 12:22:12 +05:30
sayakpaul	eb47a67d50	fix	2025-04-16 12:18:30 +05:30
sayakpaul	8267677a24	start folderizing the loaders.	2025-04-16 12:02:06 +05:30