Update docs/source/en/quantization/gguf.md

Co-authored-by: Aryan <aryan@huggingface.co>
update
2024-12-18 17:36:27 +05:30 · 2024-12-18 10:48:20 +05:30
163 changed files with 1540 additions and 7968 deletions
@@ -359,8 +359,6 @@ jobs:
            test_location: "bnb"
          - backend: "gguf"
            test_location: "gguf"
-          - backend: "torchao"
-            test_location: "torchao"
    runs-on:
      group: aws-g6e-xlarge-plus
    container:
@@ -83,7 +83,7 @@ jobs:
          python utils/print_env.py
      - name: PyTorch CUDA checkpoint tests on Ubuntu
        env:
-          HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
          # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
          CUBLAS_WORKSPACE_CONFIG: :16:8
        run: |
@@ -137,7 +137,7 @@ jobs:

    - name: Run PyTorch CUDA tests
      env:
-        HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+        HF_TOKEN: ${{ secrets.HF_TOKEN }}
        # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
        CUBLAS_WORKSPACE_CONFIG: :16:8
      run: |
@@ -46,7 +46,7 @@ jobs:
      shell: arch -arch arm64 bash {0}
      run: |
        ${CONDA_RUN} python -m pip install --upgrade pip uv
-        ${CONDA_RUN} python -m uv pip install -e ".[quality,test]"
+        ${CONDA_RUN} python -m uv pip install -e [quality,test]
        ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio
        ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
        ${CONDA_RUN} python -m uv pip install transformers --upgrade
@@ -68,7 +68,7 @@ jobs:
      - name: Test installing diffusers and importing
        run: |
          pip install diffusers && pip uninstall diffusers -y
-          pip install -i https://test.pypi.org/simple/ diffusers
+          pip install -i https://testpypi.python.org/pypi diffusers
          python -c "from diffusers import __version__; print(__version__)"
          python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()"
          python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')"
@@ -238,8 +238,6 @@
      title: Textual Inversion
    - local: api/loaders/unet
      title: UNet
-    - local: api/loaders/transformer_sd3
-      title: SD3Transformer2D
    - local: api/loaders/peft
      title: PEFT
    title: Loaders
@@ -402,8 +400,6 @@
      title: DiT
    - local: api/pipelines/flux
      title: Flux
-    - local: api/pipelines/control_flux_inpaint
-      title: FluxControlInpaint
    - local: api/pipelines/hunyuandit
      title: Hunyuan-DiT
    - local: api/pipelines/hunyuan_video
@@ -429,7 +425,7 @@
    - local: api/pipelines/ledits_pp
      title: LEDITS++
    - local: api/pipelines/ltx_video
-      title: LTXVideo
+      title: LTX
    - local: api/pipelines/lumina
      title: Lumina-T2X
    - local: api/pipelines/marigold
@@ -86,8 +86,6 @@ An attention processor is a class for applying different types of attention mech

 [[autodoc]] models.attention_processor.IPAdapterAttnProcessor2_0

-[[autodoc]] models.attention_processor.SD3IPAdapterJointAttnProcessor2_0
-
 ## JointAttnProcessor2_0

 [[autodoc]] models.attention_processor.JointAttnProcessor2_0
@@ -24,12 +24,6 @@ Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading]

 [[autodoc]] loaders.ip_adapter.IPAdapterMixin

-## SD3IPAdapterMixin
-
-[[autodoc]] loaders.ip_adapter.SD3IPAdapterMixin
-    - all
-    - is_ip_adapter_active
-
 ## IPAdapterMaskProcessor

 [[autodoc]] image_processor.IPAdapterMaskProcessor
@@ -1,29 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# SD3Transformer2D
-
-This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and SD3Transformer2DModel, check [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead.
-
-The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs.
-
-<Tip>
-
-To learn more about how to load LoRA weights, see the [LoRA](../../using-diffusers/loading_adapters#lora) loading guide.
-
-</Tip>
-
-## SD3Transformer2DLoadersMixin
-
-[[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin
-    - all
-    - _load_ip_adapter_weights
@@ -18,7 +18,7 @@ The model can be loaded with the following code snippet.
 ```python
 from diffusers import AutoencoderKLHunyuanVideo

-vae = AutoencoderKLHunyuanVideo.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16)
+vae = AutoencoderKLHunyuanVideo.from_pretrained("tencent/HunyuanVideo", torch_dtype=torch.float16)
 ```

 ## AutoencoderKLHunyuanVideo
@@ -18,7 +18,7 @@ The model can be loaded with the following code snippet.
 ```python
 from diffusers import AutoencoderKLLTXVideo

-vae = AutoencoderKLLTXVideo.from_pretrained("Lightricks/LTX-Video", subfolder="vae", torch_dtype=torch.float32).to("cuda")
+vae = AutoencoderKLLTXVideo.from_pretrained("TODO/TODO", subfolder="vae", torch_dtype=torch.float32).to("cuda")
 ```

 ## AutoencoderKLLTXVideo
@@ -18,7 +18,7 @@ The model can be loaded with the following code snippet.
 ```python
 from diffusers import HunyuanVideoTransformer3DModel

-transformer = HunyuanVideoTransformer3DModel.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16)
+transformer = HunyuanVideoTransformer3DModel.from_pretrained("tencent/HunyuanVideo", torch_dtype=torch.bfloat16)
 ```

 ## HunyuanVideoTransformer3DModel
@@ -18,7 +18,7 @@ The model can be loaded with the following code snippet.
 ```python
 from diffusers import LTXVideoTransformer3DModel

-transformer = LTXVideoTransformer3DModel.from_pretrained("Lightricks/LTX-Video", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
+transformer = LTXVideoTransformer3DModel.from_pretrained("TODO/TODO", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
 ```

 ## LTXVideoTransformer3DModel
@@ -22,7 +22,7 @@ The model can be loaded with the following code snippet.
 ```python
 from diffusers import SanaTransformer2DModel

-transformer = SanaTransformer2DModel.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
+transformer = SanaTransformer2DModel.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="transformer", torch_dtype=torch.float16)
 ```

 ## SanaTransformer2DModel
@@ -1,89 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team, The Black Forest Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# FluxControlInpaint
-
-FluxControlInpaintPipeline is an implementation of Inpainting for Flux.1 Depth/Canny models. It is a pipeline that allows you to inpaint images using the Flux.1 Depth/Canny models. The pipeline takes an image and a mask as input and returns the inpainted image.
-
-FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**.
-
-| Control type | Developer | Link |
-| -------- | ---------- | ---- |
-| Depth | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) |
-| Canny | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) |
-
-
-<Tip>
-
-Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c).
-
-</Tip>
-
-```python
-import torch
-from diffusers import FluxControlInpaintPipeline
-from diffusers.models.transformers import FluxTransformer2DModel
-from transformers import T5EncoderModel
-from diffusers.utils import load_image, make_image_grid
-from image_gen_aux import DepthPreprocessor # https://github.com/huggingface/image_gen_aux
-from PIL import Image
-import numpy as np
-
-pipe = FluxControlInpaintPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-Depth-dev",
-    torch_dtype=torch.bfloat16,
-)
-# use following lines if you have GPU constraints
-# ---------------------------------------------------------------
-transformer = FluxTransformer2DModel.from_pretrained(
-    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="transformer", torch_dtype=torch.bfloat16
-)
-text_encoder_2 = T5EncoderModel.from_pretrained(
-    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="text_encoder_2", torch_dtype=torch.bfloat16
-)
-pipe.transformer = transformer
-pipe.text_encoder_2 = text_encoder_2
-pipe.enable_model_cpu_offload()
-# ---------------------------------------------------------------
-pipe.to("cuda")
-
-prompt = "a blue robot singing opera with human-like expressions"
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
-
-head_mask = np.zeros_like(image)
-head_mask[65:580,300:642] = 255
-mask_image = Image.fromarray(head_mask)
-
-processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
-control_image = processor(image)[0].convert("RGB")
-
-output = pipe(
-    prompt=prompt,
-    image=image,
-    control_image=control_image,
-    mask_image=mask_image,
-    num_inference_steps=30,
-    strength=0.9,
-    guidance_scale=10.0,
-    generator=torch.Generator().manual_seed(42),
-).images[0]
-make_image_grid([image, control_image, mask_image, output.resize(image.size)], rows=1, cols=4).save("output.png")
-```
-
-## FluxControlInpaintPipeline
-[[autodoc]] FluxControlInpaintPipeline
-	- all
-	- __call__
-
-
-## FluxPipelineOutput
-[[autodoc]] pipelines.flux.pipeline_output.FluxPipelineOutput
@@ -268,47 +268,6 @@ images = pipe(
 images[0].save("flux-redux.png")
 ```

-## Combining Flux Turbo LoRAs with Flux Control, Fill, and Redux
-
-We can combine Flux Turbo LoRAs with Flux Control and other pipelines like Fill and Redux to enable few-steps' inference. The example below shows how to do that for Flux Control LoRA for depth and turbo LoRA from [`ByteDance/Hyper-SD`](https://hf.co/ByteDance/Hyper-SD).
-
-```py
-from diffusers import FluxControlPipeline
-from image_gen_aux import DepthPreprocessor
-from diffusers.utils import load_image
-from huggingface_hub import hf_hub_download
-import torch
-
-control_pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
-control_pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora", adapter_name="depth")
-control_pipe.load_lora_weights(
-    hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"), adapter_name="hyper-sd"
-)
-control_pipe.set_adapters(["depth", "hyper-sd"], adapter_weights=[0.85, 0.125])
-control_pipe.enable_model_cpu_offload()
-
-prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
-control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
-
-processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
-control_image = processor(control_image)[0].convert("RGB")
-
-image = control_pipe(
-    prompt=prompt,
-    control_image=control_image,
-    height=1024,
-    width=1024,
-    num_inference_steps=8,
-    guidance_scale=10.0,
-    generator=torch.Generator().manual_seed(42),
-).images[0]
-image.save("output.png")
-```
-
-## Note about `unload_lora_weights()` when using Flux LoRAs
-
-When unloading the Control LoRA weights, call `pipe.unload_lora_weights(reset_to_overwritten_params=True)` to reset the `pipe.transformer` completely back to its original form. The resultant pipeline can then be used with methods like [`DiffusionPipeline.from_pipe`]. More details about this argument are available in [this PR](https://github.com/huggingface/diffusers/pull/10397).
-
 ## Running FP16 inference

 Flux can generate high-quality images with FP16 (i.e. to accelerate inference on Turing/Volta GPUs) but produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing text encoders to run with FP32 inference thus removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details.
@@ -29,7 +29,7 @@ Recommendations for inference:
 - Transformer should be in `torch.bfloat16`.
 - VAE should be in `torch.float16`.
 - `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
-- For smaller resolution videos, try lower values of `shift` (between `2.0` to `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution images, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
+- For smaller resolution images, try lower values of `shift` (between `2.0` to `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution images, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
 - For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).

 ## HunyuanVideoPipeline
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License. -->

-# LTX Video
+# LTX

 [LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video as well as image + text-to-video usecases.

@@ -22,24 +22,14 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m

 </Tip>

-Available models:
-
-|  Model name   | Recommended dtype |
-|:-------------:|:-----------------:|
-| [`LTX Video 0.9.0`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.safetensors) | `torch.bfloat16` |
-| [`LTX Video 0.9.1`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) | `torch.bfloat16` |
-
-Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either `torch.float32`, `torch.bfloat16` or `torch.float16` but the recommended dtype is `torch.bfloat16` as used in the original repository.
-
 ## Loading Single Files

-Loading the original LTX Video checkpoints is also possible with [`~ModelMixin.from_single_file`]. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
+Loading the original LTX Video checkpoints is also possible with [`~ModelMixin.from_single_file`].

 ```python
 import torch
 from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel

-# `single_file_url` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
 single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
 transformer = LTXVideoTransformer3DModel.from_single_file(
  single_file_url, torch_dtype=torch.bfloat16
@@ -71,72 +61,6 @@ pipe = LTXImageToVideoPipeline.from_single_file(
 )
 ```

-Loading [LTX GGUF checkpoints](https://huggingface.co/city96/LTX-Video-gguf) are also supported:
-
-```py
-import torch
-from diffusers.utils import export_to_video
-from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
-
-ckpt_path = (
-    "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf"
-)
-transformer = LTXVideoTransformer3DModel.from_single_file(
-    ckpt_path,
-    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
-    torch_dtype=torch.bfloat16,
-)
-pipe = LTXPipeline.from_pretrained(
-    "Lightricks/LTX-Video",
-    transformer=transformer,
-    torch_dtype=torch.bfloat16,
-)
-pipe.enable_model_cpu_offload()
-
-prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
-negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
-
-video = pipe(
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    width=704,
-    height=480,
-    num_frames=161,
-    num_inference_steps=50,
-).frames[0]
-export_to_video(video, "output_gguf_ltx.mp4", fps=24)
-```
-
-Make sure to read the [documentation on GGUF](../../quantization/gguf) to learn more about our GGUF support.
-
-<!-- TODO(aryan): Update this when official weights are supported -->
-
-Loading and running inference with [LTX Video 0.9.1](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) weights.
-
-```python
-import torch
-from diffusers import LTXPipeline
-from diffusers.utils import export_to_video
-
-pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
-pipe.to("cuda")
-
-prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
-negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
-
-video = pipe(
-    prompt=prompt,
-    negative_prompt=negative_prompt,
-    width=768,
-    height=512,
-    num_frames=161,
-    decode_timestep=0.03,
-    decode_noise_scale=0.025,
-    num_inference_steps=50,
-).frames[0]
-export_to_video(video, "output.mp4", fps=24)
-```
-
 Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.

 ## LTXPipeline
@@ -13,7 +13,7 @@
 # limitations under the License.
 -->

-# Mochi 1 Preview
+# Mochi

 [Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) from Genmo.

@@ -25,201 +25,6 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m

 </Tip>

-## Generating videos with Mochi-1 Preview
-
-The following example will download the full precision `mochi-1-preview` weights and produce the highest quality results but will require at least 42GB VRAM to run.
-
-```python
-import torch
-from diffusers import MochiPipeline
-from diffusers.utils import export_to_video
-
-pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
-
-# Enable memory savings
-pipe.enable_model_cpu_offload()
-pipe.enable_vae_tiling()
-
-prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
-
-with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
-      frames = pipe(prompt, num_frames=85).frames[0]
-
-export_to_video(frames, "mochi.mp4", fps=30)
-```
-
-## Using a lower precision variant to save memory
-
-The following example will use the `bfloat16` variant of the model and requires 22GB VRAM to run. There is a slight drop in the quality of the generated video as a result.
-
-```python
-import torch
-from diffusers import MochiPipeline
-from diffusers.utils import export_to_video
-
-pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16)
-
-# Enable memory savings
-pipe.enable_model_cpu_offload()
-pipe.enable_vae_tiling()
-
-prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
-frames = pipe(prompt, num_frames=85).frames[0]
-
-export_to_video(frames, "mochi.mp4", fps=30)
-```
-
-## Reproducing the results from the Genmo Mochi repo
-
-The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the the original implementation, please refer to the following example.
-
-<Tip>
-The original Mochi implementation zeros out empty prompts. However, enabling this option and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder.
-
-When enabling `force_zeros_for_empty_prompt`, it is recommended to run the text encoding step outside the autocast context in full precision.
-</Tip>
-
-<Tip>
-Decoding the latents in full precision is very memory intensive. You will need at least 70GB VRAM to generate the 163 frames in this example. To reduce memory, either reduce the number of frames or run the decoding step in `torch.bfloat16`.
-</Tip>
-
-```python
-import torch
-from torch.nn.attention import SDPBackend, sdpa_kernel
-
-from diffusers import MochiPipeline
-from diffusers.utils import export_to_video
-from diffusers.video_processor import VideoProcessor
-
-pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", force_zeros_for_empty_prompt=True)
-pipe.enable_vae_tiling()
-pipe.enable_model_cpu_offload()
-
-prompt =  "An aerial shot of a parade of elephants walking across the African savannah. The camera showcases the herd and the surrounding landscape."
-
-with torch.no_grad():
-    prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = (
-        pipe.encode_prompt(prompt=prompt)
-    )
-
-with torch.autocast("cuda", torch.bfloat16):
-    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
-        frames = pipe(
-            prompt_embeds=prompt_embeds,
-            prompt_attention_mask=prompt_attention_mask,
-            negative_prompt_embeds=negative_prompt_embeds,
-            negative_prompt_attention_mask=negative_prompt_attention_mask,
-            guidance_scale=4.5,
-            num_inference_steps=64,
-            height=480,
-            width=848,
-            num_frames=163,
-            generator=torch.Generator("cuda").manual_seed(0),
-            output_type="latent",
-            return_dict=False,
-        )[0]
-
-video_processor = VideoProcessor(vae_scale_factor=8)
-has_latents_mean = hasattr(pipe.vae.config, "latents_mean") and pipe.vae.config.latents_mean is not None
-has_latents_std = hasattr(pipe.vae.config, "latents_std") and pipe.vae.config.latents_std is not None
-if has_latents_mean and has_latents_std:
-    latents_mean = (
-        torch.tensor(pipe.vae.config.latents_mean).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype)
-    )
-    latents_std = (
-        torch.tensor(pipe.vae.config.latents_std).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype)
-    )
-    frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean
-else:
-    frames = frames / pipe.vae.config.scaling_factor
-
-with torch.no_grad():
-    video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0]
-
-video = video_processor.postprocess_video(video)[0]
-export_to_video(video, "mochi.mp4", fps=30)
-```
-
-## Running inference with multiple GPUs
-
-It is possible to split the large Mochi transformer across multiple GPUs using the `device_map` and `max_memory` options in `from_pretrained`. In the following example we split the model across two GPUs, each with 24GB of VRAM.
-
-```python
-import torch
-from diffusers import MochiPipeline, MochiTransformer3DModel
-from diffusers.utils import export_to_video
-
-model_id = "genmo/mochi-1-preview"
-transformer = MochiTransformer3DModel.from_pretrained(
-    model_id,
-    subfolder="transformer",
-    device_map="auto",
-    max_memory={0: "24GB", 1: "24GB"}
-)
-
-pipe = MochiPipeline.from_pretrained(model_id,  transformer=transformer)
-pipe.enable_model_cpu_offload()
-pipe.enable_vae_tiling()
-
-with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False):
-    frames = pipe(
-        prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
-        negative_prompt="",
-        height=480,
-        width=848,
-        num_frames=85,
-        num_inference_steps=50,
-        guidance_scale=4.5,
-        num_videos_per_prompt=1,
-        generator=torch.Generator(device="cuda").manual_seed(0),
-        max_sequence_length=256,
-        output_type="pil",
-    ).frames[0]
-
-export_to_video(frames, "output.mp4", fps=30)
-```
-
-## Using single file loading with the Mochi Transformer
-
-You can use `from_single_file` to load the Mochi transformer in its original format.
-
-<Tip>
-Diffusers currently doesn't support using the FP8 scaled versions of the Mochi single file checkpoints.
-</Tip>
-
-```python
-import torch
-from diffusers import MochiPipeline, MochiTransformer3DModel
-from diffusers.utils import export_to_video
-
-model_id = "genmo/mochi-1-preview"
-
-ckpt_path = "https://huggingface.co/Comfy-Org/mochi_preview_repackaged/blob/main/split_files/diffusion_models/mochi_preview_bf16.safetensors"
-
-transformer = MochiTransformer3DModel.from_pretrained(ckpt_path, torch_dtype=torch.bfloat16)
-
-pipe = MochiPipeline.from_pretrained(model_id,  transformer=transformer)
-pipe.enable_model_cpu_offload()
-pipe.enable_vae_tiling()
-
-with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False):
-    frames = pipe(
-        prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
-        negative_prompt="",
-        height=480,
-        width=848,
-        num_frames=85,
-        num_inference_steps=50,
-        guidance_scale=4.5,
-        num_videos_per_prompt=1,
-        generator=torch.Generator(device="cuda").manual_seed(0),
-        max_sequence_length=256,
-        output_type="pil",
-    ).frames[0]
-
-export_to_video(frames, "output.mp4", fps=30)
-```
-
 ## MochiPipeline

 [[autodoc]] MochiPipeline
@@ -32,9 +32,9 @@ Available models:

 | Model | Recommended dtype |
 |:-----:|:-----------------:|
-| [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` |
 | [`Efficient-Large-Model/Sana_1600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | `torch.float16` |
 | [`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | `torch.float16` |
+| [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` |
 | [`Efficient-Large-Model/Sana_1600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | `torch.float16` |
 | [`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | `torch.float16` |
 | [`Efficient-Large-Model/Sana_600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | `torch.float16` |
@@ -59,76 +59,9 @@ image.save("sd3_hello_world.png")
 - [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large)
 - [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3-5-large-turbo)

-## Image Prompting with IP-Adapters
-
-An IP-Adapter lets you prompt SD3 with images, in addition to the text prompt. This is especially useful when describing complex concepts that are difficult to articulate through text alone and you have reference images. To load and use an IP-Adapter, you need:
-
-- `image_encoder`: Pre-trained vision model used to obtain image features, usually a CLIP image encoder.
-- `feature_extractor`: Image processor that prepares the input image for the chosen `image_encoder`.
-- `ip_adapter_id`: Checkpoint containing parameters of image cross attention layers and image projection.
-
-IP-Adapters are trained for a specific model architecture, so they also work in finetuned variations of the base model. You can use the [`~SD3IPAdapterMixin.set_ip_adapter_scale`] function to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers both the text and image prompts equally.
-
-```python
-import torch
-from PIL import Image
-
-from diffusers import StableDiffusion3Pipeline
-from transformers import SiglipVisionModel, SiglipImageProcessor
-
-image_encoder_id = "google/siglip-so400m-patch14-384"
-ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"
-
-feature_extractor = SiglipImageProcessor.from_pretrained(
-    image_encoder_id,
-    torch_dtype=torch.float16
-)
-image_encoder = SiglipVisionModel.from_pretrained(
-    image_encoder_id,
-    torch_dtype=torch.float16
-).to( "cuda")
-
-pipe = StableDiffusion3Pipeline.from_pretrained(
-    "stabilityai/stable-diffusion-3.5-large",
-    torch_dtype=torch.float16,
-    feature_extractor=feature_extractor,
-    image_encoder=image_encoder,
-).to("cuda")
-
-pipe.load_ip_adapter(ip_adapter_id)
-pipe.set_ip_adapter_scale(0.6)
-
-ref_img = Image.open("image.jpg").convert('RGB')
-
-image = pipe(
-    width=1024,
-    height=1024,
-    prompt="a cat",
-    negative_prompt="lowres, low quality, worst quality",
-    num_inference_steps=24,
-    guidance_scale=5.0,
-    ip_adapter_image=ref_img
-).images[0]
-
-image.save("result.jpg")
-```
-
-<div class="justify-center">
-    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sd3_ip_adapter_example.png"/>
-    <figcaption class="mt-2 text-sm text-center text-gray-500">IP-Adapter examples with prompt "a cat"</figcaption>
-</div>
-
-
-<Tip>
-
-Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.
-
-</Tip>
-
-
 ## Memory Optimisations for SD3

-SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.
+SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low-resource hardware.

 ### Running Inference with Model Offloading

@@ -45,11 +45,12 @@ transformer = FluxTransformer2DModel.from_single_file(
 pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
+    generator=torch.manual_seed(0),
    torch_dtype=torch.bfloat16,
 )
 pipe.enable_model_cpu_offload()
 prompt = "A cat holding a sign that says hello world"
-image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
+image = pipe(prompt).images[0]
 image.save("flux-gguf.png")
 ```

@@ -33,8 +33,8 @@ If you are new to the quantization field, we recommend you to check out these be
 ## When to use what?

 Diffusers currently supports the following quantization methods.
- [BitsandBytes](./bitsandbytes)
- [TorchAO](./torchao)
- [GGUF](./gguf)
+- [BitsandBytes](./bitsandbytes.md)
+- [TorchAO](./torchao.md)
+- [GGUF](./gguf.md)

 [This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
@@ -25,10 +25,9 @@ Quantize a model by passing [`TorchAoConfig`] to [`~ModelMixin.from_pretrained`]
 The example below only quantizes the weights to int8.

 ```python
-import torch
 from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

-model_id = "black-forest-labs/FLUX.1-dev"
+model_id = "black-forest-labs/Flux.1-Dev"
 dtype = torch.bfloat16

 quantization_config = TorchAoConfig("int8wo")
@@ -45,14 +44,8 @@ pipe = FluxPipeline.from_pretrained(
 )
 pipe.to("cuda")

-# Without quantization: ~31.447 GB
-# With quantization: ~20.40 GB
-print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
-
 prompt = "A cat holding a sign that says hello world"
-image = pipe(
-    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
-).images[0]
+image = pipe(prompt, num_inference_steps=28, guidance_scale=0.0).images[0]
 image.save("output.png")
 ```

@@ -93,63 +86,6 @@ Some quantization methods are aliases (for example, `int8wo` is the commonly use

 Refer to the official torchao documentation for a better understanding of the available quantization methods and the exhaustive list of configuration options available.

-## Serializing and Deserializing quantized models
-
-To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.
-
-```python
-import torch
-from diffusers import FluxTransformer2DModel, TorchAoConfig
-
-quantization_config = TorchAoConfig("int8wo")
-transformer = FluxTransformer2DModel.from_pretrained(
-    "black-forest-labs/Flux.1-Dev",
-    subfolder="transformer",
-    quantization_config=quantization_config,
-    torch_dtype=torch.bfloat16,
-)
-transformer.save_pretrained("/path/to/flux_int8wo", safe_serialization=False)
-```
-
-To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.
-
-```python
-import torch
-from diffusers import FluxPipeline, FluxTransformer2DModel
-
-transformer = FluxTransformer2DModel.from_pretrained("/path/to/flux_int8wo", torch_dtype=torch.bfloat16, use_safetensors=False)
-pipe = FluxPipeline.from_pretrained("black-forest-labs/Flux.1-Dev", transformer=transformer, torch_dtype=torch.bfloat16)
-pipe.to("cuda")
-
-prompt = "A cat holding a sign that says hello world"
-image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
-image.save("output.png")
-```
-
-Some quantization methods, such as `uint4wo`, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them. In order to work around this, one can load the state dict manually into the model. Note, however, that this requires using `weights_only=False` in `torch.load`, so it should be run only if the weights were obtained from a trustable source.
-
-```python
-import torch
-from accelerate import init_empty_weights
-from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig
-
-# Serialize the model
-transformer = FluxTransformer2DModel.from_pretrained(
-    "black-forest-labs/Flux.1-Dev",
-    subfolder="transformer",
-    quantization_config=TorchAoConfig("uint4wo"),
-    torch_dtype=torch.bfloat16,
-)
-transformer.save_pretrained("/path/to/flux_uint4wo", safe_serialization=False, max_shard_size="50GB")
-# ...
-
-# Load the model
-state_dict = torch.load("/path/to/flux_uint4wo/diffusion_pytorch_model.bin", weights_only=False, map_location="cpu")
-with init_empty_weights():
-    transformer = FluxTransformer2DModel.from_config("/path/to/flux_uint4wo/config.json")
-transformer.load_state_dict(state_dict, strict=True, assign=True)
-```
-
 ## Resources

 - [TorchAO Quantization API](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md)
@@ -74,7 +74,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)
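
The repeated `0.32.0` to `0.32.0.dev0` change in the example scripts matters because a development build sorts before the final release with the same number. The sketch below is a simplified, standard-library-only illustration of that ordering. It assumes `check_min_version` rejects any installed version that compares lower than the requested minimum. The helpers `parse_version` and `meets_minimum` are hypothetical names for this illustration, not part of diffusers, and they handle only `X.Y.Z` and `X.Y.Z.devN` strings rather than full PEP 440.

```python
import re


def parse_version(text):
    """Turn 'X.Y.Z' or 'X.Y.Z.devN' into a sortable key (simplified, not full PEP 440)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.dev(\d+))?", text)
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    dev = match.group(2)
    # A dev build sorts before the final release that shares its release number.
    stage = (0, int(dev)) if dev is not None else (1, 0)
    return release, stage


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum` under the simplified ordering."""
    return parse_version(installed) >= parse_version(minimum)


# A source install reporting 0.32.0.dev0 satisfies a dev0 minimum...
print(meets_minimum("0.32.0.dev0", "0.32.0.dev0"))
# ...but not a minimum pinned to the final 0.32.0 release.
print(meets_minimum("0.32.0.dev0", "0.32.0"))
```

Under this ordering, pinning the scripts to `0.32.0.dev0` keeps them usable from a source checkout of the development branch, while pinning to `0.32.0` requires the released package.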

@@ -73,7 +73,7 @@ from diffusers.utils.import_utils import is_xformers_available


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -79,7 +79,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -61,7 +61,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -52,7 +52,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -43,7 +43,8 @@ from diffusers.utils import BaseOutput, check_min_version


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")
+

 class MarigoldDepthOutput(BaseOutput):
    """
@@ -73,7 +73,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -66,7 +66,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -79,7 +79,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -72,7 +72,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -78,7 +78,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -60,7 +60,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -60,7 +60,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = logging.getLogger(__name__)

@@ -65,7 +65,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)
 if is_torch_npu_available():
@@ -59,7 +59,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -61,7 +61,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)
 if is_torch_npu_available():
@@ -63,7 +63,7 @@ from diffusers.utils.import_utils import is_xformers_available


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -73,7 +73,7 @@ This will also allow us to push the trained LoRA parameters to the Hugging Face
 Now, we can launch training using:

 ```bash
-export MODEL_NAME="Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers"
+export MODEL_NAME="Efficient-Large-Model/Sana_1600M_1024px_diffusers"
 export INSTANCE_DIR="dog"
 export OUTPUT_DIR="trained-sana-lora"

@@ -124,4 +124,4 @@ We provide several options for optimizing memory optimization:
 * `cache_latents`: When enabled, we will pre-compute the latents from the input images with the VAE and remove the VAE from memory once done.
 * `--use_8bit_adam`: When enabled, we will use the 8bit version of AdamW provided by the `bitsandbytes` library.

-Refer to the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana) of the `SanaPipeline` to know more about the models available under the SANA family and their preferred dtypes during inference.
+Refer to the [official documentation](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana) of the `SanaPipeline` to know more about the models available under the SANA family and their preferred dtypes during inference.
@@ -1,206 +0,0 @@
-# coding=utf-8
-# Copyright 2024 HuggingFace Inc.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import logging
-import os
-import sys
-import tempfile
-
-import safetensors
-
-
-sys.path.append("..")
-from test_examples_utils import ExamplesTestsAccelerate, run_command  # noqa: E402
-
-
-logging.basicConfig(level=logging.DEBUG)
-
-logger = logging.getLogger()
-stream_handler = logging.StreamHandler(sys.stdout)
-logger.addHandler(stream_handler)
-
-
-class DreamBoothLoRASANA(ExamplesTestsAccelerate):
-    instance_data_dir = "docs/source/en/imgs"
-    pretrained_model_name_or_path = "hf-internal-testing/tiny-sana-pipe"
-    script_path = "examples/dreambooth/train_dreambooth_lora_sana.py"
-    transformer_layer_type = "transformer_blocks.0.attn1.to_k"
-
-    def test_dreambooth_lora_sana(self):
-        with tempfile.TemporaryDirectory() as tmpdir:
-            test_args = f"""
-                {self.script_path}
-                --pretrained_model_name_or_path {self.pretrained_model_name_or_path}
-                --instance_data_dir {self.instance_data_dir}
-                --resolution 32
-                --train_batch_size 1
-                --gradient_accumulation_steps 1
-                --max_train_steps 2
-                --learning_rate 5.0e-04
-                --scale_lr
-                --lr_scheduler constant
-                --lr_warmup_steps 0
-                --output_dir {tmpdir}
-                --max_sequence_length 16
-                """.split()
-
-            test_args.extend(["--instance_prompt", ""])
-            run_command(self._launch_args + test_args)
-            # save_pretrained smoke test
-            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
-
-            # make sure the state_dict has the correct naming in the parameters.
-            lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
-            is_lora = all("lora" in k for k in lora_state_dict.keys())
-            self.assertTrue(is_lora)
-
-            # when not training the text encoder, all the parameters in the state dict should start
-            # with `"transformer"` in their names.
-            starts_with_transformer = all(key.startswith("transformer") for key in lora_state_dict.keys())
-            self.assertTrue(starts_with_transformer)
-
-    def test_dreambooth_lora_latent_caching(self):
-        with tempfile.TemporaryDirectory() as tmpdir:
-            test_args = f"""
-                {self.script_path}
-                --pretrained_model_name_or_path {self.pretrained_model_name_or_path}
-                --instance_data_dir {self.instance_data_dir}
-                --resolution 32
-                --train_batch_size 1
-                --gradient_accumulation_steps 1
-                --max_train_steps 2
-                --cache_latents
-                --learning_rate 5.0e-04
-                --scale_lr
-                --lr_scheduler constant
-                --lr_warmup_steps 0
-                --output_dir {tmpdir}
-                --max_sequence_length 16
-                """.split()
-
-            test_args.extend(["--instance_prompt", ""])
-            run_command(self._launch_args + test_args)
-            # save_pretrained smoke test
-            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
-
-            # make sure the state_dict has the correct naming in the parameters.
-            lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
-            is_lora = all("lora" in k for k in lora_state_dict.keys())
-            self.assertTrue(is_lora)
-
-            # when not training the text encoder, all the parameters in the state dict should start
-            # with `"transformer"` in their names.
-            starts_with_transformer = all(key.startswith("transformer") for key in lora_state_dict.keys())
-            self.assertTrue(starts_with_transformer)
-
-    def test_dreambooth_lora_layers(self):
-        with tempfile.TemporaryDirectory() as tmpdir:
-            test_args = f"""
-                {self.script_path}
-                --pretrained_model_name_or_path {self.pretrained_model_name_or_path}
-                --instance_data_dir {self.instance_data_dir}
-                --resolution 32
-                --train_batch_size 1
-                --gradient_accumulation_steps 1
-                --max_train_steps 2
-                --cache_latents
-                --learning_rate 5.0e-04
-                --scale_lr
-                --lora_layers {self.transformer_layer_type}
-                --lr_scheduler constant
-                --lr_warmup_steps 0
-                --output_dir {tmpdir}
-                --max_sequence_length 16
-                """.split()
-
-            test_args.extend(["--instance_prompt", ""])
-            run_command(self._launch_args + test_args)
-            # save_pretrained smoke test
-            self.assertTrue(os.path.isfile(os.path.join(tmpdir, "pytorch_lora_weights.safetensors")))
-
-            # make sure the state_dict has the correct naming in the parameters.
-            lora_state_dict = safetensors.torch.load_file(os.path.join(tmpdir, "pytorch_lora_weights.safetensors"))
-            is_lora = all("lora" in k for k in lora_state_dict.keys())
-            self.assertTrue(is_lora)
-
-            # when not training the text encoder, all the parameters in the state dict should start
-            # with `"transformer"` in their names. In this test, only params of
-            # `self.transformer_layer_type` should be in the state dict.
-            starts_with_transformer = all(self.transformer_layer_type in key for key in lora_state_dict)
-            self.assertTrue(starts_with_transformer)
-
-    def test_dreambooth_lora_sana_checkpointing_checkpoints_total_limit(self):
-        with tempfile.TemporaryDirectory() as tmpdir:
-            test_args = f"""
-            {self.script_path}
-            --pretrained_model_name_or_path={self.pretrained_model_name_or_path}
-            --instance_data_dir={self.instance_data_dir}
-            --output_dir={tmpdir}
-            --resolution=32
-            --train_batch_size=1
-            --gradient_accumulation_steps=1
-            --max_train_steps=6
-            --checkpoints_total_limit=2
-            --checkpointing_steps=2
-            --max_sequence_length 16
-            """.split()
-
-            test_args.extend(["--instance_prompt", ""])
-            run_command(self._launch_args + test_args)
-
-            self.assertEqual(
-                {x for x in os.listdir(tmpdir) if "checkpoint" in x},
-                {"checkpoint-4", "checkpoint-6"},
-            )
-
-    def test_dreambooth_lora_sana_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints(self):
-        with tempfile.TemporaryDirectory() as tmpdir:
-            test_args = f"""
-            {self.script_path}
-            --pretrained_model_name_or_path={self.pretrained_model_name_or_path}
-            --instance_data_dir={self.instance_data_dir}
-            --output_dir={tmpdir}
-            --resolution=32
-            --train_batch_size=1
-            --gradient_accumulation_steps=1
-            --max_train_steps=4
-            --checkpointing_steps=2
-            --max_sequence_length 166
-            """.split()
-
-            test_args.extend(["--instance_prompt", ""])
-            run_command(self._launch_args + test_args)
-
-            self.assertEqual({x for x in os.listdir(tmpdir) if "checkpoint" in x}, {"checkpoint-2", "checkpoint-4"})
-
-            resume_run_args = f"""
-            {self.script_path}
-            --pretrained_model_name_or_path={self.pretrained_model_name_or_path}
-            --instance_data_dir={self.instance_data_dir}
-            --output_dir={tmpdir}
-            --resolution=32
-            --train_batch_size=1
-            --gradient_accumulation_steps=1
-            --max_train_steps=8
-            --checkpointing_steps=2
-            --resume_from_checkpoint=checkpoint-4
-            --checkpoints_total_limit=2
-            --max_sequence_length 16
-            """.split()
-
-            resume_run_args.extend(["--instance_prompt", ""])
-            run_command(self._launch_args + resume_run_args)
-
-            self.assertEqual({x for x in os.listdir(tmpdir) if "checkpoint" in x}, {"checkpoint-6", "checkpoint-8"})
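
The checkpoint assertions in the removed test file follow from a simple rotation rule. The sketch below simulates that rule in plain Python, assuming the training script deletes the oldest checkpoints just before saving a new one, so that at most `checkpoints_total_limit` remain afterwards. The function `simulate_checkpoints` and its parameters are hypothetical and written only for this illustration; it does not call the training script.

```python
def simulate_checkpoints(max_train_steps, checkpointing_steps, total_limit=None,
                         existing=(), resume_step=0):
    """Return the checkpoint directory names expected after a (possibly resumed) run."""
    kept = sorted(existing)
    for step in range(resume_step + 1, max_train_steps + 1):
        if step % checkpointing_steps != 0:
            continue
        if total_limit is not None and len(kept) >= total_limit:
            # Drop the oldest entries so that saving the new one leaves `total_limit`.
            num_to_remove = len(kept) - total_limit + 1
            kept = kept[num_to_remove:]
        kept.append(step)
    return {f"checkpoint-{step}" for step in kept}


# Fresh run: 6 steps, checkpoint every 2 steps, keep at most 2.
first = simulate_checkpoints(6, 2, total_limit=2)

# Run without a limit, then resume from step 4 with a limit of 2.
unlimited = simulate_checkpoints(4, 2)
resumed = simulate_checkpoints(8, 2, total_limit=2, existing=[2, 4], resume_step=4)
print(sorted(first), sorted(unlimited), sorted(resumed))
```

The three results correspond to the sets asserted in the two `checkpoints_total_limit` tests above.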
@@ -63,7 +63,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -35,7 +35,7 @@ from diffusers.utils import check_min_version


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 # Cache compiled models across invocations of this script.
 cc.initialize_cache(os.path.expanduser("~/.cache/jax/compilation_cache"))
@@ -65,7 +65,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -70,7 +70,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -72,7 +72,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -70,7 +70,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -943,7 +943,7 @@ def main(args):

    # Load scheduler and models
    noise_scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
-        args.pretrained_model_name_or_path, subfolder="scheduler", revision=args.revision
+        args.pretrained_model_name_or_path, subfolder="scheduler"
    )
    noise_scheduler_copy = copy.deepcopy(noise_scheduler)
    text_encoder = Gemma2Model.from_pretrained(
@@ -964,6 +964,15 @@ def main(args):
    vae.requires_grad_(False)
    text_encoder.requires_grad_(False)

+    # Initialize a text encoding pipeline and keep it to CPU for now.
+    text_encoding_pipeline = SanaPipeline.from_pretrained(
+        args.pretrained_model_name_or_path,
+        vae=None,
+        transformer=None,
+        text_encoder=text_encoder,
+        tokenizer=tokenizer,
+    )
+
    # For mixed precision training we cast all non-trainable weights (vae, text_encoder and transformer) to half-precision
    # as these weights are only used for inference, keeping weights in full precision is not required.
    weight_dtype = torch.float32
@@ -984,15 +993,6 @@ def main(args):
    # because Gemma2 is particularly suited for bfloat16.
    text_encoder.to(dtype=torch.bfloat16)

-    # Initialize a text encoding pipeline and keep it to CPU for now.
-    text_encoding_pipeline = SanaPipeline.from_pretrained(
-        args.pretrained_model_name_or_path,
-        vae=None,
-        transformer=None,
-        text_encoder=text_encoder,
-        tokenizer=tokenizer,
-    )
-
    if args.gradient_checkpointing:
        transformer.enable_gradient_checkpointing()

@@ -1182,7 +1182,6 @@ def main(args):
            )
        if args.offload:
            text_encoding_pipeline = text_encoding_pipeline.to("cpu")
-        prompt_embeds = prompt_embeds.to(transformer.dtype)
        return prompt_embeds, prompt_attention_mask

    # If no type of tuning is done on the text_encoder and custom instance prompts are NOT
@@ -1217,7 +1216,7 @@ def main(args):
    vae_config_scaling_factor = vae.config.scaling_factor
    if args.cache_latents:
        latents_cache = []
-        vae = vae.to(accelerator.device)
+        vae = vae.to("cuda")
        for batch in tqdm(train_dataloader, desc="Caching latents"):
            with torch.no_grad():
                batch["pixel_values"] = batch["pixel_values"].to(
@@ -72,7 +72,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -79,7 +79,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -63,7 +63,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -54,7 +54,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -57,7 +57,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -57,7 +57,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -60,7 +60,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -52,7 +52,7 @@ if is_wandb_available():


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -46,7 +46,7 @@ from diffusers.utils import check_min_version, is_wandb_available


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -46,7 +46,7 @@ from diffusers.utils import check_min_version, is_wandb_available


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -51,7 +51,7 @@ if is_wandb_available():


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -60,7 +60,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -57,7 +57,7 @@ if is_wandb_available():


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -49,7 +49,7 @@ from diffusers.utils import check_min_version


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = logging.getLogger(__name__)

@@ -56,7 +56,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -68,7 +68,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)
 if is_torch_npu_available():
@@ -55,7 +55,7 @@ from diffusers.utils.torch_utils import is_compiled_module


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)
 if is_torch_npu_available():
@@ -81,7 +81,7 @@ else:


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -56,7 +56,7 @@ else:
 # ------------------------------------------------------------------------------

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = logging.getLogger(__name__)

@@ -76,7 +76,7 @@ else:


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__)

@@ -29,7 +29,7 @@ from diffusers.utils.import_utils import is_xformers_available


 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -50,7 +50,7 @@ if is_wandb_available():
    import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.32.0")
+check_min_version("0.32.0.dev0")

 logger = get_logger(__name__, log_level="INFO")

@@ -1,97 +0,0 @@
-import argparse
-from contextlib import nullcontext
-
-import safetensors.torch
-from accelerate import init_empty_weights
-from huggingface_hub import hf_hub_download
-
-from diffusers.utils.import_utils import is_accelerate_available, is_transformers_available
-
-
-if is_transformers_available():
-    from transformers import CLIPVisionModelWithProjection
-
-    vision = True
-else:
-    vision = False
-
-"""
-python scripts/convert_flux_xlabs_ipadapter_to_diffusers.py  \
--original_state_dict_repo_id "XLabs-AI/flux-ip-adapter" \
--filename "flux-ip-adapter.safetensors" \
--output_path "flux-ip-adapter-hf/"
-"""
-
-
-CTX = init_empty_weights if is_accelerate_available() else nullcontext
-
-parser = argparse.ArgumentParser()
-parser.add_argument("--original_state_dict_repo_id", default=None, type=str)
-parser.add_argument("--filename", default="flux.safetensors", type=str)
-parser.add_argument("--checkpoint_path", default=None, type=str)
-parser.add_argument("--output_path", type=str)
-parser.add_argument("--vision_pretrained_or_path", default="openai/clip-vit-large-patch14", type=str)
-
-args = parser.parse_args()
-
-
-def load_original_checkpoint(args):
-    if args.original_state_dict_repo_id is not None:
-        ckpt_path = hf_hub_download(repo_id=args.original_state_dict_repo_id, filename=args.filename)
-    elif args.checkpoint_path is not None:
-        ckpt_path = args.checkpoint_path
-    else:
-        raise ValueError("Please provide either `original_state_dict_repo_id` or a local `checkpoint_path`.")
-
-    original_state_dict = safetensors.torch.load_file(ckpt_path)
-    return original_state_dict
-
-
-def convert_flux_ipadapter_checkpoint_to_diffusers(original_state_dict, num_layers):
-    converted_state_dict = {}
-
-    # image_proj
-    ## norm
-    converted_state_dict["image_proj.norm.weight"] = original_state_dict.pop("ip_adapter_proj_model.norm.weight")
-    converted_state_dict["image_proj.norm.bias"] = original_state_dict.pop("ip_adapter_proj_model.norm.bias")
-    ## proj
-    converted_state_dict["image_proj.proj.weight"] = original_state_dict.pop("ip_adapter_proj_model.proj.weight")
-    converted_state_dict["image_proj.proj.bias"] = original_state_dict.pop("ip_adapter_proj_model.proj.bias")
-
-    # double transformer blocks
-    for i in range(num_layers):
-        block_prefix = f"ip_adapter.{i}."
-        # to_k_ip
-        converted_state_dict[f"{block_prefix}to_k_ip.bias"] = original_state_dict.pop(
-            f"double_blocks.{i}.processor.ip_adapter_double_stream_k_proj.bias"
-        )
-        converted_state_dict[f"{block_prefix}to_k_ip.weight"] = original_state_dict.pop(
-            f"double_blocks.{i}.processor.ip_adapter_double_stream_k_proj.weight"
-        )
-        # to_v_ip
-        converted_state_dict[f"{block_prefix}to_v_ip.bias"] = original_state_dict.pop(
-            f"double_blocks.{i}.processor.ip_adapter_double_stream_v_proj.bias"
-        )
-        converted_state_dict[f"{block_prefix}to_v_ip.weight"] = original_state_dict.pop(
-            f"double_blocks.{i}.processor.ip_adapter_double_stream_v_proj.weight"
-        )
-
-    return converted_state_dict
-
-
-def main(args):
-    original_ckpt = load_original_checkpoint(args)
-
-    num_layers = 19
-    converted_ip_adapter_state_dict = convert_flux_ipadapter_checkpoint_to_diffusers(original_ckpt, num_layers)
-
-    print("Saving Flux IP-Adapter in Diffusers format.")
-    safetensors.torch.save_file(converted_ip_adapter_state_dict, f"{args.output_path}/model.safetensors")
-
-    if vision:
-        model = CLIPVisionModelWithProjection.from_pretrained(args.vision_pretrained_or_path)
-        model.save_pretrained(f"{args.output_path}/image_encoder")
-
-
-if __name__ == "__main__":
-    main(args)
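
The removed script renames XLabs checkpoint keys to the Diffusers layout. A minimal sketch of that renaming on a plain dictionary is given below. It assumes the projection weights are stored under `ip_adapter_proj_model.proj.*`, uses string placeholders instead of tensors, and the helper name `remap_ip_adapter_keys` is hypothetical (it is not part of diffusers).

```python
def remap_ip_adapter_keys(original, num_layers):
    """Rename XLabs-style keys to a Diffusers-style layout (hypothetical helper)."""
    converted = {}
    # Image projection: `norm` and `proj` each carry a weight and a bias.
    for module in ("norm", "proj"):
        for param in ("weight", "bias"):
            converted[f"image_proj.{module}.{param}"] = original.pop(
                f"ip_adapter_proj_model.{module}.{param}"
            )
    # One key projection and one value projection per double-stream block.
    for i in range(num_layers):
        for short in ("k", "v"):
            for param in ("weight", "bias"):
                src = f"double_blocks.{i}.processor.ip_adapter_double_stream_{short}_proj.{param}"
                converted[f"ip_adapter.{i}.to_{short}_ip.{param}"] = original.pop(src)
    return converted


# Dummy checkpoint: placeholder strings stand in for tensors (assumption for illustration).
dummy = {
    f"ip_adapter_proj_model.{m}.{p}": f"{m}.{p}"
    for m in ("norm", "proj")
    for p in ("weight", "bias")
}
for i in range(2):
    for s in ("k", "v"):
        for p in ("weight", "bias"):
            dummy[f"double_blocks.{i}.processor.ip_adapter_double_stream_{s}_proj.{p}"] = f"{i}.{s}.{p}"

converted = remap_ip_adapter_keys(dummy, num_layers=2)
```

Because every source key is popped exactly once, an empty `dummy` after conversion is a cheap check that no checkpoint entry was skipped or read twice.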
@@ -1,9 +1,7 @@
 import argparse
-from pathlib import Path
 from typing import Any, Dict

 import torch
-from accelerate import init_empty_weights
 from safetensors.torch import load_file
 from transformers import T5EncoderModel, T5Tokenizer

@@ -23,9 +21,7 @@ TRANSFORMER_KEYS_RENAME_DICT = {
    "k_norm": "norm_k",
 }

-TRANSFORMER_SPECIAL_KEYS_REMAP = {
-    "vae": remove_keys_,
-}
+TRANSFORMER_SPECIAL_KEYS_REMAP = {}

 VAE_KEYS_RENAME_DICT = {
    # decoder
@@ -58,31 +54,10 @@ VAE_KEYS_RENAME_DICT = {
    "per_channel_statistics.std-of-means": "latents_std",
 }

-VAE_091_RENAME_DICT = {
-    # decoder
-    "up_blocks.0": "mid_block",
-    "up_blocks.1": "up_blocks.0.upsamplers.0",
-    "up_blocks.2": "up_blocks.0",
-    "up_blocks.3": "up_blocks.1.upsamplers.0",
-    "up_blocks.4": "up_blocks.1",
-    "up_blocks.5": "up_blocks.2.upsamplers.0",
-    "up_blocks.6": "up_blocks.2",
-    "up_blocks.7": "up_blocks.3.upsamplers.0",
-    "up_blocks.8": "up_blocks.3",
-    # common
-    "last_time_embedder": "time_embedder",
-    "last_scale_shift_table": "scale_shift_table",
-}
-
 VAE_SPECIAL_KEYS_REMAP = {
    "per_channel_statistics.channel": remove_keys_,
    "per_channel_statistics.mean-of-means": remove_keys_,
    "per_channel_statistics.mean-of-stds": remove_keys_,
-    "model.diffusion_model": remove_keys_,
-}
-
-VAE_091_SPECIAL_KEYS_REMAP = {
-    "timestep_scale_multiplier": remove_keys_,
 }


@@ -105,16 +80,13 @@ def convert_transformer(
    ckpt_path: str,
    dtype: torch.dtype,
 ):
-    PREFIX_KEY = "model.diffusion_model."
+    PREFIX_KEY = ""

    original_state_dict = get_state_dict(load_file(ckpt_path))
-    with init_empty_weights():
-        transformer = LTXVideoTransformer3DModel()
+    transformer = LTXVideoTransformer3DModel().to(dtype=dtype)

    for key in list(original_state_dict.keys()):
-        new_key = key[:]
-        if new_key.startswith(PREFIX_KEY):
-            new_key = key[len(PREFIX_KEY) :]
+        new_key = key[len(PREFIX_KEY) :]
        for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_inplace(original_state_dict, key, new_key)
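
In this hunk the conditional prefix strip becomes an unconditional slice, which leaves keys unchanged only because `PREFIX_KEY` is now the empty string. The sketch below shows the conditional form on a single key. The helper name `rename_key` and the example key are hypothetical; the `"k_norm": "norm_k"` entry is taken from `TRANSFORMER_KEYS_RENAME_DICT` above.

```python
def rename_key(key, prefix, rename_map):
    """Strip `prefix` when present, then apply substring renames (hypothetical helper)."""
    # Only slice when the key really starts with the prefix; an unconditional
    # slice would silently cut characters off keys that lack it.
    new_key = key[len(prefix):] if prefix and key.startswith(prefix) else key
    for old, new in rename_map.items():
        new_key = new_key.replace(old, new)
    return new_key


rename_map = {"k_norm": "norm_k"}

# Example key names are made up for illustration.
with_prefix = rename_key("model.diffusion_model.blocks.0.attn.k_norm.weight", "model.diffusion_model.", rename_map)
without_prefix = rename_key("blocks.0.attn.k_norm.weight", "model.diffusion_model.", rename_map)
empty_prefix = rename_key("blocks.0.attn.k_norm.weight", "", rename_map)
```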
@@ -125,21 +97,16 @@ def convert_transformer(
                continue
            handler_fn_inplace(key, original_state_dict)

-    transformer.load_state_dict(original_state_dict, strict=True, assign=True)
+    transformer.load_state_dict(original_state_dict, strict=True)
    return transformer


-def convert_vae(ckpt_path: str, config, dtype: torch.dtype):
-    PREFIX_KEY = "vae."
-
+def convert_vae(ckpt_path: str, dtype: torch.dtype):
    original_state_dict = get_state_dict(load_file(ckpt_path))
-    with init_empty_weights():
-        vae = AutoencoderKLLTXVideo(**config)
+    vae = AutoencoderKLLTXVideo().to(dtype=dtype)

    for key in list(original_state_dict.keys()):
        new_key = key[:]
-        if new_key.startswith(PREFIX_KEY):
-            new_key = key[len(PREFIX_KEY) :]
        for replace_key, rename_key in VAE_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_inplace(original_state_dict, key, new_key)
@@ -150,60 +117,10 @@ def convert_vae(ckpt_path: str, config, dtype: torch.dtype):
                continue
            handler_fn_inplace(key, original_state_dict)

-    vae.load_state_dict(original_state_dict, strict=True, assign=True)
+    vae.load_state_dict(original_state_dict, strict=True)
    return vae


-def get_vae_config(version: str) -> Dict[str, Any]:
-    if version == "0.9.0":
-        config = {
-            "in_channels": 3,
-            "out_channels": 3,
-            "latent_channels": 128,
-            "block_out_channels": (128, 256, 512, 512),
-            "decoder_block_out_channels": (128, 256, 512, 512),
-            "layers_per_block": (4, 3, 3, 3, 4),
-            "decoder_layers_per_block": (4, 3, 3, 3, 4),
-            "spatio_temporal_scaling": (True, True, True, False),
-            "decoder_spatio_temporal_scaling": (True, True, True, False),
-            "decoder_inject_noise": (False, False, False, False, False),
-            "upsample_residual": (False, False, False, False),
-            "upsample_factor": (1, 1, 1, 1),
-            "patch_size": 4,
-            "patch_size_t": 1,
-            "resnet_norm_eps": 1e-6,
-            "scaling_factor": 1.0,
-            "encoder_causal": True,
-            "decoder_causal": False,
-            "timestep_conditioning": False,
-        }
-    elif version == "0.9.1":
-        config = {
-            "in_channels": 3,
-            "out_channels": 3,
-            "latent_channels": 128,
-            "block_out_channels": (128, 256, 512, 512),
-            "decoder_block_out_channels": (256, 512, 1024),
-            "layers_per_block": (4, 3, 3, 3, 4),
-            "decoder_layers_per_block": (5, 6, 7, 8),
-            "spatio_temporal_scaling": (True, True, True, False),
-            "decoder_spatio_temporal_scaling": (True, True, True),
-            "decoder_inject_noise": (True, True, True, False),
-            "upsample_residual": (True, True, True),
-            "upsample_factor": (2, 2, 2),
-            "timestep_conditioning": True,
-            "patch_size": 4,
-            "patch_size_t": 1,
-            "resnet_norm_eps": 1e-6,
-            "scaling_factor": 1.0,
-            "encoder_causal": True,
-            "decoder_causal": False,
-        }
-        VAE_KEYS_RENAME_DICT.update(VAE_091_RENAME_DICT)
-        VAE_SPECIAL_KEYS_REMAP.update(VAE_091_SPECIAL_KEYS_REMAP)
-    return config
-
-
 def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
@@ -222,9 +139,6 @@ def get_args():
    parser.add_argument("--save_pipeline", action="store_true")
    parser.add_argument("--output_path", type=str, required=True, help="Path where converted model should be saved")
    parser.add_argument("--dtype", default="fp32", help="Torch dtype to save the model in.")
-    parser.add_argument(
-        "--version", type=str, default="0.9.0", choices=["0.9.0", "0.9.1"], help="Version of the LTX model"
-    )
    return parser.parse_args()
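
The removed `--version` flag selected a VAE configuration per LTX release. A table-driven sketch of that selection follows. It lists only a subset of the fields from the removed `get_vae_config`, and the names `COMMON_VAE_CONFIG` and `VAE_CONFIG_OVERRIDES` are hypothetical. Unlike the removed function, it does not update module-level rename dictionaries.

```python
import argparse

# Fields shared by both versions (subset of the removed config).
COMMON_VAE_CONFIG = {
    "in_channels": 3,
    "out_channels": 3,
    "latent_channels": 128,
    "patch_size": 4,
    "patch_size_t": 1,
}

# Fields that differ between versions (subset of the removed config).
VAE_CONFIG_OVERRIDES = {
    "0.9.0": {
        "decoder_block_out_channels": (128, 256, 512, 512),
        "upsample_factor": (1, 1, 1, 1),
        "timestep_conditioning": False,
    },
    "0.9.1": {
        "decoder_block_out_channels": (256, 512, 1024),
        "upsample_factor": (2, 2, 2),
        "timestep_conditioning": True,
    },
}


def get_vae_config(version):
    """Return a fresh config dict for the requested version."""
    if version not in VAE_CONFIG_OVERRIDES:
        raise ValueError(f"Unsupported version: {version}")
    return {**COMMON_VAE_CONFIG, **VAE_CONFIG_OVERRIDES[version]}


parser = argparse.ArgumentParser()
parser.add_argument("--version", type=str, default="0.9.0", choices=sorted(VAE_CONFIG_OVERRIDES))
# An explicit argument list is passed so the sketch runs outside a CLI.
args = parser.parse_args(["--version", "0.9.1"])
config = get_vae_config(args.version)
```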


@@ -247,7 +161,6 @@ if __name__ == "__main__":
    transformer = None
    dtype = DTYPE_MAPPING[args.dtype]
    variant = VARIANT_MAPPING[args.dtype]
-    output_path = Path(args.output_path)

    if args.save_pipeline:
        assert args.transformer_ckpt_path is not None and args.vae_ckpt_path is not None
@@ -256,14 +169,13 @@ if __name__ == "__main__":
        transformer: LTXVideoTransformer3DModel = convert_transformer(args.transformer_ckpt_path, dtype)
        if not args.save_pipeline:
            transformer.save_pretrained(
-                output_path / "transformer", safe_serialization=True, max_shard_size="5GB", variant=variant
+                args.output_path, safe_serialization=True, max_shard_size="5GB", variant=variant
            )

    if args.vae_ckpt_path is not None:
-        config = get_vae_config(args.version)
-        vae: AutoencoderKLLTXVideo = convert_vae(args.vae_ckpt_path, config, dtype)
+        vae: AutoencoderKLLTXVideo = convert_vae(args.vae_ckpt_path, dtype)
        if not args.save_pipeline:
-            vae.save_pretrained(output_path / "vae", safe_serialization=True, max_shard_size="5GB", variant=variant)
+            vae.save_pretrained(args.output_path, safe_serialization=True, max_shard_size="5GB", variant=variant)

    if args.save_pipeline:
        text_encoder_id = "google/t5-v1_1-xxl"
@@ -25,7 +25,6 @@ from diffusers.utils.import_utils import is_accelerate_available
 CTX = init_empty_weights if is_accelerate_available() else nullcontext

 ckpt_ids = [
-    "Efficient-Large-Model/Sana_1600M_2Kpx_BF16/checkpoints/Sana_1600M_2Kpx_BF16.pth",
    "Efficient-Large-Model/Sana_1600M_1024px_MultiLing/checkpoints/Sana_1600M_1024px_MultiLing.pth",
    "Efficient-Large-Model/Sana_1600M_1024px_BF16/checkpoints/Sana_1600M_1024px_BF16.pth",
    "Efficient-Large-Model/Sana_1600M_512px_MultiLing/checkpoints/Sana_1600M_512px_MultiLing.pth",
@@ -88,18 +87,13 @@ def main(args):
    # y norm
    converted_state_dict["caption_norm.weight"] = state_dict.pop("attention_y_norm.weight")

-    # scheduler
    flow_shift = 3.0
-
-    # model config
    if args.model_type == "SanaMS_1600M_P1_D20":
        layer_num = 20
    elif args.model_type == "SanaMS_600M_P1_D28":
        layer_num = 28
    else:
        raise ValueError(f"{args.model_type} is not supported.")
-    # Positional embedding interpolation scale.
-    interpolation_scale = {512: None, 1024: None, 2048: 1.0}

    for depth in range(layer_num):
        # Transformer blocks.
@@ -181,7 +175,6 @@ def main(args):
            patch_size=1,
            norm_elementwise_affine=False,
            norm_eps=1e-6,
-            interpolation_scale=interpolation_scale[args.image_size],
        )

    if is_accelerate_available():
@@ -272,9 +265,9 @@ if __name__ == "__main__":
        "--image_size",
        default=1024,
        type=int,
-        choices=[512, 1024, 2048],
+        choices=[512, 1024],
        required=False,
-        help="Image size of pretrained model, 512, 1024 or 2048.",
+        help="Image size of pretrained model, 512 or 1024.",
    )
    parser.add_argument(
        "--model_type", default="SanaMS_1600M_P1_D20", type=str, choices=["SanaMS_1600M_P1_D20", "SanaMS_600M_P1_D28"]
@@ -254,7 +254,7 @@ version_range_max = max(sys.version_info[1], 10) + 1

 setup(
    name="diffusers",
-    version="0.32.2",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="0.32.0.dev0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
    description="State-of-the-art diffusion in PyTorch and JAX.",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
@@ -1,4 +1,4 @@
-__version__ = "0.32.2"
+__version__ = "0.32.0.dev0"

 from typing import TYPE_CHECKING

@@ -277,7 +277,6 @@ else:
            "CogView3PlusPipeline",
            "CycleDiffusionPipeline",
            "FluxControlImg2ImgPipeline",
-            "FluxControlInpaintPipeline",
            "FluxControlNetImg2ImgPipeline",
            "FluxControlNetInpaintPipeline",
            "FluxControlNetPipeline",
@@ -766,7 +765,6 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
            CogView3PlusPipeline,
            CycleDiffusionPipeline,
            FluxControlImg2ImgPipeline,
-            FluxControlInpaintPipeline,
            FluxControlNetImg2ImgPipeline,
            FluxControlNetInpaintPipeline,
            FluxControlNetPipeline,
@@ -55,8 +55,7 @@ _import_structure = {}

 if is_torch_available():
    _import_structure["single_file_model"] = ["FromOriginalModelMixin"]
-    _import_structure["transformer_flux"] = ["FluxTransformer2DLoadersMixin"]
-    _import_structure["transformer_sd3"] = ["SD3Transformer2DLoadersMixin"]
+
    _import_structure["unet"] = ["UNet2DConditionLoadersMixin"]
    _import_structure["utils"] = ["AttnProcsLayers"]
    if is_transformers_available():
@@ -71,15 +70,10 @@ if is_torch_available():
            "FluxLoraLoaderMixin",
            "CogVideoXLoraLoaderMixin",
            "Mochi1LoraLoaderMixin",
-            "HunyuanVideoLoraLoaderMixin",
            "SanaLoraLoaderMixin",
        ]
        _import_structure["textual_inversion"] = ["TextualInversionLoaderMixin"]
-        _import_structure["ip_adapter"] = [
-            "IPAdapterMixin",
-            "FluxIPAdapterMixin",
-            "SD3IPAdapterMixin",
-        ]
+        _import_structure["ip_adapter"] = ["IPAdapterMixin"]

 _import_structure["peft"] = ["PeftAdapterMixin"]

@@ -87,22 +81,15 @@ _import_structure["peft"] = ["PeftAdapterMixin"]
 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
    if is_torch_available():
        from .single_file_model import FromOriginalModelMixin
-        from .transformer_flux import FluxTransformer2DLoadersMixin
-        from .transformer_sd3 import SD3Transformer2DLoadersMixin
        from .unet import UNet2DConditionLoadersMixin
        from .utils import AttnProcsLayers

        if is_transformers_available():
-            from .ip_adapter import (
-                FluxIPAdapterMixin,
-                IPAdapterMixin,
-                SD3IPAdapterMixin,
-            )
+            from .ip_adapter import IPAdapterMixin
            from .lora_pipeline import (
                AmusedLoraLoaderMixin,
                CogVideoXLoraLoaderMixin,
                FluxLoraLoaderMixin,
-                HunyuanVideoLoraLoaderMixin,
                LoraLoaderMixin,
                LTXVideoLoraLoaderMixin,
                Mochi1LoraLoaderMixin,
@@ -33,20 +33,15 @@ from .unet_loader_utils import _maybe_expand_lora_scales


 if is_transformers_available():
-    from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection, SiglipImageProcessor, SiglipVisionModel
-
-from ..models.attention_processor import (
-    AttnProcessor,
-    AttnProcessor2_0,
-    FluxAttnProcessor2_0,
-    FluxIPAdapterJointAttnProcessor2_0,
-    IPAdapterAttnProcessor,
-    IPAdapterAttnProcessor2_0,
-    IPAdapterXFormersAttnProcessor,
-    JointAttnProcessor2_0,
-    SD3IPAdapterJointAttnProcessor2_0,
-)
+    from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

+    from ..models.attention_processor import (
+        AttnProcessor,
+        AttnProcessor2_0,
+        IPAdapterAttnProcessor,
+        IPAdapterAttnProcessor2_0,
+        IPAdapterXFormersAttnProcessor,
+    )

 logger = logging.get_logger(__name__)

@@ -353,519 +348,3 @@ class IPAdapterMixin:
                else value.__class__()
            )
        self.unet.set_attn_processor(attn_procs)
-
-
-class FluxIPAdapterMixin:
-    """Mixin for handling Flux IP Adapters."""
-
-    @validate_hf_hub_args
-    def load_ip_adapter(
-        self,
-        pretrained_model_name_or_path_or_dict: Union[str, List[str], Dict[str, torch.Tensor]],
-        weight_name: Union[str, List[str]],
-        subfolder: Optional[Union[str, List[str]]] = "",
-        image_encoder_pretrained_model_name_or_path: Optional[str] = "image_encoder",
-        image_encoder_subfolder: Optional[str] = "",
-        image_encoder_dtype: torch.dtype = torch.float16,
-        **kwargs,
-    ):
-        """
-        Parameters:
-            pretrained_model_name_or_path_or_dict (`str` or `List[str]` or `os.PathLike` or `List[os.PathLike]` or `dict` or `List[dict]`):
-                Can be either:
-
-                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
-                      the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-                    - A [torch state
-                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
-            subfolder (`str` or `List[str]`):
-                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
-                list is passed, it should have the same length as `weight_name`.
-            weight_name (`str` or `List[str]`):
-                The name of the weight file to load. If a list is passed, it should have the same length as
-                `subfolder`.
-            image_encoder_pretrained_model_name_or_path (`str`, *optional*, defaults to `./image_encoder`):
-                Can be either:
-
-                    - A string, the *model id* (for example `openai/clip-vit-large-patch14`) of a pretrained model
-                      hosted on the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to `True`, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-
-        # handle the list inputs for multiple IP Adapters
-        if not isinstance(weight_name, list):
-            weight_name = [weight_name]
-
-        if not isinstance(pretrained_model_name_or_path_or_dict, list):
-            pretrained_model_name_or_path_or_dict = [pretrained_model_name_or_path_or_dict]
-        if len(pretrained_model_name_or_path_or_dict) == 1:
-            pretrained_model_name_or_path_or_dict = pretrained_model_name_or_path_or_dict * len(weight_name)
-
-        if not isinstance(subfolder, list):
-            subfolder = [subfolder]
-        if len(subfolder) == 1:
-            subfolder = subfolder * len(weight_name)
-
-        if len(weight_name) != len(pretrained_model_name_or_path_or_dict):
-            raise ValueError("`weight_name` and `pretrained_model_name_or_path_or_dict` must have the same length.")
-
-        if len(weight_name) != len(subfolder):
-            raise ValueError("`weight_name` and `subfolder` must have the same length.")
-
-        # Load the main state dict first.
-        cache_dir = kwargs.pop("cache_dir", None)
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        token = kwargs.pop("token", None)
-        revision = kwargs.pop("revision", None)
-        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
-
-        if low_cpu_mem_usage and not is_accelerate_available():
-            low_cpu_mem_usage = False
-            logger.warning(
-                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                " install accelerate\n```\n."
-            )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        user_agent = {
-            "file_type": "attn_procs_weights",
-            "framework": "pytorch",
-        }
-        state_dicts = []
-        for pretrained_model_name_or_path_or_dict, weight_name, subfolder in zip(
-            pretrained_model_name_or_path_or_dict, weight_name, subfolder
-        ):
-            if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                model_file = _get_model_file(
-                    pretrained_model_name_or_path_or_dict,
-                    weights_name=weight_name,
-                    cache_dir=cache_dir,
-                    force_download=force_download,
-                    proxies=proxies,
-                    local_files_only=local_files_only,
-                    token=token,
-                    revision=revision,
-                    subfolder=subfolder,
-                    user_agent=user_agent,
-                )
-                if weight_name.endswith(".safetensors"):
-                    state_dict = {"image_proj": {}, "ip_adapter": {}}
-                    with safe_open(model_file, framework="pt", device="cpu") as f:
-                        image_proj_keys = ["ip_adapter_proj_model.", "image_proj."]
-                        ip_adapter_keys = ["double_blocks.", "ip_adapter."]
-                        for key in f.keys():
-                            if any(key.startswith(prefix) for prefix in image_proj_keys):
-                                diffusers_name = ".".join(key.split(".")[1:])
-                                state_dict["image_proj"][diffusers_name] = f.get_tensor(key)
-                            elif any(key.startswith(prefix) for prefix in ip_adapter_keys):
-                                diffusers_name = (
-                                    ".".join(key.split(".")[1:])
-                                    .replace("ip_adapter_double_stream_k_proj", "to_k_ip")
-                                    .replace("ip_adapter_double_stream_v_proj", "to_v_ip")
-                                    .replace("processor.", "")
-                                )
-                                state_dict["ip_adapter"][diffusers_name] = f.get_tensor(key)
-                else:
-                    state_dict = load_state_dict(model_file)
-            else:
-                state_dict = pretrained_model_name_or_path_or_dict
-
-            keys = list(state_dict.keys())
-            if keys != ["image_proj", "ip_adapter"]:
-                raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
-
-            state_dicts.append(state_dict)
-
-            # load CLIP image encoder here if it has not been registered to the pipeline yet
-            if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
-                if image_encoder_pretrained_model_name_or_path is not None:
-                    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                        logger.info(f"loading image_encoder from {image_encoder_pretrained_model_name_or_path}")
-                        image_encoder = (
-                            CLIPVisionModelWithProjection.from_pretrained(
-                                image_encoder_pretrained_model_name_or_path,
-                                subfolder=image_encoder_subfolder,
-                                low_cpu_mem_usage=low_cpu_mem_usage,
-                                cache_dir=cache_dir,
-                                local_files_only=local_files_only,
-                            )
-                            .to(self.device, dtype=image_encoder_dtype)
-                            .eval()
-                        )
-                        self.register_modules(image_encoder=image_encoder)
-                    else:
-                        raise ValueError(
-                            "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
-                        )
-                else:
-                    logger.warning(
-                        "image_encoder is not loaded since `image_encoder_pretrained_model_name_or_path=None` passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
-                        "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
-                    )
-
-            # create feature extractor if it has not been registered to the pipeline yet
-            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is None:
-                # FaceID IP adapters don't need the image encoder so it's not present, in this case we default to 224
-                default_clip_size = 224
-                clip_image_size = (
-                    self.image_encoder.config.image_size if self.image_encoder is not None else default_clip_size
-                )
-                feature_extractor = CLIPImageProcessor(size=clip_image_size, crop_size=clip_image_size)
-                self.register_modules(feature_extractor=feature_extractor)
-
-        # load ip-adapter into transformer
-        self.transformer._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
-
-    def set_ip_adapter_scale(self, scale: Union[float, List[float], List[List[float]]]):
-        """
-        Set IP-Adapter scales per-transformer block. Input `scale` could be a single config or a list of configs for
-        granular control over each IP-Adapter behavior. A config can be a float or a list.
-
-        A `float` is converted to a list and repeated for the number of blocks and the number of IP adapters. A
-        `List[float]` must match the number of blocks in length and is repeated for each IP adapter. A
-        `List[List[float]]` must match the number of IP adapters, and each inner list must match the number of blocks.
-
-        Example:
-
-        ```py
-        # To use original IP-Adapter
-        scale = 1.0
-        pipeline.set_ip_adapter_scale(scale)
-
-
-        def LinearStrengthModel(start, finish, size):
-            return [(start + (finish - start) * (i / (size - 1))) for i in range(size)]
-
-
-        ip_strengths = LinearStrengthModel(0.3, 0.92, 19)
-        pipeline.set_ip_adapter_scale(ip_strengths)
-        ```
-        """
-        transformer = self.transformer
-        if not isinstance(scale, list):
-            scale = [[scale] * transformer.config.num_layers]
-        elif isinstance(scale[0], (int, float)):
-            if len(scale) != transformer.config.num_layers:
-                raise ValueError(f"Expected list of {transformer.config.num_layers} scales, got {len(scale)}.")
-            scale = [scale]
-
-        scale_configs = scale
-
-        key_id = 0
-        for attn_name, attn_processor in transformer.attn_processors.items():
-            if isinstance(attn_processor, (FluxIPAdapterJointAttnProcessor2_0)):
-                if len(scale_configs) != len(attn_processor.scale):
-                    raise ValueError(
-                        f"Cannot assign {len(scale_configs)} scale_configs to "
-                        f"{len(attn_processor.scale)} IP-Adapter."
-                    )
-                elif len(scale_configs) == 1:
-                    scale_configs = scale_configs * len(attn_processor.scale)
-                for i, scale_config in enumerate(scale_configs):
-                    attn_processor.scale[i] = scale_config[key_id]
-                key_id += 1
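
The scale handling in the removed `set_ip_adapter_scale` can be sketched without a pipeline. The version below only covers the normalization step (a scalar or a flat list becomes a list of per-block lists) and omits the assignment to attention processors. `normalize_scales` and `linear_strengths` are hypothetical names, and `num_layers=19` mirrors the value used in the docstring example.

```python
def normalize_scales(scale, num_layers):
    """Return one list of per-block scales per IP adapter config (hypothetical helper)."""
    if not isinstance(scale, list):
        # A single number applies to every block.
        return [[scale] * num_layers]
    if isinstance(scale[0], (int, float)):
        # A flat list must provide exactly one value per block.
        if len(scale) != num_layers:
            raise ValueError(f"Expected list of {num_layers} scales, got {len(scale)}.")
        return [scale]
    # Already a list of lists: one inner list per IP adapter.
    return scale


def linear_strengths(start, finish, size):
    """Linearly interpolate `size` values from `start` to `finish` inclusive."""
    return [start + (finish - start) * (i / (size - 1)) for i in range(size)]


uniform = normalize_scales(1.0, num_layers=19)
ramp = normalize_scales(linear_strengths(0.3, 0.92, 19), num_layers=19)
```

Normalizing up front keeps the per-processor loop simple: it can always index `scale_config[key_id]` regardless of which input form the caller used.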
-
-    def unload_ip_adapter(self):
-        """
-        Unloads the IP Adapter weights
-
-        Examples:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.unload_ip_adapter()
-        >>> ...
-        ```
-        """
-        # remove CLIP image encoder
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
-            self.image_encoder = None
-            self.register_to_config(image_encoder=[None, None])
-
-        # remove feature extractor only when safety_checker is None as safety_checker uses
-        # the feature_extractor later
-        if not hasattr(self, "safety_checker"):
-            if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
-                self.feature_extractor = None
-                self.register_to_config(feature_extractor=[None, None])
-
-        # remove hidden encoder
-        self.transformer.encoder_hid_proj = None
-        self.transformer.config.encoder_hid_dim_type = None
-
-        # restore original Transformer attention processors layers
-        attn_procs = {}
-        for name, value in self.transformer.attn_processors.items():
-            attn_processor_class = FluxAttnProcessor2_0()
-            attn_procs[name] = (
-                attn_processor_class if isinstance(value, (FluxIPAdapterJointAttnProcessor2_0)) else value.__class__()
-            )
-        self.transformer.set_attn_processor(attn_procs)
-
-
-class SD3IPAdapterMixin:
-    """Mixin for handling StableDiffusion 3 IP Adapters."""
-
-    @property
-    def is_ip_adapter_active(self) -> bool:
-        """Checks if IP-Adapter is loaded and scale > 0.
-
-        IP-Adapter scale controls the influence of the image prompt versus text prompt. When this value is set to 0,
-        the image context is irrelevant.
-
-        Returns:
-            `bool`: True when IP-Adapter is loaded and any layer has scale > 0.
-        """
-        scales = [
-            attn_proc.scale
-            for attn_proc in self.transformer.attn_processors.values()
-            if isinstance(attn_proc, SD3IPAdapterJointAttnProcessor2_0)
-        ]
-
-        return len(scales) > 0 and any(scale > 0 for scale in scales)
-
-    @validate_hf_hub_args
-    def load_ip_adapter(
-        self,
-        pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]],
-        weight_name: str = "ip-adapter.safetensors",
-        subfolder: Optional[str] = None,
-        image_encoder_folder: Optional[str] = "image_encoder",
-        **kwargs,
-    ) -> None:
-        """
-        Parameters:
-            pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`):
-                Can be either:
-                    - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on
-                      the Hub.
-                    - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved
-                      with [`ModelMixin.save_pretrained`].
-                    - A [torch state
-                      dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict).
-            weight_name (`str`, defaults to "ip-adapter.safetensors"):
-                The name of the weight file to load. If a list is passed, it should have the same length as
-                `subfolder`.
-            subfolder (`str`, *optional*):
-                The subfolder location of a model file within a larger model repository on the Hub or locally. If a
-                list is passed, it should have the same length as `weight_name`.
-            image_encoder_folder (`str`, *optional*, defaults to `image_encoder`):
-                The subfolder location of the image encoder within a larger model repository on the Hub or locally.
-                Pass `None` to not load the image encoder. If the image encoder is located in a folder inside
-                `subfolder`, you only need to pass the name of the folder that contains image encoder weights, e.g.
-                `image_encoder_folder="image_encoder"`. If the image encoder is located in a folder other than
-                `subfolder`, you should pass the path to the folder that contains image encoder weights, for example,
-                `image_encoder_folder="different_subfolder/image_encoder"`.
-            cache_dir (`Union[str, os.PathLike]`, *optional*):
-                Path to a directory where a downloaded pretrained model configuration is cached if the standard cache
-                is not used.
-            force_download (`bool`, *optional*, defaults to `False`):
-                Whether or not to force the (re-)download of the model weights and configuration files, overriding the
-                cached versions if they exist.
-            proxies (`Dict[str, str]`, *optional*):
-                A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128',
-                'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
-            local_files_only (`bool`, *optional*, defaults to `False`):
-                Whether to only load local model weights and configuration files or not. If set to `True`, the model
-                won't be downloaded from the Hub.
-            token (`str` or *bool*, *optional*):
-                The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from
-                `diffusers-cli login` (stored in `~/.huggingface`) is used.
-            revision (`str`, *optional*, defaults to `"main"`):
-                The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier
-                allowed by Git.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-        # Load the main state dict first
-        cache_dir = kwargs.pop("cache_dir", None)
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", None)
-        token = kwargs.pop("token", None)
-        revision = kwargs.pop("revision", None)
-        low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT)
-
-        if low_cpu_mem_usage and not is_accelerate_available():
-            low_cpu_mem_usage = False
-            logger.warning(
-                "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                " install accelerate\n```\n."
-            )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        user_agent = {
-            "file_type": "attn_procs_weights",
-            "framework": "pytorch",
-        }
-
-        if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-            model_file = _get_model_file(
-                pretrained_model_name_or_path_or_dict,
-                weights_name=weight_name,
-                cache_dir=cache_dir,
-                force_download=force_download,
-                proxies=proxies,
-                local_files_only=local_files_only,
-                token=token,
-                revision=revision,
-                subfolder=subfolder,
-                user_agent=user_agent,
-            )
-            if weight_name.endswith(".safetensors"):
-                state_dict = {"image_proj": {}, "ip_adapter": {}}
-                with safe_open(model_file, framework="pt", device="cpu") as f:
-                    for key in f.keys():
-                        if key.startswith("image_proj."):
-                            state_dict["image_proj"][key.replace("image_proj.", "")] = f.get_tensor(key)
-                        elif key.startswith("ip_adapter."):
-                            state_dict["ip_adapter"][key.replace("ip_adapter.", "")] = f.get_tensor(key)
-            else:
-                state_dict = load_state_dict(model_file)
-        else:
-            state_dict = pretrained_model_name_or_path_or_dict
-
-        keys = list(state_dict.keys())
-        if "image_proj" not in keys and "ip_adapter" not in keys:
-            raise ValueError("Required keys (`image_proj` and `ip_adapter`) are missing from the state dict.")
-
-        # Load image_encoder and feature_extractor here if they haven't been registered to the pipeline yet
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is None:
-            if image_encoder_folder is not None:
-                if not isinstance(pretrained_model_name_or_path_or_dict, dict):
-                    logger.info(f"loading image_encoder from {pretrained_model_name_or_path_or_dict}")
-                    if image_encoder_folder.count("/") == 0:
-                        image_encoder_subfolder = Path(subfolder, image_encoder_folder).as_posix()
-                    else:
-                        image_encoder_subfolder = Path(image_encoder_folder).as_posix()
-
-                    # Common args for loading image encoder and image processor
-                    kwargs = {
-                        "low_cpu_mem_usage": low_cpu_mem_usage,
-                        "cache_dir": cache_dir,
-                        "local_files_only": local_files_only,
-                    }
-
-                    self.register_modules(
-                        feature_extractor=SiglipImageProcessor.from_pretrained(image_encoder_subfolder, **kwargs).to(
-                            self.device, dtype=self.dtype
-                        ),
-                        image_encoder=SiglipVisionModel.from_pretrained(image_encoder_subfolder, **kwargs).to(
-                            self.device, dtype=self.dtype
-                        ),
-                    )
-                else:
-                    raise ValueError(
-                        "`image_encoder` cannot be loaded because `pretrained_model_name_or_path_or_dict` is a state dict."
-                    )
-            else:
-                logger.warning(
-                    "image_encoder is not loaded since `image_encoder_folder=None` passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter. "
-                    "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
-                )
-
-        # Load IP-Adapter into transformer
-        self.transformer._load_ip_adapter_weights(state_dict, low_cpu_mem_usage=low_cpu_mem_usage)
-
-    def set_ip_adapter_scale(self, scale: float) -> None:
-        """
-        Set IP-Adapter scale, which controls image prompt conditioning. A value of 1.0 means the model is only
-        conditioned on the image prompt, and 0.0 only conditioned by the text prompt. Lowering this value encourages
-        the model to produce more diverse images, but they may not be as aligned with the image prompt.
-
-        Example:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.set_ip_adapter_scale(0.6)
-        >>> ...
-        ```
-
-        Args:
-            scale (float):
-                IP-Adapter scale to be set.
-
-        """
-        for attn_processor in self.transformer.attn_processors.values():
-            if isinstance(attn_processor, SD3IPAdapterJointAttnProcessor2_0):
-                attn_processor.scale = scale
-
-    def unload_ip_adapter(self) -> None:
-        """
-        Unloads the IP Adapter weights.
-
-        Example:
-
-        ```python
-        >>> # Assuming `pipeline` is already loaded with the IP Adapter weights.
-        >>> pipeline.unload_ip_adapter()
-        >>> ...
-        ```
-        """
-        # Remove image encoder
-        if hasattr(self, "image_encoder") and getattr(self, "image_encoder", None) is not None:
-            self.image_encoder = None
-            self.register_to_config(image_encoder=None)
-
-        # Remove feature extractor
-        if hasattr(self, "feature_extractor") and getattr(self, "feature_extractor", None) is not None:
-            self.feature_extractor = None
-            self.register_to_config(feature_extractor=None)
-
-        # Remove image projection
-        self.transformer.image_proj = None
-
-        # Restore original attention processors layers
-        attn_procs = {
-            name: (
-                JointAttnProcessor2_0() if isinstance(value, SD3IPAdapterJointAttnProcessor2_0) else value.__class__()
-            )
-            for name, value in self.transformer.attn_processors.items()
-        }
-        self.transformer.set_attn_processor(attn_procs)
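
The removed `load_ip_adapter` code groups a flat safetensors checkpoint into two sub-dictionaries by key prefix (`image_proj.` and `ip_adapter.`). The following is a minimal sketch of that grouping step only, using plain Python values in place of tensors. The helper name `split_ip_adapter_keys` and the sample keys are hypothetical and not part of diffusers; unlike the removed code, the sketch strips only the leading prefix rather than calling `str.replace`.

```python
from typing import Any, Dict


def split_ip_adapter_keys(flat: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group flat checkpoint entries under 'image_proj' / 'ip_adapter' by key prefix.

    Hypothetical helper for illustration; values stand in for tensors.
    """
    grouped: Dict[str, Dict[str, Any]] = {"image_proj": {}, "ip_adapter": {}}
    for key, value in flat.items():
        for group in grouped:
            prefix = f"{group}."
            if key.startswith(prefix):
                # Drop only the leading prefix so nested names stay intact.
                grouped[group][key[len(prefix):]] = value
                break
    return grouped


# Sample keys are made up; entries without a known prefix are ignored.
flat = {
    "image_proj.proj.weight": 1,
    "ip_adapter.0.to_k_ip.weight": 2,
    "unrelated.bias": 3,
}
grouped = split_ip_adapter_keys(flat)
```

Keys that match neither prefix are silently skipped, which mirrors the `if`/`elif` chain in the removed code.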
@@ -28,20 +28,13 @@ from ..models.modeling_utils import ModelMixin, load_state_dict
 from ..utils import (
    USE_PEFT_BACKEND,
    _get_model_file,
-    convert_state_dict_to_diffusers,
-    convert_state_dict_to_peft,
    delete_adapter_layers,
    deprecate,
-    get_adapter_name,
-    get_peft_kwargs,
    is_accelerate_available,
    is_peft_available,
-    is_peft_version,
    is_transformers_available,
-    is_transformers_version,
    logging,
    recurse_remove_peft_layers,
-    scale_lora_layers,
    set_adapter_layers,
    set_weights_and_activate_adapters,
 )
@@ -50,8 +43,6 @@ from ..utils import (
 if is_transformers_available():
    from transformers import PreTrainedModel

-    from ..models.lora import text_encoder_attn_modules, text_encoder_mlp_modules
-
 if is_peft_available():
    from peft.tuners.tuners_utils import BaseTunerLayer

@@ -306,152 +297,6 @@ def _best_guess_weight_name(
    return weight_name


-def _load_lora_into_text_encoder(
-    state_dict,
-    network_alphas,
-    text_encoder,
-    prefix=None,
-    lora_scale=1.0,
-    text_encoder_name="text_encoder",
-    adapter_name=None,
-    _pipeline=None,
-    low_cpu_mem_usage=False,
-):
-    if not USE_PEFT_BACKEND:
-        raise ValueError("PEFT backend is required for this method.")
-
-    peft_kwargs = {}
-    if low_cpu_mem_usage:
-        if not is_peft_version(">=", "0.13.1"):
-            raise ValueError(
-                "`low_cpu_mem_usage=True` is not compatible with this `peft` version. Please update it with `pip install -U peft`."
-            )
-        if not is_transformers_version(">", "4.45.2"):
-            # Note from sayakpaul: It's not in `transformers` stable yet.
-            # https://github.com/huggingface/transformers/pull/33725/
-            raise ValueError(
-                "`low_cpu_mem_usage=True` is not compatible with this `transformers` version. Please update it with `pip install -U transformers`."
-            )
-        peft_kwargs["low_cpu_mem_usage"] = low_cpu_mem_usage
-
-    from peft import LoraConfig
-
-    # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918),
-    # then the `state_dict` keys should have `unet_name` and/or `text_encoder_name` as
-    # their prefixes.
-    keys = list(state_dict.keys())
-    prefix = text_encoder_name if prefix is None else prefix
-
-    # Safe prefix to check with.
-    if any(text_encoder_name in key for key in keys):
-        # Load the layers corresponding to text encoder and make necessary adjustments.
-        text_encoder_keys = [k for k in keys if k.startswith(prefix) and k.split(".")[0] == prefix]
-        text_encoder_lora_state_dict = {
-            k.replace(f"{prefix}.", ""): v for k, v in state_dict.items() if k in text_encoder_keys
-        }
-
-        if len(text_encoder_lora_state_dict) > 0:
-            logger.info(f"Loading {prefix}.")
-            rank = {}
-            text_encoder_lora_state_dict = convert_state_dict_to_diffusers(text_encoder_lora_state_dict)
-
-            # convert state dict
-            text_encoder_lora_state_dict = convert_state_dict_to_peft(text_encoder_lora_state_dict)
-
-            for name, _ in text_encoder_attn_modules(text_encoder):
-                for module in ("out_proj", "q_proj", "k_proj", "v_proj"):
-                    rank_key = f"{name}.{module}.lora_B.weight"
-                    if rank_key not in text_encoder_lora_state_dict:
-                        continue
-                    rank[rank_key] = text_encoder_lora_state_dict[rank_key].shape[1]
-
-            for name, _ in text_encoder_mlp_modules(text_encoder):
-                for module in ("fc1", "fc2"):
-                    rank_key = f"{name}.{module}.lora_B.weight"
-                    if rank_key not in text_encoder_lora_state_dict:
-                        continue
-                    rank[rank_key] = text_encoder_lora_state_dict[rank_key].shape[1]
-
-            if network_alphas is not None:
-                alpha_keys = [k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix]
-                network_alphas = {k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys}
-
-            lora_config_kwargs = get_peft_kwargs(rank, network_alphas, text_encoder_lora_state_dict, is_unet=False)
-
-            if "use_dora" in lora_config_kwargs:
-                if lora_config_kwargs["use_dora"]:
-                    if is_peft_version("<", "0.9.0"):
-                        raise ValueError(
-                            "You need `peft` 0.9.0 at least to use DoRA-enabled LoRAs. Please upgrade your installation of `peft`."
-                        )
-                else:
-                    if is_peft_version("<", "0.9.0"):
-                        lora_config_kwargs.pop("use_dora")
-
-            if "lora_bias" in lora_config_kwargs:
-                if lora_config_kwargs["lora_bias"]:
-                    if is_peft_version("<=", "0.13.2"):
-                        raise ValueError(
-                            "You need `peft` 0.14.0 at least to use `bias` in LoRAs. Please upgrade your installation of `peft`."
-                        )
-                else:
-                    if is_peft_version("<=", "0.13.2"):
-                        lora_config_kwargs.pop("lora_bias")
-
-            lora_config = LoraConfig(**lora_config_kwargs)
-
-            # adapter_name
-            if adapter_name is None:
-                adapter_name = get_adapter_name(text_encoder)
-
-            is_model_cpu_offload, is_sequential_cpu_offload = _func_optionally_disable_offloading(_pipeline)
-
-            # inject LoRA layers and load the state dict
-            # in transformers we automatically check whether the adapter name is already in use or not
-            text_encoder.load_adapter(
-                adapter_name=adapter_name,
-                adapter_state_dict=text_encoder_lora_state_dict,
-                peft_config=lora_config,
-                **peft_kwargs,
-            )
-
-            # scale LoRA layers with `lora_scale`
-            scale_lora_layers(text_encoder, weight=lora_scale)
-
-            text_encoder.to(device=text_encoder.device, dtype=text_encoder.dtype)
-
-            # Offload back.
-            if is_model_cpu_offload:
-                _pipeline.enable_model_cpu_offload()
-            elif is_sequential_cpu_offload:
-                _pipeline.enable_sequential_cpu_offload()
-            # Unsafe code />
-
-
-def _func_optionally_disable_offloading(_pipeline):
-    is_model_cpu_offload = False
-    is_sequential_cpu_offload = False
-
-    if _pipeline is not None and _pipeline.hf_device_map is None:
-        for _, component in _pipeline.components.items():
-            if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
-                if not is_model_cpu_offload:
-                    is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
-                if not is_sequential_cpu_offload:
-                    is_sequential_cpu_offload = (
-                        isinstance(component._hf_hook, AlignDevicesHook)
-                        or hasattr(component._hf_hook, "hooks")
-                        and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
-                    )
-
-                logger.info(
-                    "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again."
-                )
-                remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
-
-    return (is_model_cpu_offload, is_sequential_cpu_offload)
-
-
 class LoraBaseMixin:
    """Utility class for handling LoRAs."""

@@ -482,7 +327,27 @@ class LoraBaseMixin:
            tuple:
                A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True.
        """
-        return _func_optionally_disable_offloading(_pipeline=_pipeline)
+        is_model_cpu_offload = False
+        is_sequential_cpu_offload = False
+
+        if _pipeline is not None and _pipeline.hf_device_map is None:
+            for _, component in _pipeline.components.items():
+                if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
+                    if not is_model_cpu_offload:
+                        is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
+                    if not is_sequential_cpu_offload:
+                        is_sequential_cpu_offload = (
+                            isinstance(component._hf_hook, AlignDevicesHook)
+                            or hasattr(component._hf_hook, "hooks")
+                            and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
+                        )
+
+                    logger.info(
+                        "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again."
+                    )
+                    remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
+
+        return (is_model_cpu_offload, is_sequential_cpu_offload)

    @classmethod
    def _fetch_state_dict(cls, *args, **kwargs):
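
The hunk above inlines the offload detection into `_optionally_disable_offloading`. As a rough sketch of the detection logic alone, the snippet below uses stand-in classes instead of importing `accelerate`: `CpuOffload`, `AlignDevicesHook`, `SequentialHook`, `Component`, and `detect_offloading` are all hypothetical placeholders defined here. It omits the `nn.Module` check, the `hf_device_map` guard, and the hook removal that the real method performs.

```python
class CpuOffload:
    """Stand-in for accelerate.hooks.CpuOffload (assumption, not the real class)."""


class AlignDevicesHook:
    """Stand-in for accelerate.hooks.AlignDevicesHook (assumption)."""


class SequentialHook:
    """Stand-in for a hook that wraps several hooks in a `hooks` list."""

    def __init__(self, *hooks):
        self.hooks = list(hooks)


class Component:
    """Stand-in for a pipeline component that may carry an `_hf_hook`."""

    def __init__(self, hook=None):
        if hook is not None:
            self._hf_hook = hook


def detect_offloading(components):
    """Return (is_model_cpu_offload, is_sequential_cpu_offload)."""
    is_model_cpu_offload = False
    is_sequential_cpu_offload = False
    for component in components.values():
        if not hasattr(component, "_hf_hook"):
            continue
        hook = component._hf_hook
        if not is_model_cpu_offload:
            is_model_cpu_offload = isinstance(hook, CpuOffload)
        if not is_sequential_cpu_offload:
            # `and` binds tighter than `or`; the parentheses make that explicit.
            is_sequential_cpu_offload = isinstance(hook, AlignDevicesHook) or (
                hasattr(hook, "hooks") and isinstance(hook.hooks[0], AlignDevicesHook)
            )
    return is_model_cpu_offload, is_sequential_cpu_offload


model_offload = detect_offloading({"unet": Component(CpuOffload())})
sequential = detect_offloading({"unet": Component(AlignDevicesHook())})
wrapped = detect_offloading({"unet": Component(SequentialHook(AlignDevicesHook()))})
no_hooks = detect_offloading({"vae": Component()})
```

Each flag is only updated while it is still `False`, so once any component signals an offload mode the result sticks for the rest of the loop.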
@@ -643,11 +643,7 @@ def _convert_xlabs_flux_lora_to_diffusers(old_state_dict):
                    old_state_dict,
                    new_state_dict,
                    old_key,
-                    [
-                        f"transformer.single_transformer_blocks.{block_num}.attn.to_q",
-                        f"transformer.single_transformer_blocks.{block_num}.attn.to_k",
-                        f"transformer.single_transformer_blocks.{block_num}.attn.to_v",
-                    ],
+                    [f"transformer.single_transformer_blocks.{block_num}.norm.linear"],
                )

            if "down" in old_key:
@@ -973,178 +969,3 @@ def _convert_bfl_flux_control_lora_to_diffusers(original_state_dict):
        converted_state_dict[f"transformer.{key}"] = converted_state_dict.pop(key)

    return converted_state_dict
-
-
-def _convert_hunyuan_video_lora_to_diffusers(original_state_dict):
-    converted_state_dict = {k: original_state_dict.pop(k) for k in list(original_state_dict.keys())}
-
-    def remap_norm_scale_shift_(key, state_dict):
-        weight = state_dict.pop(key)
-        shift, scale = weight.chunk(2, dim=0)
-        new_weight = torch.cat([scale, shift], dim=0)
-        state_dict[key.replace("final_layer.adaLN_modulation.1", "norm_out.linear")] = new_weight
-
-    def remap_txt_in_(key, state_dict):
-        def rename_key(key):
-            new_key = key.replace("individual_token_refiner.blocks", "token_refiner.refiner_blocks")
-            new_key = new_key.replace("adaLN_modulation.1", "norm_out.linear")
-            new_key = new_key.replace("txt_in", "context_embedder")
-            new_key = new_key.replace("t_embedder.mlp.0", "time_text_embed.timestep_embedder.linear_1")
-            new_key = new_key.replace("t_embedder.mlp.2", "time_text_embed.timestep_embedder.linear_2")
-            new_key = new_key.replace("c_embedder", "time_text_embed.text_embedder")
-            new_key = new_key.replace("mlp", "ff")
-            return new_key
-
-        if "self_attn_qkv" in key:
-            weight = state_dict.pop(key)
-            to_q, to_k, to_v = weight.chunk(3, dim=0)
-            state_dict[rename_key(key.replace("self_attn_qkv", "attn.to_q"))] = to_q
-            state_dict[rename_key(key.replace("self_attn_qkv", "attn.to_k"))] = to_k
-            state_dict[rename_key(key.replace("self_attn_qkv", "attn.to_v"))] = to_v
-        else:
-            state_dict[rename_key(key)] = state_dict.pop(key)
-
-    def remap_img_attn_qkv_(key, state_dict):
-        weight = state_dict.pop(key)
-        if "lora_A" in key:
-            state_dict[key.replace("img_attn_qkv", "attn.to_q")] = weight
-            state_dict[key.replace("img_attn_qkv", "attn.to_k")] = weight
-            state_dict[key.replace("img_attn_qkv", "attn.to_v")] = weight
-        else:
-            to_q, to_k, to_v = weight.chunk(3, dim=0)
-            state_dict[key.replace("img_attn_qkv", "attn.to_q")] = to_q
-            state_dict[key.replace("img_attn_qkv", "attn.to_k")] = to_k
-            state_dict[key.replace("img_attn_qkv", "attn.to_v")] = to_v
-
-    def remap_txt_attn_qkv_(key, state_dict):
-        weight = state_dict.pop(key)
-        if "lora_A" in key:
-            state_dict[key.replace("txt_attn_qkv", "attn.add_q_proj")] = weight
-            state_dict[key.replace("txt_attn_qkv", "attn.add_k_proj")] = weight
-            state_dict[key.replace("txt_attn_qkv", "attn.add_v_proj")] = weight
-        else:
-            to_q, to_k, to_v = weight.chunk(3, dim=0)
-            state_dict[key.replace("txt_attn_qkv", "attn.add_q_proj")] = to_q
-            state_dict[key.replace("txt_attn_qkv", "attn.add_k_proj")] = to_k
-            state_dict[key.replace("txt_attn_qkv", "attn.add_v_proj")] = to_v
-
-    def remap_single_transformer_blocks_(key, state_dict):
-        hidden_size = 3072
-
-        if "linear1.lora_A.weight" in key or "linear1.lora_B.weight" in key:
-            linear1_weight = state_dict.pop(key)
-            if "lora_A" in key:
-                new_key = key.replace("single_blocks", "single_transformer_blocks").removesuffix(
-                    ".linear1.lora_A.weight"
-                )
-                state_dict[f"{new_key}.attn.to_q.lora_A.weight"] = linear1_weight
-                state_dict[f"{new_key}.attn.to_k.lora_A.weight"] = linear1_weight
-                state_dict[f"{new_key}.attn.to_v.lora_A.weight"] = linear1_weight
-                state_dict[f"{new_key}.proj_mlp.lora_A.weight"] = linear1_weight
-            else:
-                split_size = (hidden_size, hidden_size, hidden_size, linear1_weight.size(0) - 3 * hidden_size)
-                q, k, v, mlp = torch.split(linear1_weight, split_size, dim=0)
-                new_key = key.replace("single_blocks", "single_transformer_blocks").removesuffix(
-                    ".linear1.lora_B.weight"
-                )
-                state_dict[f"{new_key}.attn.to_q.lora_B.weight"] = q
-                state_dict[f"{new_key}.attn.to_k.lora_B.weight"] = k
-                state_dict[f"{new_key}.attn.to_v.lora_B.weight"] = v
-                state_dict[f"{new_key}.proj_mlp.lora_B.weight"] = mlp
-
-        elif "linear1.lora_A.bias" in key or "linear1.lora_B.bias" in key:
-            linear1_bias = state_dict.pop(key)
-            if "lora_A" in key:
-                new_key = key.replace("single_blocks", "single_transformer_blocks").removesuffix(
-                    ".linear1.lora_A.bias"
-                )
-                state_dict[f"{new_key}.attn.to_q.lora_A.bias"] = linear1_bias
-                state_dict[f"{new_key}.attn.to_k.lora_A.bias"] = linear1_bias
-                state_dict[f"{new_key}.attn.to_v.lora_A.bias"] = linear1_bias
-                state_dict[f"{new_key}.proj_mlp.lora_A.bias"] = linear1_bias
-            else:
-                split_size = (hidden_size, hidden_size, hidden_size, linear1_bias.size(0) - 3 * hidden_size)
-                q_bias, k_bias, v_bias, mlp_bias = torch.split(linear1_bias, split_size, dim=0)
-                new_key = key.replace("single_blocks", "single_transformer_blocks").removesuffix(
-                    ".linear1.lora_B.bias"
-                )
-                state_dict[f"{new_key}.attn.to_q.lora_B.bias"] = q_bias
-                state_dict[f"{new_key}.attn.to_k.lora_B.bias"] = k_bias
-                state_dict[f"{new_key}.attn.to_v.lora_B.bias"] = v_bias
-                state_dict[f"{new_key}.proj_mlp.lora_B.bias"] = mlp_bias
-
-        else:
-            new_key = key.replace("single_blocks", "single_transformer_blocks")
-            new_key = new_key.replace("linear2", "proj_out")
-            new_key = new_key.replace("q_norm", "attn.norm_q")
-            new_key = new_key.replace("k_norm", "attn.norm_k")
-            state_dict[new_key] = state_dict.pop(key)
-
-    TRANSFORMER_KEYS_RENAME_DICT = {
-        "img_in": "x_embedder",
-        "time_in.mlp.0": "time_text_embed.timestep_embedder.linear_1",
-        "time_in.mlp.2": "time_text_embed.timestep_embedder.linear_2",
-        "guidance_in.mlp.0": "time_text_embed.guidance_embedder.linear_1",
-        "guidance_in.mlp.2": "time_text_embed.guidance_embedder.linear_2",
-        "vector_in.in_layer": "time_text_embed.text_embedder.linear_1",
-        "vector_in.out_layer": "time_text_embed.text_embedder.linear_2",
-        "double_blocks": "transformer_blocks",
-        "img_attn_q_norm": "attn.norm_q",
-        "img_attn_k_norm": "attn.norm_k",
-        "img_attn_proj": "attn.to_out.0",
-        "txt_attn_q_norm": "attn.norm_added_q",
-        "txt_attn_k_norm": "attn.norm_added_k",
-        "txt_attn_proj": "attn.to_add_out",
-        "img_mod.linear": "norm1.linear",
-        "img_norm1": "norm1.norm",
-        "img_norm2": "norm2",
-        "img_mlp": "ff",
-        "txt_mod.linear": "norm1_context.linear",
-        "txt_norm1": "norm1.norm",
-        "txt_norm2": "norm2_context",
-        "txt_mlp": "ff_context",
-        "self_attn_proj": "attn.to_out.0",
-        "modulation.linear": "norm.linear",
-        "pre_norm": "norm.norm",
-        "final_layer.norm_final": "norm_out.norm",
-        "final_layer.linear": "proj_out",
-        "fc1": "net.0.proj",
-        "fc2": "net.2",
-        "input_embedder": "proj_in",
-    }
-
-    TRANSFORMER_SPECIAL_KEYS_REMAP = {
-        "txt_in": remap_txt_in_,
-        "img_attn_qkv": remap_img_attn_qkv_,
-        "txt_attn_qkv": remap_txt_attn_qkv_,
-        "single_blocks": remap_single_transformer_blocks_,
-        "final_layer.adaLN_modulation.1": remap_norm_scale_shift_,
-    }
-
-    # Some folks attempt to make their state dict compatible with diffusers by adding "transformer." prefix to all keys
-    # and use their custom code. To make sure both "original" and "attempted diffusers" loras work as expected, we make
-    # sure that both follow the same initial format by stripping off the "transformer." prefix.
-    for key in list(converted_state_dict.keys()):
-        if key.startswith("transformer."):
-            converted_state_dict[key[len("transformer.") :]] = converted_state_dict.pop(key)
-        if key.startswith("diffusion_model."):
-            converted_state_dict[key[len("diffusion_model.") :]] = converted_state_dict.pop(key)
-
-    # Rename and remap the state dict keys
-    for key in list(converted_state_dict.keys()):
-        new_key = key[:]
-        for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
-            new_key = new_key.replace(replace_key, rename_key)
-        converted_state_dict[new_key] = converted_state_dict.pop(key)
-
-    for key in list(converted_state_dict.keys()):
-        for special_key, handler_fn_inplace in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
-            if special_key not in key:
-                continue
-            handler_fn_inplace(key, converted_state_dict)
-
-    # Add back the "transformer." prefix
-    for key in list(converted_state_dict.keys()):
-        converted_state_dict[f"transformer.{key}"] = converted_state_dict.pop(key)
-
-    return converted_state_dict
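
The removed HunyuanVideo converter splits LoRA weights trained on a fused QKV projection into per-projection weights: the `lora_A` (down) matrix is reused for every target, while the `lora_B` (up) matrix is chunked along its output dimension. This works because the fused update is `B @ A`, and the rows of `B` correspond to output features. A minimal sketch of that idea follows, using nested lists in place of tensors so that it runs without `torch`; `split_rows`, `split_fused_lora`, and the sample key prefix are hypothetical names, not diffusers APIs.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def split_rows(weight: Matrix, sizes: Sequence[int]) -> List[Matrix]:
    """Split a matrix along dim 0 into consecutive row blocks of the given sizes."""
    if sum(sizes) != len(weight):
        raise ValueError("split sizes must add up to the number of rows")
    parts, start = [], 0
    for size in sizes:
        parts.append(weight[start:start + size])
        start += size
    return parts


def split_fused_lora(
    prefix: str, lora_a: Matrix, lora_b: Matrix, targets: Sequence[str]
) -> Dict[str, Matrix]:
    """Share lora_A across targets and give each target an equal chunk of lora_B."""
    size, remainder = divmod(len(lora_b), len(targets))
    if remainder:
        raise ValueError("lora_B rows must divide evenly across targets")
    converted: Dict[str, Matrix] = {}
    chunks = split_rows(lora_b, [size] * len(targets))
    for target, b_part in zip(targets, chunks):
        converted[f"{prefix}.{target}.lora_A.weight"] = lora_a  # shared down-projection
        converted[f"{prefix}.{target}.lora_B.weight"] = b_part  # per-target up-projection
    return converted


# Rank-1 LoRA on a fused projection with 2 output features per target (made-up numbers).
lora_a = [[0.1, 0.2]]                                   # shape (rank, in_features)
lora_b = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]     # shape (3 * out_features, rank)
converted = split_fused_lora("blocks.0.attn", lora_a, lora_b, ["to_q", "to_k", "to_v"])
```

The removed `remap_single_transformer_blocks_` handler follows the same pattern with unequal sizes (three blocks of `hidden_size` rows plus the remainder for `proj_mlp`), which `split_rows` above also supports.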
@@ -20,6 +20,7 @@ from typing import Dict, List, Optional, Union

 import safetensors
 import torch
+import torch.nn as nn

 from ..utils import (
    MIN_PEFT_VERSION,
@@ -29,16 +30,20 @@ from ..utils import (
    delete_adapter_layers,
    get_adapter_name,
    get_peft_kwargs,
+    is_accelerate_available,
    is_peft_available,
    is_peft_version,
    logging,
    set_adapter_layers,
    set_weights_and_activate_adapters,
 )
-from .lora_base import _fetch_state_dict, _func_optionally_disable_offloading
+from .lora_base import _fetch_state_dict
 from .unet_loader_utils import _maybe_expand_lora_scales


+if is_accelerate_available():
+    from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module
+
 logger = logging.get_logger(__name__)

 _SET_ADAPTER_SCALE_FN_MAPPING = {
@@ -48,7 +53,6 @@ _SET_ADAPTER_SCALE_FN_MAPPING = {
    "FluxTransformer2DModel": lambda model_cls, weights: weights,
    "CogVideoXTransformer3DModel": lambda model_cls, weights: weights,
    "MochiTransformer3DModel": lambda model_cls, weights: weights,
-    "HunyuanVideoTransformer3DModel": lambda model_cls, weights: weights,
    "LTXVideoTransformer3DModel": lambda model_cls, weights: weights,
    "SanaTransformer2DModel": lambda model_cls, weights: weights,
 }
@@ -135,7 +139,27 @@ class PeftAdapterMixin:
            tuple:
                A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True.
        """
-        return _func_optionally_disable_offloading(_pipeline=_pipeline)
+        is_model_cpu_offload = False
+        is_sequential_cpu_offload = False
+
+        if _pipeline is not None and _pipeline.hf_device_map is None:
+            for _, component in _pipeline.components.items():
+                if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
+                    if not is_model_cpu_offload:
+                        is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
+                    if not is_sequential_cpu_offload:
+                        is_sequential_cpu_offload = (
+                            isinstance(component._hf_hook, AlignDevicesHook)
+                            or hasattr(component._hf_hook, "hooks")
+                            and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
+                        )
+
+                    logger.info(
+                        "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again."
+                    )
+                    remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
+
+        return (is_model_cpu_offload, is_sequential_cpu_offload)

    def load_lora_adapter(self, pretrained_model_name_or_path_or_dict, prefix="transformer", **kwargs):
        r"""
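Review note: the offloading check inlined in the hunk above relies on `and` binding tighter than `or`, so the sequential-offload test reads as "is an `AlignDevicesHook`, or is a container whose first hook is one". The sketch below reproduces only that detection logic. The three classes are hypothetical stand-ins, not the real `accelerate.hooks` classes, and `detect_offload` is an illustrative name that does not appear in the diff; hook removal is left out.

```python
class CpuOffload:
    """Hypothetical stand-in for accelerate.hooks.CpuOffload."""


class AlignDevicesHook:
    """Hypothetical stand-in for accelerate.hooks.AlignDevicesHook."""


class HookContainer:
    """Hypothetical stand-in for a hook object that exposes a `.hooks` list."""

    def __init__(self, *hooks):
        self.hooks = list(hooks)


def detect_offload(hooks):
    """Return (is_model_cpu_offload, is_sequential_cpu_offload) for a list of hooks."""
    is_model_cpu_offload = False
    is_sequential_cpu_offload = False
    for hook in hooks:
        if not is_model_cpu_offload:
            is_model_cpu_offload = isinstance(hook, CpuOffload)
        if not is_sequential_cpu_offload:
            # Explicit parentheses spell out the precedence the diff relies on.
            is_sequential_cpu_offload = isinstance(hook, AlignDevicesHook) or (
                hasattr(hook, "hooks") and isinstance(hook.hooks[0], AlignDevicesHook)
            )
    return is_model_cpu_offload, is_sequential_cpu_offload
```

Once a flag becomes `True` it stays set, so a pipeline mixing hooked and unhooked components still reports the offloading mode it uses.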
@@ -28,12 +28,10 @@ from .single_file_utils import (
    convert_autoencoder_dc_checkpoint_to_diffusers,
    convert_controlnet_checkpoint,
    convert_flux_transformer_checkpoint_to_diffusers,
-    convert_hunyuan_video_transformer_to_diffusers,
    convert_ldm_unet_checkpoint,
    convert_ldm_vae_checkpoint,
    convert_ltx_transformer_checkpoint_to_diffusers,
    convert_ltx_vae_checkpoint_to_diffusers,
-    convert_mochi_transformer_checkpoint_to_diffusers,
    convert_sd3_transformer_checkpoint_to_diffusers,
    convert_stable_cascade_unet_single_file_to_diffusers,
    create_controlnet_diffusers_config_from_ldm,
@@ -98,14 +96,6 @@ SINGLE_FILE_LOADABLE_CLASSES = {
        "default_subfolder": "vae",
    },
    "AutoencoderDC": {"checkpoint_mapping_fn": convert_autoencoder_dc_checkpoint_to_diffusers},
-    "MochiTransformer3DModel": {
-        "checkpoint_mapping_fn": convert_mochi_transformer_checkpoint_to_diffusers,
-        "default_subfolder": "transformer",
-    },
-    "HunyuanVideoTransformer3DModel": {
-        "checkpoint_mapping_fn": convert_hunyuan_video_transformer_to_diffusers,
-        "default_subfolder": "transformer",
-    },
 }


@@ -225,7 +215,6 @@ class FromOriginalModelMixin:
        local_files_only = kwargs.pop("local_files_only", None)
        subfolder = kwargs.pop("subfolder", None)
        revision = kwargs.pop("revision", None)
-        config_revision = kwargs.pop("config_revision", None)
        torch_dtype = kwargs.pop("torch_dtype", None)
        quantization_config = kwargs.pop("quantization_config", None)
        device = kwargs.pop("device", None)
@@ -303,7 +292,7 @@ class FromOriginalModelMixin:
                subfolder=subfolder,
                local_files_only=local_files_only,
                token=token,
-                revision=config_revision,
+                revision=revision,
            )
            expected_kwargs, optional_kwargs = cls._get_signature_keys(cls)

@@ -99,16 +99,13 @@ CHECKPOINT_KEY_NAMES = {
        "model.diffusion_model.double_blocks.0.img_attn.norm.key_norm.scale",
    ],
    "ltx-video": [
-        "model.diffusion_model.patchify_proj.weight",
-        "model.diffusion_model.transformer_blocks.27.scale_shift_table",
-        "patchify_proj.weight",
-        "transformer_blocks.27.scale_shift_table",
-        "vae.per_channel_statistics.mean-of-means",
+        (
+            "model.diffusion_model.patchify_proj.weight",
+            "model.diffusion_model.transformer_blocks.27.scale_shift_table",
+        ),
    ],
    "autoencoder-dc": "decoder.stages.1.op_list.0.main.conv.conv.bias",
    "autoencoder-dc-sana": "encoder.project_in.conv.bias",
-    "mochi-1-preview": ["model.diffusion_model.blocks.0.attn.qkv_x.weight", "blocks.0.attn.qkv_x.weight"],
-    "hunyuan-video": "txt_in.individual_token_refiner.blocks.0.adaLN_modulation.1.bias",
 }

 DIFFUSERS_DEFAULT_PIPELINE_PATHS = {
@@ -154,17 +151,12 @@ DIFFUSERS_DEFAULT_PIPELINE_PATHS = {
    "animatediff_scribble": {"pretrained_model_name_or_path": "guoyww/animatediff-sparsectrl-scribble"},
    "animatediff_rgb": {"pretrained_model_name_or_path": "guoyww/animatediff-sparsectrl-rgb"},
    "flux-dev": {"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev"},
-    "flux-fill": {"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-Fill-dev"},
-    "flux-depth": {"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-Depth-dev"},
    "flux-schnell": {"pretrained_model_name_or_path": "black-forest-labs/FLUX.1-schnell"},
-    "ltx-video": {"pretrained_model_name_or_path": "diffusers/LTX-Video-0.9.0"},
-    "ltx-video-0.9.1": {"pretrained_model_name_or_path": "diffusers/LTX-Video-0.9.1"},
+    "ltx-video": {"pretrained_model_name_or_path": "Lightricks/LTX-Video"},
    "autoencoder-dc-f128c512": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers"},
    "autoencoder-dc-f64c128": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers"},
    "autoencoder-dc-f32c32": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers"},
    "autoencoder-dc-f32c32-sana": {"pretrained_model_name_or_path": "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers"},
-    "mochi-1-preview": {"pretrained_model_name_or_path": "genmo/mochi-1-preview"},
-    "hunyuan-video": {"pretrained_model_name_or_path": "hunyuanvideo-community/HunyuanVideo"},
 }

 # Use to configure model sample size when original config is provided
@@ -595,25 +587,12 @@ def infer_diffusers_model_type(checkpoint):
        if any(
            g in checkpoint for g in ["guidance_in.in_layer.bias", "model.diffusion_model.guidance_in.in_layer.bias"]
        ):
-            if "model.diffusion_model.img_in.weight" in checkpoint:
-                key = "model.diffusion_model.img_in.weight"
-            else:
-                key = "img_in.weight"
-
-            if checkpoint[key].shape[1] == 384:
-                model_type = "flux-fill"
-            elif checkpoint[key].shape[1] == 128:
-                model_type = "flux-depth"
-            else:
-                model_type = "flux-dev"
+            model_type = "flux-dev"
        else:
            model_type = "flux-schnell"

-    elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["ltx-video"]):
-        if "vae.decoder.last_time_embedder.timestep_embedder.linear_1.weight" in checkpoint:
-            model_type = "ltx-video-0.9.1"
-        else:
-            model_type = "ltx-video"
+    elif any(all(key in checkpoint for key in key_list) for key_list in CHECKPOINT_KEY_NAMES["ltx-video"]):
+        model_type = "ltx-video"

    elif CHECKPOINT_KEY_NAMES["autoencoder-dc"] in checkpoint:
        encoder_key = "encoder.project_in.conv.conv.bias"
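Review note: the `ltx-video` branch above switches from "any listed key is present" to "every key of at least one listed group is present". A minimal sketch of that matching rule follows, using a plain dict as the checkpoint. `KEY_GROUPS` copies the tuple from the diff; `matches_any_group` is an illustrative helper name, not a function in the library.

```python
# One tuple per acceptable key group, copied from the "ltx-video" entry in the diff.
KEY_GROUPS = [
    (
        "model.diffusion_model.patchify_proj.weight",
        "model.diffusion_model.transformer_blocks.27.scale_shift_table",
    ),
]


def matches_any_group(checkpoint, key_groups):
    """True if every key of at least one group is present in the checkpoint."""
    return any(all(key in checkpoint for key in group) for group in key_groups)
```

With this rule a checkpoint that contains only `patchify_proj.weight` no longer matches, and the unprefixed keys that the old list accepted are not matched either.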
@@ -631,12 +610,6 @@ def infer_diffusers_model_type(checkpoint):
        else:
            model_type = "autoencoder-dc-f128c512"

-    elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["mochi-1-preview"]):
-        model_type = "mochi-1-preview"
-
-    elif CHECKPOINT_KEY_NAMES["hunyuan-video"] in checkpoint:
-        model_type = "hunyuan-video"
-
    else:
        model_type = "v1"

@@ -1777,12 +1750,6 @@ def swap_scale_shift(weight, dim):
    return new_weight


-def swap_proj_gate(weight):
-    proj, gate = weight.chunk(2, dim=0)
-    new_weight = torch.cat([gate, proj], dim=0)
-    return new_weight
-
-
 def get_attn2_layers(state_dict):
    attn2_layers = []
    for key in state_dict.keys():
@@ -2280,7 +2247,9 @@ def convert_flux_transformer_checkpoint_to_diffusers(checkpoint, **kwargs):


 def convert_ltx_transformer_checkpoint_to_diffusers(checkpoint, **kwargs):
-    converted_state_dict = {key: checkpoint.pop(key) for key in list(checkpoint.keys()) if "vae" not in key}
+    converted_state_dict = {
+        key: checkpoint.pop(key) for key in list(checkpoint.keys()) if "model.diffusion_model." in key
+    }

    TRANSFORMER_KEYS_RENAME_DICT = {
        "model.diffusion_model.": "",
@@ -2346,32 +2315,12 @@ def convert_ltx_vae_checkpoint_to_diffusers(checkpoint, **kwargs):
        "per_channel_statistics.std-of-means": "latents_std",
    }

-    VAE_091_RENAME_DICT = {
-        # decoder
-        "up_blocks.0": "mid_block",
-        "up_blocks.1": "up_blocks.0.upsamplers.0",
-        "up_blocks.2": "up_blocks.0",
-        "up_blocks.3": "up_blocks.1.upsamplers.0",
-        "up_blocks.4": "up_blocks.1",
-        "up_blocks.5": "up_blocks.2.upsamplers.0",
-        "up_blocks.6": "up_blocks.2",
-        "up_blocks.7": "up_blocks.3.upsamplers.0",
-        "up_blocks.8": "up_blocks.3",
-        # common
-        "last_time_embedder": "time_embedder",
-        "last_scale_shift_table": "scale_shift_table",
-    }
-
    VAE_SPECIAL_KEYS_REMAP = {
        "per_channel_statistics.channel": remove_keys_,
        "per_channel_statistics.mean-of-means": remove_keys_,
        "per_channel_statistics.mean-of-stds": remove_keys_,
-        "timestep_scale_multiplier": remove_keys_,
    }

-    if "vae.decoder.last_time_embedder.timestep_embedder.linear_1.weight" in converted_state_dict:
-        VAE_KEYS_RENAME_DICT.update(VAE_091_RENAME_DICT)
-
    for key in list(converted_state_dict.keys()):
        new_key = key
        for replace_key, rename_key in VAE_KEYS_RENAME_DICT.items():
@@ -2457,231 +2406,3 @@ def convert_autoencoder_dc_checkpoint_to_diffusers(checkpoint, **kwargs):
            handler_fn_inplace(key, converted_state_dict)

    return converted_state_dict
-
-
-def convert_mochi_transformer_checkpoint_to_diffusers(checkpoint, **kwargs):
-    new_state_dict = {}
-
-    # Comfy checkpoints add this prefix
-    keys = list(checkpoint.keys())
-    for k in keys:
-        if "model.diffusion_model." in k:
-            checkpoint[k.replace("model.diffusion_model.", "")] = checkpoint.pop(k)
-
-    # Convert patch_embed
-    new_state_dict["patch_embed.proj.weight"] = checkpoint.pop("x_embedder.proj.weight")
-    new_state_dict["patch_embed.proj.bias"] = checkpoint.pop("x_embedder.proj.bias")
-
-    # Convert time_embed
-    new_state_dict["time_embed.timestep_embedder.linear_1.weight"] = checkpoint.pop("t_embedder.mlp.0.weight")
-    new_state_dict["time_embed.timestep_embedder.linear_1.bias"] = checkpoint.pop("t_embedder.mlp.0.bias")
-    new_state_dict["time_embed.timestep_embedder.linear_2.weight"] = checkpoint.pop("t_embedder.mlp.2.weight")
-    new_state_dict["time_embed.timestep_embedder.linear_2.bias"] = checkpoint.pop("t_embedder.mlp.2.bias")
-    new_state_dict["time_embed.pooler.to_kv.weight"] = checkpoint.pop("t5_y_embedder.to_kv.weight")
-    new_state_dict["time_embed.pooler.to_kv.bias"] = checkpoint.pop("t5_y_embedder.to_kv.bias")
-    new_state_dict["time_embed.pooler.to_q.weight"] = checkpoint.pop("t5_y_embedder.to_q.weight")
-    new_state_dict["time_embed.pooler.to_q.bias"] = checkpoint.pop("t5_y_embedder.to_q.bias")
-    new_state_dict["time_embed.pooler.to_out.weight"] = checkpoint.pop("t5_y_embedder.to_out.weight")
-    new_state_dict["time_embed.pooler.to_out.bias"] = checkpoint.pop("t5_y_embedder.to_out.bias")
-    new_state_dict["time_embed.caption_proj.weight"] = checkpoint.pop("t5_yproj.weight")
-    new_state_dict["time_embed.caption_proj.bias"] = checkpoint.pop("t5_yproj.bias")
-
-    # Convert transformer blocks
-    num_layers = 48
-    for i in range(num_layers):
-        block_prefix = f"transformer_blocks.{i}."
-        old_prefix = f"blocks.{i}."
-
-        # norm1
-        new_state_dict[block_prefix + "norm1.linear.weight"] = checkpoint.pop(old_prefix + "mod_x.weight")
-        new_state_dict[block_prefix + "norm1.linear.bias"] = checkpoint.pop(old_prefix + "mod_x.bias")
-        if i < num_layers - 1:
-            new_state_dict[block_prefix + "norm1_context.linear.weight"] = checkpoint.pop(old_prefix + "mod_y.weight")
-            new_state_dict[block_prefix + "norm1_context.linear.bias"] = checkpoint.pop(old_prefix + "mod_y.bias")
-        else:
-            new_state_dict[block_prefix + "norm1_context.linear_1.weight"] = checkpoint.pop(
-                old_prefix + "mod_y.weight"
-            )
-            new_state_dict[block_prefix + "norm1_context.linear_1.bias"] = checkpoint.pop(old_prefix + "mod_y.bias")
-
-        # Visual attention
-        qkv_weight = checkpoint.pop(old_prefix + "attn.qkv_x.weight")
-        q, k, v = qkv_weight.chunk(3, dim=0)
-
-        new_state_dict[block_prefix + "attn1.to_q.weight"] = q
-        new_state_dict[block_prefix + "attn1.to_k.weight"] = k
-        new_state_dict[block_prefix + "attn1.to_v.weight"] = v
-        new_state_dict[block_prefix + "attn1.norm_q.weight"] = checkpoint.pop(old_prefix + "attn.q_norm_x.weight")
-        new_state_dict[block_prefix + "attn1.norm_k.weight"] = checkpoint.pop(old_prefix + "attn.k_norm_x.weight")
-        new_state_dict[block_prefix + "attn1.to_out.0.weight"] = checkpoint.pop(old_prefix + "attn.proj_x.weight")
-        new_state_dict[block_prefix + "attn1.to_out.0.bias"] = checkpoint.pop(old_prefix + "attn.proj_x.bias")
-
-        # Context attention
-        qkv_weight = checkpoint.pop(old_prefix + "attn.qkv_y.weight")
-        q, k, v = qkv_weight.chunk(3, dim=0)
-
-        new_state_dict[block_prefix + "attn1.add_q_proj.weight"] = q
-        new_state_dict[block_prefix + "attn1.add_k_proj.weight"] = k
-        new_state_dict[block_prefix + "attn1.add_v_proj.weight"] = v
-        new_state_dict[block_prefix + "attn1.norm_added_q.weight"] = checkpoint.pop(
-            old_prefix + "attn.q_norm_y.weight"
-        )
-        new_state_dict[block_prefix + "attn1.norm_added_k.weight"] = checkpoint.pop(
-            old_prefix + "attn.k_norm_y.weight"
-        )
-        if i < num_layers - 1:
-            new_state_dict[block_prefix + "attn1.to_add_out.weight"] = checkpoint.pop(
-                old_prefix + "attn.proj_y.weight"
-            )
-            new_state_dict[block_prefix + "attn1.to_add_out.bias"] = checkpoint.pop(old_prefix + "attn.proj_y.bias")
-
-        # MLP
-        new_state_dict[block_prefix + "ff.net.0.proj.weight"] = swap_proj_gate(
-            checkpoint.pop(old_prefix + "mlp_x.w1.weight")
-        )
-        new_state_dict[block_prefix + "ff.net.2.weight"] = checkpoint.pop(old_prefix + "mlp_x.w2.weight")
-        if i < num_layers - 1:
-            new_state_dict[block_prefix + "ff_context.net.0.proj.weight"] = swap_proj_gate(
-                checkpoint.pop(old_prefix + "mlp_y.w1.weight")
-            )
-            new_state_dict[block_prefix + "ff_context.net.2.weight"] = checkpoint.pop(old_prefix + "mlp_y.w2.weight")
-
-    # Output layers
-    new_state_dict["norm_out.linear.weight"] = swap_scale_shift(checkpoint.pop("final_layer.mod.weight"), dim=0)
-    new_state_dict["norm_out.linear.bias"] = swap_scale_shift(checkpoint.pop("final_layer.mod.bias"), dim=0)
-    new_state_dict["proj_out.weight"] = checkpoint.pop("final_layer.linear.weight")
-    new_state_dict["proj_out.bias"] = checkpoint.pop("final_layer.linear.bias")
-
-    new_state_dict["pos_frequencies"] = checkpoint.pop("pos_frequencies")
-
-    return new_state_dict
-
-
-def convert_hunyuan_video_transformer_to_diffusers(checkpoint, **kwargs):
-    def remap_norm_scale_shift_(key, state_dict):
-        weight = state_dict.pop(key)
-        shift, scale = weight.chunk(2, dim=0)
-        new_weight = torch.cat([scale, shift], dim=0)
-        state_dict[key.replace("final_layer.adaLN_modulation.1", "norm_out.linear")] = new_weight
-
-    def remap_txt_in_(key, state_dict):
-        def rename_key(key):
-            new_key = key.replace("individual_token_refiner.blocks", "token_refiner.refiner_blocks")
-            new_key = new_key.replace("adaLN_modulation.1", "norm_out.linear")
-            new_key = new_key.replace("txt_in", "context_embedder")
-            new_key = new_key.replace("t_embedder.mlp.0", "time_text_embed.timestep_embedder.linear_1")
-            new_key = new_key.replace("t_embedder.mlp.2", "time_text_embed.timestep_embedder.linear_2")
-            new_key = new_key.replace("c_embedder", "time_text_embed.text_embedder")
-            new_key = new_key.replace("mlp", "ff")
-            return new_key
-
-        if "self_attn_qkv" in key:
-            weight = state_dict.pop(key)
-            to_q, to_k, to_v = weight.chunk(3, dim=0)
-            state_dict[rename_key(key.replace("self_attn_qkv", "attn.to_q"))] = to_q
-            state_dict[rename_key(key.replace("self_attn_qkv", "attn.to_k"))] = to_k
-            state_dict[rename_key(key.replace("self_attn_qkv", "attn.to_v"))] = to_v
-        else:
-            state_dict[rename_key(key)] = state_dict.pop(key)
-
-    def remap_img_attn_qkv_(key, state_dict):
-        weight = state_dict.pop(key)
-        to_q, to_k, to_v = weight.chunk(3, dim=0)
-        state_dict[key.replace("img_attn_qkv", "attn.to_q")] = to_q
-        state_dict[key.replace("img_attn_qkv", "attn.to_k")] = to_k
-        state_dict[key.replace("img_attn_qkv", "attn.to_v")] = to_v
-
-    def remap_txt_attn_qkv_(key, state_dict):
-        weight = state_dict.pop(key)
-        to_q, to_k, to_v = weight.chunk(3, dim=0)
-        state_dict[key.replace("txt_attn_qkv", "attn.add_q_proj")] = to_q
-        state_dict[key.replace("txt_attn_qkv", "attn.add_k_proj")] = to_k
-        state_dict[key.replace("txt_attn_qkv", "attn.add_v_proj")] = to_v
-
-    def remap_single_transformer_blocks_(key, state_dict):
-        hidden_size = 3072
-
-        if "linear1.weight" in key:
-            linear1_weight = state_dict.pop(key)
-            split_size = (hidden_size, hidden_size, hidden_size, linear1_weight.size(0) - 3 * hidden_size)
-            q, k, v, mlp = torch.split(linear1_weight, split_size, dim=0)
-            new_key = key.replace("single_blocks", "single_transformer_blocks").removesuffix(".linear1.weight")
-            state_dict[f"{new_key}.attn.to_q.weight"] = q
-            state_dict[f"{new_key}.attn.to_k.weight"] = k
-            state_dict[f"{new_key}.attn.to_v.weight"] = v
-            state_dict[f"{new_key}.proj_mlp.weight"] = mlp
-
-        elif "linear1.bias" in key:
-            linear1_bias = state_dict.pop(key)
-            split_size = (hidden_size, hidden_size, hidden_size, linear1_bias.size(0) - 3 * hidden_size)
-            q_bias, k_bias, v_bias, mlp_bias = torch.split(linear1_bias, split_size, dim=0)
-            new_key = key.replace("single_blocks", "single_transformer_blocks").removesuffix(".linear1.bias")
-            state_dict[f"{new_key}.attn.to_q.bias"] = q_bias
-            state_dict[f"{new_key}.attn.to_k.bias"] = k_bias
-            state_dict[f"{new_key}.attn.to_v.bias"] = v_bias
-            state_dict[f"{new_key}.proj_mlp.bias"] = mlp_bias
-
-        else:
-            new_key = key.replace("single_blocks", "single_transformer_blocks")
-            new_key = new_key.replace("linear2", "proj_out")
-            new_key = new_key.replace("q_norm", "attn.norm_q")
-            new_key = new_key.replace("k_norm", "attn.norm_k")
-            state_dict[new_key] = state_dict.pop(key)
-
-    TRANSFORMER_KEYS_RENAME_DICT = {
-        "img_in": "x_embedder",
-        "time_in.mlp.0": "time_text_embed.timestep_embedder.linear_1",
-        "time_in.mlp.2": "time_text_embed.timestep_embedder.linear_2",
-        "guidance_in.mlp.0": "time_text_embed.guidance_embedder.linear_1",
-        "guidance_in.mlp.2": "time_text_embed.guidance_embedder.linear_2",
-        "vector_in.in_layer": "time_text_embed.text_embedder.linear_1",
-        "vector_in.out_layer": "time_text_embed.text_embedder.linear_2",
-        "double_blocks": "transformer_blocks",
-        "img_attn_q_norm": "attn.norm_q",
-        "img_attn_k_norm": "attn.norm_k",
-        "img_attn_proj": "attn.to_out.0",
-        "txt_attn_q_norm": "attn.norm_added_q",
-        "txt_attn_k_norm": "attn.norm_added_k",
-        "txt_attn_proj": "attn.to_add_out",
-        "img_mod.linear": "norm1.linear",
-        "img_norm1": "norm1.norm",
-        "img_norm2": "norm2",
-        "img_mlp": "ff",
-        "txt_mod.linear": "norm1_context.linear",
-        "txt_norm1": "norm1.norm",
-        "txt_norm2": "norm2_context",
-        "txt_mlp": "ff_context",
-        "self_attn_proj": "attn.to_out.0",
-        "modulation.linear": "norm.linear",
-        "pre_norm": "norm.norm",
-        "final_layer.norm_final": "norm_out.norm",
-        "final_layer.linear": "proj_out",
-        "fc1": "net.0.proj",
-        "fc2": "net.2",
-        "input_embedder": "proj_in",
-    }
-
-    TRANSFORMER_SPECIAL_KEYS_REMAP = {
-        "txt_in": remap_txt_in_,
-        "img_attn_qkv": remap_img_attn_qkv_,
-        "txt_attn_qkv": remap_txt_attn_qkv_,
-        "single_blocks": remap_single_transformer_blocks_,
-        "final_layer.adaLN_modulation.1": remap_norm_scale_shift_,
-    }
-
-    def update_state_dict_(state_dict, old_key, new_key):
-        state_dict[new_key] = state_dict.pop(old_key)
-
-    for key in list(checkpoint.keys()):
-        new_key = key[:]
-        for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
-            new_key = new_key.replace(replace_key, rename_key)
-        update_state_dict_(checkpoint, key, new_key)
-
-    for key in list(checkpoint.keys()):
-        for special_key, handler_fn_inplace in TRANSFORMER_SPECIAL_KEYS_REMAP.items():
-            if special_key not in key:
-                continue
-            handler_fn_inplace(key, checkpoint)
-
-    return checkpoint
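Review note: the removed converter above works in two passes. It first renames keys by substring replacement, then lets special handlers rewrite entries in place, for example splitting a fused QKV weight. The sketch below shows that pattern on a toy checkpoint, with plain Python lists standing in for tensors. The shortened rename table, `split_qkv_`, and `convert` are illustrative assumptions, not the library's code.

```python
RENAME = {"double_blocks": "transformer_blocks", "img_mlp": "ff"}


def split_qkv_(key, state_dict):
    """Split a fused QKV entry into three equal parts (lists stand in for tensors)."""
    weight = state_dict.pop(key)
    third = len(weight) // 3
    for i, name in enumerate(("to_q", "to_k", "to_v")):
        new_key = key.replace("img_attn_qkv", f"attn.{name}")
        state_dict[new_key] = weight[i * third : (i + 1) * third]


SPECIAL = {"img_attn_qkv": split_qkv_}


def convert(checkpoint):
    # Pass 1: plain substring renames. Iterate over a snapshot of the keys
    # because the dict is mutated inside the loop.
    for key in list(checkpoint.keys()):
        new_key = key
        for old, new in RENAME.items():
            new_key = new_key.replace(old, new)
        checkpoint[new_key] = checkpoint.pop(key)

    # Pass 2: in-place handlers for keys that need more than a rename.
    for key in list(checkpoint.keys()):
        for special_key, handler in SPECIAL.items():
            if special_key in key:
                handler(key, checkpoint)
    return checkpoint
```

Taking a snapshot with `list(checkpoint.keys())` before each pass is what makes it safe to pop and insert entries while looping.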
@@ -1,181 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from contextlib import nullcontext
-
-from ..models.embeddings import (
-    ImageProjection,
-    MultiIPAdapterImageProjection,
-)
-from ..models.modeling_utils import load_model_dict_into_meta
-from ..utils import (
-    is_accelerate_available,
-    is_torch_version,
-    logging,
-)
-
-
-if is_accelerate_available():
-    pass
-
-logger = logging.get_logger(__name__)
-
-
-class FluxTransformer2DLoadersMixin:
-    """
-    Load layers into a [`FluxTransformer2DModel`].
-    """
-
-    def _convert_ip_adapter_image_proj_to_diffusers(self, state_dict, low_cpu_mem_usage=False):
-        if low_cpu_mem_usage:
-            if is_accelerate_available():
-                from accelerate import init_empty_weights
-
-            else:
-                low_cpu_mem_usage = False
-                logger.warning(
-                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                    " install accelerate\n```\n."
-                )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        updated_state_dict = {}
-        image_projection = None
-        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
-
-        if "proj.weight" in state_dict:
-            # IP-Adapter
-            num_image_text_embeds = 4
-            if state_dict["proj.weight"].shape[0] == 65536:
-                num_image_text_embeds = 16
-            clip_embeddings_dim = state_dict["proj.weight"].shape[-1]
-            cross_attention_dim = state_dict["proj.weight"].shape[0] // num_image_text_embeds
-
-            with init_context():
-                image_projection = ImageProjection(
-                    cross_attention_dim=cross_attention_dim,
-                    image_embed_dim=clip_embeddings_dim,
-                    num_image_text_embeds=num_image_text_embeds,
-                )
-
-            for key, value in state_dict.items():
-                diffusers_name = key.replace("proj", "image_embeds")
-                updated_state_dict[diffusers_name] = value
-
-        if not low_cpu_mem_usage:
-            image_projection.load_state_dict(updated_state_dict, strict=True)
-        else:
-            load_model_dict_into_meta(image_projection, updated_state_dict, device=self.device, dtype=self.dtype)
-
-        return image_projection
-
-    def _convert_ip_adapter_attn_to_diffusers(self, state_dicts, low_cpu_mem_usage=False):
-        from ..models.attention_processor import (
-            FluxIPAdapterJointAttnProcessor2_0,
-        )
-
-        if low_cpu_mem_usage:
-            if is_accelerate_available():
-                from accelerate import init_empty_weights
-
-            else:
-                low_cpu_mem_usage = False
-                logger.warning(
-                    "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the"
-                    " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install"
-                    " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip"
-                    " install accelerate\n```\n."
-                )
-
-        if low_cpu_mem_usage is True and not is_torch_version(">=", "1.9.0"):
-            raise NotImplementedError(
-                "Low memory initialization requires torch >= 1.9.0. Please either update your PyTorch version or set"
-                " `low_cpu_mem_usage=False`."
-            )
-
-        # set ip-adapter cross-attention processors & load state_dict
-        attn_procs = {}
-        key_id = 0
-        init_context = init_empty_weights if low_cpu_mem_usage else nullcontext
-        for name in self.attn_processors.keys():
-            if name.startswith("single_transformer_blocks"):
-                attn_processor_class = self.attn_processors[name].__class__
-                attn_procs[name] = attn_processor_class()
-            else:
-                cross_attention_dim = self.config.joint_attention_dim
-                hidden_size = self.inner_dim
-                attn_processor_class = FluxIPAdapterJointAttnProcessor2_0
-                num_image_text_embeds = []
-                for state_dict in state_dicts:
-                    if "proj.weight" in state_dict["image_proj"]:
-                        num_image_text_embed = 4
-                        if state_dict["image_proj"]["proj.weight"].shape[0] == 65536:
-                            num_image_text_embed = 16
-                        # IP-Adapter
-                        num_image_text_embeds += [num_image_text_embed]
-
-                with init_context():
-                    attn_procs[name] = attn_processor_class(
-                        hidden_size=hidden_size,
-                        cross_attention_dim=cross_attention_dim,
-                        scale=1.0,
-                        num_tokens=num_image_text_embeds,
-                        dtype=self.dtype,
-                        device=self.device,
-                    )
-
-                value_dict = {}
-                for i, state_dict in enumerate(state_dicts):
-                    value_dict.update({f"to_k_ip.{i}.weight": state_dict["ip_adapter"][f"{key_id}.to_k_ip.weight"]})
-                    value_dict.update({f"to_v_ip.{i}.weight": state_dict["ip_adapter"][f"{key_id}.to_v_ip.weight"]})
-                    value_dict.update({f"to_k_ip.{i}.bias": state_dict["ip_adapter"][f"{key_id}.to_k_ip.bias"]})
-                    value_dict.update({f"to_v_ip.{i}.bias": state_dict["ip_adapter"][f"{key_id}.to_v_ip.bias"]})
-
-                if not low_cpu_mem_usage:
-                    attn_procs[name].load_state_dict(value_dict)
-                else:
-                    device = self.device
-                    dtype = self.dtype
-                    load_model_dict_into_meta(attn_procs[name], value_dict, device=device, dtype=dtype)
-
-                key_id += 1
-
-        return attn_procs
-
-    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=False):
-        if not isinstance(state_dicts, list):
-            state_dicts = [state_dicts]
-
-        self.encoder_hid_proj = None
-
-        attn_procs = self._convert_ip_adapter_attn_to_diffusers(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
-        self.set_attn_processor(attn_procs)
-
-        image_projection_layers = []
-        for state_dict in state_dicts:
-            image_projection_layer = self._convert_ip_adapter_image_proj_to_diffusers(
-                state_dict["image_proj"], low_cpu_mem_usage=low_cpu_mem_usage
-            )
-            image_projection_layers.append(image_projection_layer)
-
-        self.encoder_hid_proj = MultiIPAdapterImageProjection(image_projection_layers)
-        self.config.encoder_hid_dim_type = "ip_image_proj"
-
-        self.to(dtype=self.dtype, device=self.device)
@@ -1,89 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from typing import Dict
-
-from ..models.attention_processor import SD3IPAdapterJointAttnProcessor2_0
-from ..models.embeddings import IPAdapterTimeImageProjection
-from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
-
-
-class SD3Transformer2DLoadersMixin:
-    """Load IP-Adapters and LoRA layers into a `[SD3Transformer2DModel]`."""
-
-    def _load_ip_adapter_weights(self, state_dict: Dict, low_cpu_mem_usage: bool = _LOW_CPU_MEM_USAGE_DEFAULT) -> None:
-        """Sets IP-Adapter attention processors, image projection, and loads state_dict.
-
-        Args:
-            state_dict (`Dict`):
-                State dict with keys "ip_adapter", which contains parameters for attention processors, and
-                "image_proj", which contains parameters for image projection net.
-            low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`):
-                Speed up model loading by only loading the pretrained weights and not initializing the weights. This also
-                tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model.
-                Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this
-                argument to `True` will raise an error.
-        """
-        # IP-Adapter cross attention parameters
-        hidden_size = self.config.attention_head_dim * self.config.num_attention_heads
-        ip_hidden_states_dim = self.config.attention_head_dim * self.config.num_attention_heads
-        timesteps_emb_dim = state_dict["ip_adapter"]["0.norm_ip.linear.weight"].shape[1]
-
-        # Dict where key is transformer layer index, value is attention processor's state dict
-        # ip_adapter state dict keys example: "0.norm_ip.linear.weight"
-        layer_state_dict = {idx: {} for idx in range(len(self.attn_processors))}
-        for key, weights in state_dict["ip_adapter"].items():
-            idx, name = key.split(".", maxsplit=1)
-            layer_state_dict[int(idx)][name] = weights
-
-        # Create IP-Adapter attention processor
-        attn_procs = {}
-        for idx, name in enumerate(self.attn_processors.keys()):
-            attn_procs[name] = SD3IPAdapterJointAttnProcessor2_0(
-                hidden_size=hidden_size,
-                ip_hidden_states_dim=ip_hidden_states_dim,
-                head_dim=self.config.attention_head_dim,
-                timesteps_emb_dim=timesteps_emb_dim,
-            ).to(self.device, dtype=self.dtype)
-
-            if not low_cpu_mem_usage:
-                attn_procs[name].load_state_dict(layer_state_dict[idx], strict=True)
-            else:
-                load_model_dict_into_meta(
-                    attn_procs[name], layer_state_dict[idx], device=self.device, dtype=self.dtype
-                )
-
-        self.set_attn_processor(attn_procs)
-
-        # Image projection parameters
-        embed_dim = state_dict["image_proj"]["proj_in.weight"].shape[1]
-        output_dim = state_dict["image_proj"]["proj_out.weight"].shape[0]
-        hidden_dim = state_dict["image_proj"]["proj_in.weight"].shape[0]
-        heads = state_dict["image_proj"]["layers.0.attn.to_q.weight"].shape[0] // 64
-        num_queries = state_dict["image_proj"]["latents"].shape[1]
-        timestep_in_dim = state_dict["image_proj"]["time_embedding.linear_1.weight"].shape[1]
-
-        # Image projection
-        self.image_proj = IPAdapterTimeImageProjection(
-            embed_dim=embed_dim,
-            output_dim=output_dim,
-            hidden_dim=hidden_dim,
-            heads=heads,
-            num_queries=num_queries,
-            timestep_in_dim=timestep_in_dim,
-        ).to(device=self.device, dtype=self.dtype)
-
-        if not low_cpu_mem_usage:
-            self.image_proj.load_state_dict(state_dict["image_proj"], strict=True)
-        else:
-            load_model_dict_into_meta(self.image_proj, state_dict["image_proj"], device=self.device, dtype=self.dtype)
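
Note on the removed loader above: it regroups a flat `ip_adapter` state dict, whose keys look like `"0.norm_ip.linear.weight"`, into one dict per transformer layer by splitting each key on its first dot. A minimal standalone sketch of that regrouping follows. The helper name `group_by_layer` and the placeholder string values are assumptions made for illustration; the original code works on tensors and pre-creates an empty dict for every attention processor instead of using a `defaultdict`.

```python
from collections import defaultdict


def group_by_layer(flat_state_dict):
    """Group "<layer_idx>.<param_name>" keys into one dict per layer index."""
    grouped = defaultdict(dict)
    for key, value in flat_state_dict.items():
        # Split only on the first dot so the parameter name keeps its own dots.
        idx, name = key.split(".", maxsplit=1)
        grouped[int(idx)][name] = value
    return dict(grouped)


# Placeholder values stand in for weight tensors (hypothetical data).
example = {
    "0.norm_ip.linear.weight": "w0",
    "0.to_k_ip.weight": "k0",
    "1.to_k_ip.weight": "k1",
}
layers = group_by_layer(example)
```

With `maxsplit=1`, a key such as `"0.norm_ip.linear.weight"` yields the layer index `0` and the name `"norm_ip.linear.weight"`, which is the form `load_state_dict` expects for a single processor.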
@@ -21,6 +21,7 @@ import safetensors
 import torch
 import torch.nn.functional as F
 from huggingface_hub.utils import validate_hf_hub_args
+from torch import nn

 from ..models.embeddings import (
    ImageProjection,
@@ -43,11 +44,13 @@ from ..utils import (
    is_torch_version,
    logging,
 )
-from .lora_base import _func_optionally_disable_offloading
 from .lora_pipeline import LORA_WEIGHT_NAME, LORA_WEIGHT_NAME_SAFE, TEXT_ENCODER_NAME, UNET_NAME
 from .utils import AttnProcsLayers


+if is_accelerate_available():
+    from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module
+
 logger = logging.get_logger(__name__)


@@ -397,7 +400,27 @@ class UNet2DConditionLoadersMixin:
            tuple:
                A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True.
        """
-        return _func_optionally_disable_offloading(_pipeline=_pipeline)
+        is_model_cpu_offload = False
+        is_sequential_cpu_offload = False
+
+        if _pipeline is not None and _pipeline.hf_device_map is None:
+            for _, component in _pipeline.components.items():
+                if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"):
+                    if not is_model_cpu_offload:
+                        is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload)
+                    if not is_sequential_cpu_offload:
+                        is_sequential_cpu_offload = (
+                            isinstance(component._hf_hook, AlignDevicesHook)
+                            or hasattr(component._hf_hook, "hooks")
+                            and isinstance(component._hf_hook.hooks[0], AlignDevicesHook)
+                        )
+
+                    logger.info(
+                        "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again."
+                    )
+                    remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
+
+        return (is_model_cpu_offload, is_sequential_cpu_offload)

    def save_attn_procs(
        self,
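
Note on the inlined offloading check above: the sequential-offload condition mixes `or` and `and` without parentheses, so it is evaluated as `A or (B and C)`. The sketch below restates that test for a single hook with explicit parentheses. The three classes are local stand-ins defined only for this example, not the real `accelerate.hooks` classes, and `classify_hook` is a hypothetical helper; the original code additionally accumulates the two flags across all pipeline components.

```python
class CpuOffload:
    """Stand-in for the model-offload hook type (illustration only)."""


class AlignDevicesHook:
    """Stand-in for the sequential-offload hook type (illustration only)."""


class SequentialHook:
    """Stand-in for a container hook exposing a `hooks` list."""

    def __init__(self, *hooks):
        self.hooks = list(hooks)


def classify_hook(hook):
    """Return (is_model_cpu_offload, is_sequential_cpu_offload) for one hook."""
    is_model_cpu_offload = isinstance(hook, CpuOffload)
    # `and` binds tighter than `or`; the parentheses only make that explicit.
    is_sequential_cpu_offload = isinstance(hook, AlignDevicesHook) or (
        hasattr(hook, "hooks") and isinstance(hook.hooks[0], AlignDevicesHook)
    )
    return is_model_cpu_offload, is_sequential_cpu_offload
```

Because of the `hasattr` guard, a hook without a `hooks` attribute never reaches the indexing step, so plain hooks are classified without raising.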
@@ -188,13 +188,8 @@ class JointTransformerBlock(nn.Module):
        self._chunk_dim = dim

    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor,
-        temb: torch.FloatTensor,
-        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
+        self, hidden_states: torch.FloatTensor, encoder_hidden_states: torch.FloatTensor, temb: torch.FloatTensor
    ):
-        joint_attention_kwargs = joint_attention_kwargs or {}
        if self.use_dual_attention:
            norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp, norm_hidden_states2, gate_msa2 = self.norm1(
                hidden_states, emb=temb
@@ -211,9 +206,7 @@ class JointTransformerBlock(nn.Module):

        # Attention.
        attn_output, context_attn_output = self.attn(
-            hidden_states=norm_hidden_states,
-            encoder_hidden_states=norm_encoder_hidden_states,
-            **joint_attention_kwargs,
+            hidden_states=norm_hidden_states, encoder_hidden_states=norm_encoder_hidden_states
        )

        # Process attention outputs for the `hidden_states`.
@@ -221,7 +214,7 @@ class JointTransformerBlock(nn.Module):
        hidden_states = hidden_states + attn_output

        if self.use_dual_attention:
-            attn_output2 = self.attn2(hidden_states=norm_hidden_states2, **joint_attention_kwargs)
+            attn_output2 = self.attn2(hidden_states=norm_hidden_states2)
            attn_output2 = gate_msa2.unsqueeze(1) * attn_output2
            hidden_states = hidden_states + attn_output2

@@ -575,7 +575,7 @@ class Attention(nn.Module):
        # For standard processors that are defined here, `**cross_attention_kwargs` is empty

        attn_parameters = set(inspect.signature(self.processor.__call__).parameters.keys())
-        quiet_attn_parameters = {"ip_adapter_masks", "ip_hidden_states"}
+        quiet_attn_parameters = {"ip_adapter_masks"}
        unused_kwargs = [
            k for k, _ in cross_attention_kwargs.items() if k not in attn_parameters and k not in quiet_attn_parameters
        ]
@@ -2653,149 +2653,6 @@ class FusedFluxAttnProcessor2_0_NPU:
            return hidden_states


-class FluxIPAdapterJointAttnProcessor2_0(torch.nn.Module):
-    """Flux Attention processor for IP-Adapter."""
-
-    def __init__(
-        self, hidden_size: int, cross_attention_dim: int, num_tokens=(4,), scale=1.0, device=None, dtype=None
-    ):
-        super().__init__()
-
-        if not hasattr(F, "scaled_dot_product_attention"):
-            raise ImportError(
-                f"{self.__class__.__name__} requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0."
-            )
-
-        self.hidden_size = hidden_size
-        self.cross_attention_dim = cross_attention_dim
-
-        if not isinstance(num_tokens, (tuple, list)):
-            num_tokens = [num_tokens]
-
-        if not isinstance(scale, list):
-            scale = [scale] * len(num_tokens)
-        if len(scale) != len(num_tokens):
-            raise ValueError("`scale` should be a list of floats with the same length as `num_tokens`.")
-        self.scale = scale
-
-        self.to_k_ip = nn.ModuleList(
-            [
-                nn.Linear(cross_attention_dim, hidden_size, bias=True, device=device, dtype=dtype)
-                for _ in range(len(num_tokens))
-            ]
-        )
-        self.to_v_ip = nn.ModuleList(
-            [
-                nn.Linear(cross_attention_dim, hidden_size, bias=True, device=device, dtype=dtype)
-                for _ in range(len(num_tokens))
-            ]
-        )
-
-    def __call__(
-        self,
-        attn: Attention,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor = None,
-        attention_mask: Optional[torch.FloatTensor] = None,
-        image_rotary_emb: Optional[torch.Tensor] = None,
-        ip_hidden_states: Optional[List[torch.Tensor]] = None,
-        ip_adapter_masks: Optional[torch.Tensor] = None,
-    ) -> torch.FloatTensor:
-        batch_size, _, _ = hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
-
-        # `sample` projections.
-        hidden_states_query_proj = attn.to_q(hidden_states)
-        key = attn.to_k(hidden_states)
-        value = attn.to_v(hidden_states)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // attn.heads
-
-        hidden_states_query_proj = hidden_states_query_proj.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        if attn.norm_q is not None:
-            hidden_states_query_proj = attn.norm_q(hidden_states_query_proj)
-        if attn.norm_k is not None:
-            key = attn.norm_k(key)
-
-        # the attention in FluxSingleTransformerBlock does not use `encoder_hidden_states`
-        if encoder_hidden_states is not None:
-            # `context` projections.
-            encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states)
-            encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states)
-            encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states)
-
-            encoder_hidden_states_query_proj = encoder_hidden_states_query_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-            encoder_hidden_states_key_proj = encoder_hidden_states_key_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-            encoder_hidden_states_value_proj = encoder_hidden_states_value_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-
-            if attn.norm_added_q is not None:
-                encoder_hidden_states_query_proj = attn.norm_added_q(encoder_hidden_states_query_proj)
-            if attn.norm_added_k is not None:
-                encoder_hidden_states_key_proj = attn.norm_added_k(encoder_hidden_states_key_proj)
-
-            # attention
-            query = torch.cat([encoder_hidden_states_query_proj, hidden_states_query_proj], dim=2)
-            key = torch.cat([encoder_hidden_states_key_proj, key], dim=2)
-            value = torch.cat([encoder_hidden_states_value_proj, value], dim=2)
-
-        if image_rotary_emb is not None:
-            from .embeddings import apply_rotary_emb
-
-            query = apply_rotary_emb(query, image_rotary_emb)
-            key = apply_rotary_emb(key, image_rotary_emb)
-
-        hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
-        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
-        hidden_states = hidden_states.to(query.dtype)
-
-        if encoder_hidden_states is not None:
-            encoder_hidden_states, hidden_states = (
-                hidden_states[:, : encoder_hidden_states.shape[1]],
-                hidden_states[:, encoder_hidden_states.shape[1] :],
-            )
-
-            # linear proj
-            hidden_states = attn.to_out[0](hidden_states)
-            # dropout
-            hidden_states = attn.to_out[1](hidden_states)
-            encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
-
-            # IP-adapter
-            ip_query = hidden_states_query_proj
-            ip_attn_output = None
-            # for ip-adapter
-            # TODO: support for multiple adapters
-            for current_ip_hidden_states, scale, to_k_ip, to_v_ip in zip(
-                ip_hidden_states, self.scale, self.to_k_ip, self.to_v_ip
-            ):
-                ip_key = to_k_ip(current_ip_hidden_states)
-                ip_value = to_v_ip(current_ip_hidden_states)
-
-                ip_key = ip_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-                ip_value = ip_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-                # the output of sdp = (batch, num_heads, seq_len, head_dim)
-                # TODO: add support for attn.scale when we move to Torch 2.1
-                ip_attn_output = F.scaled_dot_product_attention(
-                    ip_query, ip_key, ip_value, attn_mask=None, dropout_p=0.0, is_causal=False
-                )
-                ip_attn_output = ip_attn_output.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
-                ip_attn_output = scale * ip_attn_output
-                ip_attn_output = ip_attn_output.to(ip_query.dtype)
-
-            return hidden_states, encoder_hidden_states, ip_attn_output
-        else:
-            return hidden_states
-
-
 class CogVideoXAttnProcessor2_0:
    r"""
    Processor for implementing scaled dot-product attention for the CogVideoX model. It applies a rotary embedding on
@@ -5386,177 +5243,6 @@ class IPAdapterXFormersAttnProcessor(torch.nn.Module):
        return hidden_states


-class SD3IPAdapterJointAttnProcessor2_0(torch.nn.Module):
-    """
-    Attention processor for IP-Adapter, typically used for SD3-like self-attention projections with additional
-    image-based information and timestep embeddings.
-
-    Args:
-        hidden_size (`int`):
-            The number of hidden channels.
-        ip_hidden_states_dim (`int`):
-            The image feature dimension.
-        head_dim (`int`):
-            The number of head channels.
-        timesteps_emb_dim (`int`, defaults to 1280):
-            The number of input channels for timestep embedding.
-        scale (`float`, defaults to 0.5):
-            IP-Adapter scale.
-    """
-
-    def __init__(
-        self,
-        hidden_size: int,
-        ip_hidden_states_dim: int,
-        head_dim: int,
-        timesteps_emb_dim: int = 1280,
-        scale: float = 0.5,
-    ):
-        super().__init__()
-
-        # To prevent circular import
-        from .normalization import AdaLayerNorm, RMSNorm
-
-        self.norm_ip = AdaLayerNorm(timesteps_emb_dim, output_dim=ip_hidden_states_dim * 2, norm_eps=1e-6, chunk_dim=1)
-        self.to_k_ip = nn.Linear(ip_hidden_states_dim, hidden_size, bias=False)
-        self.to_v_ip = nn.Linear(ip_hidden_states_dim, hidden_size, bias=False)
-        self.norm_q = RMSNorm(head_dim, 1e-6)
-        self.norm_k = RMSNorm(head_dim, 1e-6)
-        self.norm_ip_k = RMSNorm(head_dim, 1e-6)
-        self.scale = scale
-
-    def __call__(
-        self,
-        attn: Attention,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor = None,
-        attention_mask: Optional[torch.FloatTensor] = None,
-        ip_hidden_states: torch.FloatTensor = None,
-        temb: torch.FloatTensor = None,
-    ) -> torch.FloatTensor:
-        """
-        Perform the attention computation, integrating image features (if provided) and timestep embeddings.
-
-        If `ip_hidden_states` is `None`, this is equivalent to using JointAttnProcessor2_0.
-
-        Args:
-            attn (`Attention`):
-                Attention instance.
-            hidden_states (`torch.FloatTensor`):
-                Input `hidden_states`.
-            encoder_hidden_states (`torch.FloatTensor`, *optional*):
-                The encoder hidden states.
-            attention_mask (`torch.FloatTensor`, *optional*):
-                Attention mask.
-            ip_hidden_states (`torch.FloatTensor`, *optional*):
-                Image embeddings.
-            temb (`torch.FloatTensor`, *optional*):
-                Timestep embeddings.
-
-        Returns:
-            `torch.FloatTensor`: Output hidden states.
-        """
-        residual = hidden_states
-
-        batch_size = hidden_states.shape[0]
-
-        # `sample` projections.
-        query = attn.to_q(hidden_states)
-        key = attn.to_k(hidden_states)
-        value = attn.to_v(hidden_states)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // attn.heads
-
-        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        img_query = query
-        img_key = key
-        img_value = value
-
-        if attn.norm_q is not None:
-            query = attn.norm_q(query)
-        if attn.norm_k is not None:
-            key = attn.norm_k(key)
-
-        # `context` projections.
-        if encoder_hidden_states is not None:
-            encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states)
-            encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states)
-            encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states)
-
-            encoder_hidden_states_query_proj = encoder_hidden_states_query_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-            encoder_hidden_states_key_proj = encoder_hidden_states_key_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-            encoder_hidden_states_value_proj = encoder_hidden_states_value_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-
-            if attn.norm_added_q is not None:
-                encoder_hidden_states_query_proj = attn.norm_added_q(encoder_hidden_states_query_proj)
-            if attn.norm_added_k is not None:
-                encoder_hidden_states_key_proj = attn.norm_added_k(encoder_hidden_states_key_proj)
-
-            query = torch.cat([query, encoder_hidden_states_query_proj], dim=2)
-            key = torch.cat([key, encoder_hidden_states_key_proj], dim=2)
-            value = torch.cat([value, encoder_hidden_states_value_proj], dim=2)
-
-        hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
-        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
-        hidden_states = hidden_states.to(query.dtype)
-
-        if encoder_hidden_states is not None:
-            # Split the attention outputs.
-            hidden_states, encoder_hidden_states = (
-                hidden_states[:, : residual.shape[1]],
-                hidden_states[:, residual.shape[1] :],
-            )
-            if not attn.context_pre_only:
-                encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
-
-        # IP Adapter
-        if self.scale != 0 and ip_hidden_states is not None:
-            # Norm image features
-            norm_ip_hidden_states = self.norm_ip(ip_hidden_states, temb=temb)
-
-            # To k and v
-            ip_key = self.to_k_ip(norm_ip_hidden_states)
-            ip_value = self.to_v_ip(norm_ip_hidden_states)
-
-            # Reshape
-            ip_key = ip_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-            ip_value = ip_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-            # Norm
-            query = self.norm_q(img_query)
-            img_key = self.norm_k(img_key)
-            ip_key = self.norm_ip_k(ip_key)
-
-            # cat img
-            key = torch.cat([img_key, ip_key], dim=2)
-            value = torch.cat([img_value, ip_value], dim=2)
-
-            ip_hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
-            ip_hidden_states = ip_hidden_states.transpose(1, 2).view(batch_size, -1, attn.heads * head_dim)
-            ip_hidden_states = ip_hidden_states.to(query.dtype)
-
-            hidden_states = hidden_states + ip_hidden_states * self.scale
-
-        # linear proj
-        hidden_states = attn.to_out[0](hidden_states)
-        # dropout
-        hidden_states = attn.to_out[1](hidden_states)
-
-        if encoder_hidden_states is not None:
-            return hidden_states, encoder_hidden_states
-        else:
-            return hidden_states
-
-
 class PAGIdentitySelfAttnProcessor2_0:
    r"""
    Processor for implementing PAG using scaled dot-product attention (enabled by default if you're using PyTorch 2.0).
@@ -6039,7 +5725,6 @@ CROSS_ATTENTION_PROCESSORS = (
    SlicedAttnProcessor,
    IPAdapterAttnProcessor,
    IPAdapterAttnProcessor2_0,
-    FluxIPAdapterJointAttnProcessor2_0,
 )

 AttentionProcessor = Union[
@@ -6087,7 +5772,6 @@ AttentionProcessor = Union[
    IPAdapterAttnProcessor,
    IPAdapterAttnProcessor2_0,
    IPAdapterXFormersAttnProcessor,
-    SD3IPAdapterJointAttnProcessor2_0,
    PAGIdentitySelfAttnProcessor2_0,
    PAGCFGIdentitySelfAttnProcessor2_0,
    LoRAAttnProcessor,
@@ -168,7 +168,6 @@ class HunyuanVideoResnetBlockCausal3D(nn.Module):
            self.conv_shortcut = HunyuanVideoCausalConv3d(in_channels, out_channels, 1, 1, 0)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        hidden_states = hidden_states.contiguous()
        residual = hidden_states

        hidden_states = self.norm1(hidden_states)
@@ -793,12 +792,12 @@ class AutoencoderKLHunyuanVideo(ModelMixin, ConfigMixin):
        # The minimal tile height and width for spatial tiling to be used
        self.tile_sample_min_height = 256
        self.tile_sample_min_width = 256
-        self.tile_sample_min_num_frames = 16
+        self.tile_sample_min_num_frames = 64

        # The minimal distance between two spatial tiles
        self.tile_sample_stride_height = 192
        self.tile_sample_stride_width = 192
-        self.tile_sample_stride_num_frames = 12
+        self.tile_sample_stride_num_frames = 48

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, (HunyuanVideoEncoder3D, HunyuanVideoDecoder3D)):
@@ -1004,7 +1003,7 @@ class AutoencoderKLHunyuanVideo(ModelMixin, ConfigMixin):
        for i in range(0, height, self.tile_sample_stride_height):
            row = []
            for j in range(0, width, self.tile_sample_stride_width):
-                tile = x[:, :, :, i : i + self.tile_sample_min_height, j : j + self.tile_sample_min_width]
+                tile = x[:, :, :, i : i + self.tile_sample_min_size, j : j + self.tile_sample_min_size]
                tile = self.encoder(tile)
                tile = self.quant_conv(tile)
                row.append(tile)
@@ -1021,7 +1020,7 @@ class AutoencoderKLHunyuanVideo(ModelMixin, ConfigMixin):
                if j > 0:
                    tile = self.blend_h(row[j - 1], tile, blend_width)
                result_row.append(tile[:, :, :, :tile_latent_stride_height, :tile_latent_stride_width])
-            result_rows.append(torch.cat(result_row, dim=4))
+            result_rows.append(torch.cat(result_row, dim=-1))

        enc = torch.cat(result_rows, dim=3)[:, :, :, :latent_height, :latent_width]
        return enc
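
Note on the tiled encode loop above: tiles start every `stride` pixels but span `tile_size` pixels, so neighbouring tiles overlap and the overlap is later blended. The sketch below lists the index ranges such a loop visits along one axis, using the spatial values that appear in this diff (tile size 256, stride 192). `tile_ranges` is a hypothetical helper written for this note, not a function of the library.

```python
def tile_ranges(length, tile_size, stride):
    """Return (start, stop) pairs along one axis, clipped the way slicing clips."""
    return [
        (start, min(start + tile_size, length))
        for start in range(0, length, stride)
    ]


ranges = tile_ranges(512, tile_size=256, stride=192)
overlap = 256 - 192  # pixels shared by two neighbouring full-size tiles
```

For a 512-pixel axis this gives three tiles, the last of which is cut short by the image boundary; adjacent full tiles share 64 pixels.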
@@ -22,14 +22,13 @@ from ...configuration_utils import ConfigMixin, register_to_config
 from ...loaders import FromOriginalModelMixin
 from ...utils.accelerate_utils import apply_forward_hook
 from ..activations import get_activation
-from ..embeddings import PixArtAlphaCombinedTimestepSizeEmbeddings
 from ..modeling_outputs import AutoencoderKLOutput
 from ..modeling_utils import ModelMixin
 from ..normalization import RMSNorm
 from .vae import DecoderOutput, DiagonalGaussianDistribution


-class LTXVideoCausalConv3d(nn.Module):
+class LTXCausalConv3d(nn.Module):
    def __init__(
        self,
        in_channels: int,
@@ -80,9 +79,9 @@ class LTXVideoCausalConv3d(nn.Module):
        return hidden_states


-class LTXVideoResnetBlock3d(nn.Module):
+class LTXResnetBlock3d(nn.Module):
    r"""
-    A 3D ResNet block used in the LTXVideo model.
+    A 3D ResNet block used in the LTX model.

    Args:
        in_channels (`int`):
@@ -110,9 +109,7 @@ class LTXVideoResnetBlock3d(nn.Module):
        elementwise_affine: bool = False,
        non_linearity: str = "swish",
        is_causal: bool = True,
-        inject_noise: bool = False,
-        timestep_conditioning: bool = False,
-    ) -> None:
+    ):
        super().__init__()

        out_channels = out_channels or in_channels
@@ -120,13 +117,13 @@ class LTXVideoResnetBlock3d(nn.Module):
        self.nonlinearity = get_activation(non_linearity)

        self.norm1 = RMSNorm(in_channels, eps=1e-8, elementwise_affine=elementwise_affine)
-        self.conv1 = LTXVideoCausalConv3d(
+        self.conv1 = LTXCausalConv3d(
            in_channels=in_channels, out_channels=out_channels, kernel_size=3, is_causal=is_causal
        )

        self.norm2 = RMSNorm(out_channels, eps=1e-8, elementwise_affine=elementwise_affine)
        self.dropout = nn.Dropout(dropout)
-        self.conv2 = LTXVideoCausalConv3d(
+        self.conv2 = LTXCausalConv3d(
            in_channels=out_channels, out_channels=out_channels, kernel_size=3, is_causal=is_causal
        )

@@ -134,58 +131,22 @@ class LTXVideoResnetBlock3d(nn.Module):
        self.conv_shortcut = None
        if in_channels != out_channels:
            self.norm3 = nn.LayerNorm(in_channels, eps=eps, elementwise_affine=True, bias=True)
-            self.conv_shortcut = LTXVideoCausalConv3d(
+            self.conv_shortcut = LTXCausalConv3d(
                in_channels=in_channels, out_channels=out_channels, kernel_size=1, stride=1, is_causal=is_causal
            )

-        self.per_channel_scale1 = None
-        self.per_channel_scale2 = None
-        if inject_noise:
-            self.per_channel_scale1 = nn.Parameter(torch.zeros(in_channels, 1, 1))
-            self.per_channel_scale2 = nn.Parameter(torch.zeros(in_channels, 1, 1))
-
-        self.scale_shift_table = None
-        if timestep_conditioning:
-            self.scale_shift_table = nn.Parameter(torch.randn(4, in_channels) / in_channels**0.5)
-
-    def forward(
-        self, inputs: torch.Tensor, temb: Optional[torch.Tensor] = None, generator: Optional[torch.Generator] = None
-    ) -> torch.Tensor:
+    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        hidden_states = inputs

        hidden_states = self.norm1(hidden_states.movedim(1, -1)).movedim(-1, 1)
-
-        if self.scale_shift_table is not None:
-            temb = temb.unflatten(1, (4, -1)) + self.scale_shift_table[None, ..., None, None, None]
-            shift_1, scale_1, shift_2, scale_2 = temb.unbind(dim=1)
-            hidden_states = hidden_states * (1 + scale_1) + shift_1
-
        hidden_states = self.nonlinearity(hidden_states)
        hidden_states = self.conv1(hidden_states)

-        if self.per_channel_scale1 is not None:
-            spatial_shape = hidden_states.shape[-2:]
-            spatial_noise = torch.randn(
-                spatial_shape, generator=generator, device=hidden_states.device, dtype=hidden_states.dtype
-            )[None]
-            hidden_states = hidden_states + (spatial_noise * self.per_channel_scale1)[None, :, None, ...]
-
        hidden_states = self.norm2(hidden_states.movedim(1, -1)).movedim(-1, 1)
-
-        if self.scale_shift_table is not None:
-            hidden_states = hidden_states * (1 + scale_2) + shift_2
-
        hidden_states = self.nonlinearity(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.conv2(hidden_states)

-        if self.per_channel_scale2 is not None:
-            spatial_shape = hidden_states.shape[-2:]
-            spatial_noise = torch.randn(
-                spatial_shape, generator=generator, device=hidden_states.device, dtype=hidden_states.dtype
-            )[None]
-            hidden_states = hidden_states + (spatial_noise * self.per_channel_scale2)[None, :, None, ...]
-
        if self.norm3 is not None:
            inputs = self.norm3(inputs.movedim(1, -1)).movedim(-1, 1)

@@ -196,24 +157,20 @@ class LTXVideoResnetBlock3d(nn.Module):
        return hidden_states


-class LTXVideoUpsampler3d(nn.Module):
+class LTXUpsampler3d(nn.Module):
    def __init__(
        self,
        in_channels: int,
        stride: Union[int, Tuple[int, int, int]] = 1,
        is_causal: bool = True,
-        residual: bool = False,
-        upscale_factor: int = 1,
    ) -> None:
        super().__init__()

        self.stride = stride if isinstance(stride, tuple) else (stride, stride, stride)
-        self.residual = residual
-        self.upscale_factor = upscale_factor

-        out_channels = (in_channels * stride[0] * stride[1] * stride[2]) // upscale_factor
+        out_channels = in_channels * stride[0] * stride[1] * stride[2]

-        self.conv = LTXVideoCausalConv3d(
+        self.conv = LTXCausalConv3d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=3,
@@ -224,15 +181,6 @@ class LTXVideoUpsampler3d(nn.Module):
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch_size, num_channels, num_frames, height, width = hidden_states.shape

-        if self.residual:
-            residual = hidden_states.reshape(
-                batch_size, -1, self.stride[0], self.stride[1], self.stride[2], num_frames, height, width
-            )
-            residual = residual.permute(0, 1, 5, 2, 6, 3, 7, 4).flatten(6, 7).flatten(4, 5).flatten(2, 3)
-            repeats = (self.stride[0] * self.stride[1] * self.stride[2]) // self.upscale_factor
-            residual = residual.repeat(1, repeats, 1, 1, 1)
-            residual = residual[:, :, self.stride[0] - 1 :]
-
        hidden_states = self.conv(hidden_states)
        hidden_states = hidden_states.reshape(
            batch_size, -1, self.stride[0], self.stride[1], self.stride[2], num_frames, height, width
@@ -240,15 +188,12 @@ class LTXVideoUpsampler3d(nn.Module):
        hidden_states = hidden_states.permute(0, 1, 5, 2, 6, 3, 7, 4).flatten(6, 7).flatten(4, 5).flatten(2, 3)
        hidden_states = hidden_states[:, :, self.stride[0] - 1 :]

-        if self.residual:
-            hidden_states = hidden_states + residual
-
        return hidden_states


-class LTXVideoDownBlock3D(nn.Module):
+class LTXDownBlock3D(nn.Module):
    r"""
-    Down block used in the LTXVideo model.
+    Down block used in the LTX model.

    Args:
        in_channels (`int`):
@@ -290,7 +235,7 @@ class LTXVideoDownBlock3D(nn.Module):
        resnets = []
        for _ in range(num_layers):
            resnets.append(
-                LTXVideoResnetBlock3d(
+                LTXResnetBlock3d(
                    in_channels=in_channels,
                    out_channels=in_channels,
                    dropout=dropout,
@@ -305,7 +250,7 @@ class LTXVideoDownBlock3D(nn.Module):
        if spatio_temporal_scale:
            self.downsamplers = nn.ModuleList(
                [
-                    LTXVideoCausalConv3d(
+                    LTXCausalConv3d(
                        in_channels=in_channels,
                        out_channels=in_channels,
                        kernel_size=3,
@@ -317,7 +262,7 @@ class LTXVideoDownBlock3D(nn.Module):

        self.conv_out = None
        if in_channels != out_channels:
-            self.conv_out = LTXVideoResnetBlock3d(
+            self.conv_out = LTXResnetBlock3d(
                in_channels=in_channels,
                out_channels=out_channels,
                dropout=dropout,
@@ -328,12 +273,7 @@ class LTXVideoDownBlock3D(nn.Module):

        self.gradient_checkpointing = False

-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        temb: Optional[torch.Tensor] = None,
-        generator: Optional[torch.Generator] = None,
-    ) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        r"""Forward method of the `LTXDownBlock3D` class."""

        for i, resnet in enumerate(self.resnets):
@@ -345,26 +285,24 @@ class LTXVideoDownBlock3D(nn.Module):

                    return create_forward

-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(resnet), hidden_states, temb, generator
-                )
+                hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states)
            else:
-                hidden_states = resnet(hidden_states, temb, generator)
+                hidden_states = resnet(hidden_states)

        if self.downsamplers is not None:
            for downsampler in self.downsamplers:
                hidden_states = downsampler(hidden_states)

        if self.conv_out is not None:
-            hidden_states = self.conv_out(hidden_states, temb, generator)
+            hidden_states = self.conv_out(hidden_states)

        return hidden_states


 # Adapted from diffusers.models.autoencoders.autoencoder_kl_cogvideox.CogVideoMidBlock3d
-class LTXVideoMidBlock3d(nn.Module):
+class LTXMidBlock3d(nn.Module):
    r"""
-    A middle block used in the LTXVideo model.
+    A middle block used in the LTX model.

    Args:
        in_channels (`int`):
@@ -391,51 +329,28 @@ class LTXVideoMidBlock3d(nn.Module):
        resnet_eps: float = 1e-6,
        resnet_act_fn: str = "swish",
        is_causal: bool = True,
-        inject_noise: bool = False,
-        timestep_conditioning: bool = False,
    ) -> None:
        super().__init__()

-        self.time_embedder = None
-        if timestep_conditioning:
-            self.time_embedder = PixArtAlphaCombinedTimestepSizeEmbeddings(in_channels * 4, 0)
-
        resnets = []
        for _ in range(num_layers):
            resnets.append(
-                LTXVideoResnetBlock3d(
+                LTXResnetBlock3d(
                    in_channels=in_channels,
                    out_channels=in_channels,
                    dropout=dropout,
                    eps=resnet_eps,
                    non_linearity=resnet_act_fn,
                    is_causal=is_causal,
-                    inject_noise=inject_noise,
-                    timestep_conditioning=timestep_conditioning,
                )
            )
        self.resnets = nn.ModuleList(resnets)

        self.gradient_checkpointing = False

-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        temb: Optional[torch.Tensor] = None,
-        generator: Optional[torch.Generator] = None,
-    ) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        r"""Forward method of the `LTXMidBlock3D` class."""

-        if self.time_embedder is not None:
-            temb = self.time_embedder(
-                timestep=temb.flatten(),
-                resolution=None,
-                aspect_ratio=None,
-                batch_size=hidden_states.size(0),
-                hidden_dtype=hidden_states.dtype,
-            )
-            temb = temb.view(hidden_states.size(0), -1, 1, 1, 1)
-
        for i, resnet in enumerate(self.resnets):
            if torch.is_grad_enabled() and self.gradient_checkpointing:

@@ -445,18 +360,16 @@ class LTXVideoMidBlock3d(nn.Module):

                    return create_forward

-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(resnet), hidden_states, temb, generator
-                )
+                hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states)
            else:
-                hidden_states = resnet(hidden_states, temb, generator)
+                hidden_states = resnet(hidden_states)

        return hidden_states


-class LTXVideoUpBlock3d(nn.Module):
+class LTXUpBlock3d(nn.Module):
    r"""
-    Up block used in the LTXVideo model.
+    Up block used in the LTX model.

    Args:
        in_channels (`int`):
@@ -490,82 +403,45 @@ class LTXVideoUpBlock3d(nn.Module):
        resnet_act_fn: str = "swish",
        spatio_temporal_scale: bool = True,
        is_causal: bool = True,
-        inject_noise: bool = False,
-        timestep_conditioning: bool = False,
-        upsample_residual: bool = False,
-        upscale_factor: int = 1,
    ):
        super().__init__()

        out_channels = out_channels or in_channels

-        self.time_embedder = None
-        if timestep_conditioning:
-            self.time_embedder = PixArtAlphaCombinedTimestepSizeEmbeddings(in_channels * 4, 0)
-
        self.conv_in = None
        if in_channels != out_channels:
-            self.conv_in = LTXVideoResnetBlock3d(
+            self.conv_in = LTXResnetBlock3d(
                in_channels=in_channels,
                out_channels=out_channels,
                dropout=dropout,
                eps=resnet_eps,
                non_linearity=resnet_act_fn,
                is_causal=is_causal,
-                inject_noise=inject_noise,
-                timestep_conditioning=timestep_conditioning,
            )

        self.upsamplers = None
        if spatio_temporal_scale:
-            self.upsamplers = nn.ModuleList(
-                [
-                    LTXVideoUpsampler3d(
-                        out_channels * upscale_factor,
-                        stride=(2, 2, 2),
-                        is_causal=is_causal,
-                        residual=upsample_residual,
-                        upscale_factor=upscale_factor,
-                    )
-                ]
-            )
+            self.upsamplers = nn.ModuleList([LTXUpsampler3d(out_channels, stride=(2, 2, 2), is_causal=is_causal)])

        resnets = []
        for _ in range(num_layers):
            resnets.append(
-                LTXVideoResnetBlock3d(
+                LTXResnetBlock3d(
                    in_channels=out_channels,
                    out_channels=out_channels,
                    dropout=dropout,
                    eps=resnet_eps,
                    non_linearity=resnet_act_fn,
                    is_causal=is_causal,
-                    inject_noise=inject_noise,
-                    timestep_conditioning=timestep_conditioning,
                )
            )
        self.resnets = nn.ModuleList(resnets)

        self.gradient_checkpointing = False

-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        temb: Optional[torch.Tensor] = None,
-        generator: Optional[torch.Generator] = None,
-    ) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.conv_in is not None:
-            hidden_states = self.conv_in(hidden_states, temb, generator)
-
-        if self.time_embedder is not None:
-            temb = self.time_embedder(
-                timestep=temb.flatten(),
-                resolution=None,
-                aspect_ratio=None,
-                batch_size=hidden_states.size(0),
-                hidden_dtype=hidden_states.dtype,
-            )
-            temb = temb.view(hidden_states.size(0), -1, 1, 1, 1)
+            hidden_states = self.conv_in(hidden_states)

        if self.upsamplers is not None:
            for upsampler in self.upsamplers:
@@ -580,18 +456,16 @@ class LTXVideoUpBlock3d(nn.Module):

                    return create_forward

-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(resnet), hidden_states, temb, generator
-                )
+                hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states)
            else:
-                hidden_states = resnet(hidden_states, temb, generator)
+                hidden_states = resnet(hidden_states)

        return hidden_states


-class LTXVideoEncoder3d(nn.Module):
+class LTXEncoder3d(nn.Module):
    r"""
-    The `LTXVideoEncoder3d` layer of a variational autoencoder that encodes input video samples to its latent
+    The `LTXEncoder3d` layer of a variational autoencoder that encodes input video samples to its latent
    representation.

    Args:
@@ -635,7 +509,7 @@ class LTXVideoEncoder3d(nn.Module):

        output_channel = block_out_channels[0]

-        self.conv_in = LTXVideoCausalConv3d(
+        self.conv_in = LTXCausalConv3d(
            in_channels=self.in_channels,
            out_channels=output_channel,
            kernel_size=3,
@@ -650,7 +524,7 @@ class LTXVideoEncoder3d(nn.Module):
            input_channel = output_channel
            output_channel = block_out_channels[i + 1] if i + 1 < num_block_out_channels else block_out_channels[i]

-            down_block = LTXVideoDownBlock3D(
+            down_block = LTXDownBlock3D(
                in_channels=input_channel,
                out_channels=output_channel,
                num_layers=layers_per_block[i],
@@ -662,7 +536,7 @@ class LTXVideoEncoder3d(nn.Module):
            self.down_blocks.append(down_block)

        # mid block
-        self.mid_block = LTXVideoMidBlock3d(
+        self.mid_block = LTXMidBlock3d(
            in_channels=output_channel,
            num_layers=layers_per_block[-1],
            resnet_eps=resnet_norm_eps,
@@ -672,14 +546,14 @@ class LTXVideoEncoder3d(nn.Module):
        # out
        self.norm_out = RMSNorm(out_channels, eps=1e-8, elementwise_affine=False)
        self.conv_act = nn.SiLU()
-        self.conv_out = LTXVideoCausalConv3d(
+        self.conv_out = LTXCausalConv3d(
            in_channels=output_channel, out_channels=out_channels + 1, kernel_size=3, stride=1, is_causal=is_causal
        )

        self.gradient_checkpointing = False

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        r"""The forward method of the `LTXVideoEncoder3d` class."""
+        r"""The forward method of the `LTXEncoder3d` class."""

        p = self.patch_size
        p_t = self.patch_size_t
@@ -725,10 +599,9 @@ class LTXVideoEncoder3d(nn.Module):
        return hidden_states


-class LTXVideoDecoder3d(nn.Module):
+class LTXDecoder3d(nn.Module):
    r"""
-    The `LTXVideoDecoder3d` layer of a variational autoencoder that decodes its latent representation into an output
-    sample.
+    The `LTXDecoder3d` layer of a variational autoencoder that decodes its latent representation into an output sample.

    Args:
        in_channels (`int`, defaults to 128):
@@ -749,8 +622,6 @@ class LTXVideoDecoder3d(nn.Module):
            Epsilon value for ResNet normalization layers.
        is_causal (`bool`, defaults to `False`):
            Whether this layer behaves causally (future frames depend only on past frames) or not.
-        timestep_conditioning (`bool`, defaults to `False`):
-            Whether to condition the model on timesteps.
    """

    def __init__(
@@ -764,10 +635,6 @@ class LTXVideoDecoder3d(nn.Module):
        patch_size_t: int = 1,
        resnet_norm_eps: float = 1e-6,
        is_causal: bool = False,
-        inject_noise: Tuple[bool, ...] = (False, False, False, False),
-        timestep_conditioning: bool = False,
-        upsample_residual: Tuple[bool, ...] = (False, False, False, False),
-        upsample_factor: Tuple[bool, ...] = (1, 1, 1, 1),
    ) -> None:
        super().__init__()

@@ -778,42 +645,30 @@ class LTXVideoDecoder3d(nn.Module):
        block_out_channels = tuple(reversed(block_out_channels))
        spatio_temporal_scaling = tuple(reversed(spatio_temporal_scaling))
        layers_per_block = tuple(reversed(layers_per_block))
-        inject_noise = tuple(reversed(inject_noise))
-        upsample_residual = tuple(reversed(upsample_residual))
-        upsample_factor = tuple(reversed(upsample_factor))
        output_channel = block_out_channels[0]

-        self.conv_in = LTXVideoCausalConv3d(
+        self.conv_in = LTXCausalConv3d(
            in_channels=in_channels, out_channels=output_channel, kernel_size=3, stride=1, is_causal=is_causal
        )

-        self.mid_block = LTXVideoMidBlock3d(
-            in_channels=output_channel,
-            num_layers=layers_per_block[0],
-            resnet_eps=resnet_norm_eps,
-            is_causal=is_causal,
-            inject_noise=inject_noise[0],
-            timestep_conditioning=timestep_conditioning,
+        self.mid_block = LTXMidBlock3d(
+            in_channels=output_channel, num_layers=layers_per_block[0], resnet_eps=resnet_norm_eps, is_causal=is_causal
        )

        # up blocks
        num_block_out_channels = len(block_out_channels)
        self.up_blocks = nn.ModuleList([])
        for i in range(num_block_out_channels):
-            input_channel = output_channel // upsample_factor[i]
-            output_channel = block_out_channels[i] // upsample_factor[i]
+            input_channel = output_channel
+            output_channel = block_out_channels[i]

-            up_block = LTXVideoUpBlock3d(
+            up_block = LTXUpBlock3d(
                in_channels=input_channel,
                out_channels=output_channel,
                num_layers=layers_per_block[i + 1],
                resnet_eps=resnet_norm_eps,
                spatio_temporal_scale=spatio_temporal_scaling[i],
                is_causal=is_causal,
-                inject_noise=inject_noise[i + 1],
-                timestep_conditioning=timestep_conditioning,
-                upsample_residual=upsample_residual[i],
-                upscale_factor=upsample_factor[i],
            )

            self.up_blocks.append(up_block)
@@ -821,20 +676,13 @@ class LTXVideoDecoder3d(nn.Module):
        # out
        self.norm_out = RMSNorm(out_channels, eps=1e-8, elementwise_affine=False)
        self.conv_act = nn.SiLU()
-        self.conv_out = LTXVideoCausalConv3d(
+        self.conv_out = LTXCausalConv3d(
            in_channels=output_channel, out_channels=self.out_channels, kernel_size=3, stride=1, is_causal=is_causal
        )

-        # timestep embedding
-        self.time_embedder = None
-        self.scale_shift_table = None
-        if timestep_conditioning:
-            self.time_embedder = PixArtAlphaCombinedTimestepSizeEmbeddings(output_channel * 2, 0)
-            self.scale_shift_table = nn.Parameter(torch.randn(2, output_channel) / output_channel**0.5)
-
        self.gradient_checkpointing = False

-    def forward(self, hidden_states: torch.Tensor, temb: Optional[torch.Tensor] = None) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.conv_in(hidden_states)

        if torch.is_grad_enabled() and self.gradient_checkpointing:
@@ -845,33 +693,17 @@ class LTXVideoDecoder3d(nn.Module):

                return create_forward

-            hidden_states = torch.utils.checkpoint.checkpoint(
-                create_custom_forward(self.mid_block), hidden_states, temb
-            )
+            hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(self.mid_block), hidden_states)

            for up_block in self.up_blocks:
-                hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(up_block), hidden_states, temb)
+                hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(up_block), hidden_states)
        else:
-            hidden_states = self.mid_block(hidden_states, temb)
+            hidden_states = self.mid_block(hidden_states)

            for up_block in self.up_blocks:
-                hidden_states = up_block(hidden_states, temb)
+                hidden_states = up_block(hidden_states)

        hidden_states = self.norm_out(hidden_states.movedim(1, -1)).movedim(-1, 1)
-
-        if self.time_embedder is not None:
-            temb = self.time_embedder(
-                timestep=temb.flatten(),
-                resolution=None,
-                aspect_ratio=None,
-                batch_size=hidden_states.size(0),
-                hidden_dtype=hidden_states.dtype,
-            )
-            temb = temb.view(hidden_states.size(0), -1, 1, 1, 1).unflatten(1, (2, -1))
-            temb = temb + self.scale_shift_table[None, ..., None, None, None]
-            shift, scale = temb.unbind(dim=1)
-            hidden_states = hidden_states * (1 + scale) + shift
-
        hidden_states = self.conv_act(hidden_states)
        hidden_states = self.conv_out(hidden_states)

@@ -934,15 +766,8 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
        out_channels: int = 3,
        latent_channels: int = 128,
        block_out_channels: Tuple[int, ...] = (128, 256, 512, 512),
-        decoder_block_out_channels: Tuple[int, ...] = (128, 256, 512, 512),
-        layers_per_block: Tuple[int, ...] = (4, 3, 3, 3, 4),
-        decoder_layers_per_block: Tuple[int, ...] = (4, 3, 3, 3, 4),
        spatio_temporal_scaling: Tuple[bool, ...] = (True, True, True, False),
-        decoder_spatio_temporal_scaling: Tuple[bool, ...] = (True, True, True, False),
-        decoder_inject_noise: Tuple[bool, ...] = (False, False, False, False, False),
-        upsample_residual: Tuple[bool, ...] = (False, False, False, False),
-        upsample_factor: Tuple[int, ...] = (1, 1, 1, 1),
-        timestep_conditioning: bool = False,
+        layers_per_block: Tuple[int, ...] = (4, 3, 3, 3, 4),
        patch_size: int = 4,
        patch_size_t: int = 1,
        resnet_norm_eps: float = 1e-6,
@@ -952,7 +777,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
    ) -> None:
        super().__init__()

-        self.encoder = LTXVideoEncoder3d(
+        self.encoder = LTXEncoder3d(
            in_channels=in_channels,
            out_channels=latent_channels,
            block_out_channels=block_out_channels,
@@ -963,20 +788,16 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
            resnet_norm_eps=resnet_norm_eps,
            is_causal=encoder_causal,
        )
-        self.decoder = LTXVideoDecoder3d(
+        self.decoder = LTXDecoder3d(
            in_channels=latent_channels,
            out_channels=out_channels,
-            block_out_channels=decoder_block_out_channels,
-            spatio_temporal_scaling=decoder_spatio_temporal_scaling,
-            layers_per_block=decoder_layers_per_block,
+            block_out_channels=block_out_channels,
+            spatio_temporal_scaling=spatio_temporal_scaling,
+            layers_per_block=layers_per_block,
            patch_size=patch_size,
            patch_size_t=patch_size_t,
            resnet_norm_eps=resnet_norm_eps,
            is_causal=decoder_causal,
-            timestep_conditioning=timestep_conditioning,
-            inject_noise=decoder_inject_noise,
-            upsample_residual=upsample_residual,
-            upsample_factor=upsample_factor,
        )

        latents_mean = torch.zeros((latent_channels,), requires_grad=False)
@@ -1016,7 +837,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
        self.tile_sample_stride_width = 448

    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, (LTXVideoEncoder3d, LTXVideoDecoder3d)):
+        if isinstance(module, (LTXEncoder3d, LTXDecoder3d)):
            module.gradient_checkpointing = value

    def enable_tiling(
@@ -1115,15 +936,13 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
            return (posterior,)
        return AutoencoderKLOutput(latent_dist=posterior)

-    def _decode(
-        self, z: torch.Tensor, temb: Optional[torch.Tensor] = None, return_dict: bool = True
-    ) -> Union[DecoderOutput, torch.Tensor]:
+    def _decode(self, z: torch.Tensor, return_dict: bool = True) -> Union[DecoderOutput, torch.Tensor]:
        batch_size, num_channels, num_frames, height, width = z.shape
        tile_latent_min_height = self.tile_sample_min_height // self.spatial_compression_ratio
        tile_latent_min_width = self.tile_sample_stride_width // self.spatial_compression_ratio

        if self.use_tiling and (width > tile_latent_min_width or height > tile_latent_min_height):
-            return self.tiled_decode(z, temb, return_dict=return_dict)
+            return self.tiled_decode(z, return_dict=return_dict)

        if self.use_framewise_decoding:
            # TODO(aryan): requires investigation
@@ -1133,7 +952,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
                "should be possible, please submit a PR to https://github.com/huggingface/diffusers/pulls."
            )
        else:
-            dec = self.decoder(z, temb)
+            dec = self.decoder(z)

        if not return_dict:
            return (dec,)
@@ -1141,9 +960,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
        return DecoderOutput(sample=dec)

    @apply_forward_hook
-    def decode(
-        self, z: torch.Tensor, temb: Optional[torch.Tensor] = None, return_dict: bool = True
-    ) -> Union[DecoderOutput, torch.Tensor]:
+    def decode(self, z: torch.Tensor, return_dict: bool = True) -> Union[DecoderOutput, torch.Tensor]:
        """
        Decode a batch of images.

@@ -1158,15 +975,10 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
                returned.
        """
        if self.use_slicing and z.shape[0] > 1:
-            if temb is not None:
-                decoded_slices = [
-                    self._decode(z_slice, t_slice).sample for z_slice, t_slice in (z.split(1), temb.split(1))
-                ]
-            else:
-                decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
+            decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
            decoded = torch.cat(decoded_slices)
        else:
-            decoded = self._decode(z, temb).sample
+            decoded = self._decode(z).sample

        if not return_dict:
            return (decoded,)
@@ -1248,9 +1060,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
        enc = torch.cat(result_rows, dim=3)[:, :, :, :latent_height, :latent_width]
        return enc

-    def tiled_decode(
-        self, z: torch.Tensor, temb: Optional[torch.Tensor], return_dict: bool = True
-    ) -> Union[DecoderOutput, torch.Tensor]:
+    def tiled_decode(self, z: torch.Tensor, return_dict: bool = True) -> Union[DecoderOutput, torch.Tensor]:
        r"""
        Decode a batch of images using a tiled decoder.

@@ -1291,9 +1101,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
                        "should be possible, please submit a PR to https://github.com/huggingface/diffusers/pulls."
                    )
                else:
-                    time = self.decoder(
-                        z[:, :, :, i : i + tile_latent_min_height, j : j + tile_latent_min_width], temb
-                    )
+                    time = self.decoder(z[:, :, :, i : i + tile_latent_min_height, j : j + tile_latent_min_width])

                row.append(time)
            rows.append(row)
@@ -1321,7 +1129,6 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
    def forward(
        self,
        sample: torch.Tensor,
-        temb: Optional[torch.Tensor] = None,
        sample_posterior: bool = False,
        return_dict: bool = True,
        generator: Optional[torch.Generator] = None,
@@ -1332,7 +1139,7 @@ class AutoencoderKLLTXVideo(ModelMixin, ConfigMixin, FromOriginalModelMixin):
            z = posterior.sample(generator=generator)
        else:
            z = posterior.mode()
-        dec = self.decode(z, temb)
+        dec = self.decode(z)
        if not return_dict:
            return (dec,)
        return dec
@@ -691,7 +691,7 @@ class CogVideoXPatchEmbed(nn.Module):
            output_type="pt",
        )
        pos_embedding = pos_embedding.flatten(0, 1)
-        joint_pos_embedding = pos_embedding.new_zeros(
+        joint_pos_embedding = torch.zeros(
            1, self.max_text_seq_length + num_patches, self.embed_dim, requires_grad=False
        )
        joint_pos_embedding.data[:, self.max_text_seq_length :].copy_(pos_embedding)
@@ -748,10 +748,10 @@ class CogVideoXPatchEmbed(nn.Module):
                pos_embedding = self._get_positional_embeddings(
                    height, width, pre_time_compression_frames, device=embeds.device
                )
+                pos_embedding = pos_embedding.to(dtype=embeds.dtype)
            else:
                pos_embedding = self.pos_embedding

-            pos_embedding = pos_embedding.to(dtype=embeds.dtype)
            embeds = embeds + pos_embedding

        return embeds
@@ -1535,7 +1535,7 @@ class ImageProjection(nn.Module):
        batch_size = image_embeds.shape[0]

        # image
-        image_embeds = self.image_embeds(image_embeds.to(self.image_embeds.weight.dtype))
+        image_embeds = self.image_embeds(image_embeds)
        image_embeds = image_embeds.reshape(batch_size, self.num_image_text_embeds, -1)
        image_embeds = self.norm(image_embeds)
        return image_embeds
@@ -2396,187 +2396,6 @@ class IPAdapterFaceIDPlusImageProjection(nn.Module):
        return out


-class IPAdapterTimeImageProjectionBlock(nn.Module):
-    """Block for IPAdapterTimeImageProjection.
-
-    Args:
-        hidden_dim (`int`, defaults to 1280):
-            The number of hidden channels.
-        dim_head (`int`, defaults to 64):
-            The number of head channels.
-        heads (`int`, defaults to 20):
-            Parallel attention heads.
-        ffn_ratio (`int`, defaults to 4):
-            The expansion ratio of feedforward network hidden layer channels.
-    """
-
-    def __init__(
-        self,
-        hidden_dim: int = 1280,
-        dim_head: int = 64,
-        heads: int = 20,
-        ffn_ratio: int = 4,
-    ) -> None:
-        super().__init__()
-        from .attention import FeedForward
-
-        self.ln0 = nn.LayerNorm(hidden_dim)
-        self.ln1 = nn.LayerNorm(hidden_dim)
-        self.attn = Attention(
-            query_dim=hidden_dim,
-            cross_attention_dim=hidden_dim,
-            dim_head=dim_head,
-            heads=heads,
-            bias=False,
-            out_bias=False,
-        )
-        self.ff = FeedForward(hidden_dim, hidden_dim, activation_fn="gelu", mult=ffn_ratio, bias=False)
-
-        # AdaLayerNorm
-        self.adaln_silu = nn.SiLU()
-        self.adaln_proj = nn.Linear(hidden_dim, 4 * hidden_dim)
-        self.adaln_norm = nn.LayerNorm(hidden_dim)
-
-        # Set attention scale and fuse KV
-        self.attn.scale = 1 / math.sqrt(math.sqrt(dim_head))
-        self.attn.fuse_projections()
-        self.attn.to_k = None
-        self.attn.to_v = None
-
-    def forward(self, x: torch.Tensor, latents: torch.Tensor, timestep_emb: torch.Tensor) -> torch.Tensor:
-        """Forward pass.
-
-        Args:
-            x (`torch.Tensor`):
-                Image features.
-            latents (`torch.Tensor`):
-                Latent features.
-            timestep_emb (`torch.Tensor`):
-                Timestep embedding.
-
-        Returns:
-            `torch.Tensor`: Output latent features.
-        """
-
-        # Shift and scale for AdaLayerNorm
-        emb = self.adaln_proj(self.adaln_silu(timestep_emb))
-        shift_msa, scale_msa, shift_mlp, scale_mlp = emb.chunk(4, dim=1)
-
-        # Fused Attention
-        residual = latents
-        x = self.ln0(x)
-        latents = self.ln1(latents) * (1 + scale_msa[:, None]) + shift_msa[:, None]
-
-        batch_size = latents.shape[0]
-
-        query = self.attn.to_q(latents)
-        kv_input = torch.cat((x, latents), dim=-2)
-        key, value = self.attn.to_kv(kv_input).chunk(2, dim=-1)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // self.attn.heads
-
-        query = query.view(batch_size, -1, self.attn.heads, head_dim).transpose(1, 2)
-        key = key.view(batch_size, -1, self.attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, self.attn.heads, head_dim).transpose(1, 2)
-
-        weight = (query * self.attn.scale) @ (key * self.attn.scale).transpose(-2, -1)
-        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
-        latents = weight @ value
-
-        latents = latents.transpose(1, 2).reshape(batch_size, -1, self.attn.heads * head_dim)
-        latents = self.attn.to_out[0](latents)
-        latents = self.attn.to_out[1](latents)
-        latents = latents + residual
-
-        ## FeedForward
-        residual = latents
-        latents = self.adaln_norm(latents) * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
-        return self.ff(latents) + residual
-
-
-# Modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py
-class IPAdapterTimeImageProjection(nn.Module):
-    """Resampler of SD3 IP-Adapter with timestep embedding.
-
-    Args:
-        embed_dim (`int`, defaults to 1152):
-            The feature dimension.
-        output_dim (`int`, defaults to 2432):
-            The number of output channels.
-        hidden_dim (`int`, defaults to 1280):
-            The number of hidden channels.
-        depth (`int`, defaults to 4):
-            The number of blocks.
-        dim_head (`int`, defaults to 64):
-            The number of head channels.
-        heads (`int`, defaults to 20):
-            Parallel attention heads.
-        num_queries (`int`, defaults to 64):
-            The number of queries.
-        ffn_ratio (`int`, defaults to 4):
-            The expansion ratio of feedforward network hidden layer channels.
-        timestep_in_dim (`int`, defaults to 320):
-            The number of input channels for timestep embedding.
-        timestep_flip_sin_to_cos (`bool`, defaults to True):
-            Flip the timestep embedding order to `cos, sin` (if True) or `sin, cos` (if False).
-        timestep_freq_shift (`int`, defaults to 0):
-            Controls the timestep delta between frequencies between dimensions.
-    """
-
-    def __init__(
-        self,
-        embed_dim: int = 1152,
-        output_dim: int = 2432,
-        hidden_dim: int = 1280,
-        depth: int = 4,
-        dim_head: int = 64,
-        heads: int = 20,
-        num_queries: int = 64,
-        ffn_ratio: int = 4,
-        timestep_in_dim: int = 320,
-        timestep_flip_sin_to_cos: bool = True,
-        timestep_freq_shift: int = 0,
-    ) -> None:
-        super().__init__()
-        self.latents = nn.Parameter(torch.randn(1, num_queries, hidden_dim) / hidden_dim**0.5)
-        self.proj_in = nn.Linear(embed_dim, hidden_dim)
-        self.proj_out = nn.Linear(hidden_dim, output_dim)
-        self.norm_out = nn.LayerNorm(output_dim)
-        self.layers = nn.ModuleList(
-            [IPAdapterTimeImageProjectionBlock(hidden_dim, dim_head, heads, ffn_ratio) for _ in range(depth)]
-        )
-        self.time_proj = Timesteps(timestep_in_dim, timestep_flip_sin_to_cos, timestep_freq_shift)
-        self.time_embedding = TimestepEmbedding(timestep_in_dim, hidden_dim, act_fn="silu")
-
-    def forward(self, x: torch.Tensor, timestep: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-        """Forward pass.
-
-        Args:
-            x (`torch.Tensor`):
-                Image features.
-            timestep (`torch.Tensor`):
-                Timestep in denoising process.
-        Returns:
-            `Tuple`[`torch.Tensor`, `torch.Tensor`]: The pair (latents, timestep_emb).
-        """
-        timestep_emb = self.time_proj(timestep).to(dtype=x.dtype)
-        timestep_emb = self.time_embedding(timestep_emb)
-
-        latents = self.latents.repeat(x.size(0), 1, 1)
-
-        x = self.proj_in(x)
-        x = x + timestep_emb[:, None]
-
-        for block in self.layers:
-            latents = block(x, latents, timestep_emb)
-
-        latents = self.proj_out(latents)
-        latents = self.norm_out(latents)
-
-        return latents, timestep_emb
-
-
 class MultiIPAdapterImageProjection(nn.Module):
    def __init__(self, IPAdapterImageProjectionLayers: Union[List[nn.Module], Tuple[nn.Module]]):
        super().__init__()
@@ -228,7 +228,7 @@ def load_model_dict_into_meta(
            else:
                model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
                raise ValueError(
-                    f"Cannot load {model_name_or_path_str} because {param_name} expected shape {empty_state_dict[param_name].shape}, but got {param.shape}. If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example."
+                    f"Cannot load {model_name_or_path_str} because {param_name} expected shape {empty_state_dict[param_name]}, but got {param.shape}. If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example."
                )

        if is_quantized and (
@@ -99,39 +99,21 @@ def get_parameter_device(parameter: torch.nn.Module) -> torch.device:


 def get_parameter_dtype(parameter: torch.nn.Module) -> torch.dtype:
-    """
-    Returns the first found floating dtype in parameters if there is one, otherwise returns the last dtype it found.
-    """
-    last_dtype = None
-    for param in parameter.parameters():
-        last_dtype = param.dtype
-        if param.is_floating_point():
-            return param.dtype
+    try:
+        return next(parameter.parameters()).dtype
+    except StopIteration:
+        try:
+            return next(parameter.buffers()).dtype
+        except StopIteration:
+            # For torch.nn.DataParallel compatibility in PyTorch 1.5

-    for buffer in parameter.buffers():
-        last_dtype = buffer.dtype
-        if buffer.is_floating_point():
-            return buffer.dtype
+            def find_tensor_attributes(module: torch.nn.Module) -> List[Tuple[str, Tensor]]:
+                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
+                return tuples

-    if last_dtype is not None:
-        # if no floating dtype was found return whatever the first dtype is
-        return last_dtype
-
-    # For nn.DataParallel compatibility in PyTorch > 1.5
-    def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:
-        tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
-        return tuples
-
-    gen = parameter._named_members(get_members_fn=find_tensor_attributes)
-    last_tuple = None
-    for tuple in gen:
-        last_tuple = tuple
-        if tuple[1].is_floating_point():
-            return tuple[1].dtype
-
-    if last_tuple is not None:
-        # fallback to the last dtype
-        return last_tuple[1].dtype
+            gen = parameter._named_members(get_members_fn=find_tensor_attributes)
+            first_tuple = next(gen)
+            return first_tuple[1].dtype


 class ModelMixin(torch.nn.Module, PushToHubMixin):
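Review note on the `get_parameter_dtype` hunk above: the removed lines return the first floating-point dtype found (falling back to the last dtype seen), while the added lines return the dtype of the first parameter, whatever it is. The two can disagree when a non-floating tensor, such as a packed quantized weight, comes first. A minimal sketch of the difference, assuming a hypothetical `FakeParam` stand-in instead of real `torch` tensors and ignoring the buffer and `DataParallel` fallbacks:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeParam:
    """Hypothetical stand-in for a tensor: a dtype name plus a float flag."""

    dtype: str
    floating: bool

    def is_floating_point(self) -> bool:
        return self.floating


def first_dtype(params: Iterable[FakeParam]) -> Optional[str]:
    """Mirrors the added lines: dtype of the first parameter, whatever it is."""
    for p in params:
        return p.dtype
    return None  # the real code falls through to buffers here


def first_floating_dtype(params: Iterable[FakeParam]) -> Optional[str]:
    """Mirrors the removed lines: first floating dtype, else the last dtype seen."""
    last = None
    for p in params:
        last = p.dtype
        if p.is_floating_point():
            return p.dtype
    return last


# A packed integer weight listed before a float16 weight.
params = [FakeParam("uint8", False), FakeParam("float16", True)]
print(first_dtype(params))           # uint8
print(first_floating_dtype(params))  # float16
```

If callers use the result as a compute dtype, the first variant may hand back an integer dtype for such a model, which is worth keeping in mind when reading this hunk.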
@@ -718,9 +700,10 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
            hf_quantizer = None

        if hf_quantizer is not None:
-            if device_map is not None:
+            is_bnb_quantization_method = hf_quantizer.quantization_config.quant_method.value == "bitsandbytes"
+            if is_bnb_quantization_method and device_map is not None:
                raise NotImplementedError(
-                    "Currently, providing `device_map` is not supported for quantized models. Providing `device_map` as an input will be added in the future."
+                    "Currently, `device_map` is automatically inferred for quantized bitsandbytes models. Support for providing `device_map` as an input will be added in the future."
                )

            hf_quantizer.validate_environment(torch_dtype=torch_dtype, from_flax=from_flax, device_map=device_map)
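Review note on the `device_map` hunk above: the removed condition rejects `device_map` for every quantized model, while the added condition rejects it only when the quantization method is `"bitsandbytes"`. A minimal sketch of the two guards, assuming hypothetical helper names (`guard_all_quantized`, `guard_bnb_only`) and a plain string in place of `hf_quantizer.quantization_config.quant_method.value`:

```python
from typing import Optional


def guard_all_quantized(quant_method: str, device_map: Optional[object]) -> None:
    """Mirrors the removed lines: any quantized model plus device_map is rejected."""
    if device_map is not None:
        raise NotImplementedError("`device_map` is not supported for quantized models.")


def guard_bnb_only(quant_method: str, device_map: Optional[object]) -> None:
    """Mirrors the added lines: only bitsandbytes plus device_map is rejected."""
    if quant_method == "bitsandbytes" and device_map is not None:
        raise NotImplementedError("`device_map` is inferred for bitsandbytes models.")


def is_rejected(guard, quant_method: str, device_map: Optional[object]) -> bool:
    """Return True if the guard raises NotImplementedError for these inputs."""
    try:
        guard(quant_method, device_map)
    except NotImplementedError:
        return True
    return False


print(is_rejected(guard_all_quantized, "gguf", "auto"))      # True
print(is_rejected(guard_bnb_only, "gguf", "auto"))           # False
print(is_rejected(guard_bnb_only, "bitsandbytes", "auto"))   # True
```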
@@ -819,7 +802,6 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
                    revision=revision,
                    subfolder=subfolder or "",
                )
-                # TODO: https://github.com/huggingface/diffusers/issues/10013
                if hf_quantizer is not None:
                    model_file = _merge_sharded_checkpoints(sharded_ckpt_cached_folder, sharded_metadata)
                    logger.info("Merged sharded checkpoints as `hf_quantizer` is not None.")
@@ -242,7 +242,6 @@ class SanaTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
        patch_size: int = 1,
        norm_elementwise_affine: bool = False,
        norm_eps: float = 1e-6,
-        interpolation_scale: Optional[int] = None,
    ) -> None:
        super().__init__()

@@ -250,14 +249,14 @@ class SanaTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
        inner_dim = num_attention_heads * attention_head_dim

        # 1. Patch Embedding
-        interpolation_scale = interpolation_scale if interpolation_scale is not None else max(sample_size // 64, 1)
        self.patch_embed = PatchEmbed(
            height=sample_size,
            width=sample_size,
            patch_size=patch_size,
            in_channels=in_channels,
            embed_dim=inner_dim,
-            interpolation_scale=interpolation_scale,
+            interpolation_scale=None,
+            pos_embed_type=None,
        )

        # 2. Additional condition embeddings
@@ -21,7 +21,7 @@ import torch.nn as nn
 import torch.nn.functional as F

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import FluxTransformer2DLoadersMixin, FromOriginalModelMixin, PeftAdapterMixin
+from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
 from ...models.attention import FeedForward
 from ...models.attention_processor import (
    Attention,
@@ -177,18 +177,13 @@ class FluxTransformerBlock(nn.Module):
        )
        joint_attention_kwargs = joint_attention_kwargs or {}
        # Attention.
-        attention_outputs = self.attn(
+        attn_output, context_attn_output = self.attn(
            hidden_states=norm_hidden_states,
            encoder_hidden_states=norm_encoder_hidden_states,
            image_rotary_emb=image_rotary_emb,
            **joint_attention_kwargs,
        )

-        if len(attention_outputs) == 2:
-            attn_output, context_attn_output = attention_outputs
-        elif len(attention_outputs) == 3:
-            attn_output, context_attn_output, ip_attn_output = attention_outputs
-
        # Process attention outputs for the `hidden_states`.
        attn_output = gate_msa.unsqueeze(1) * attn_output
        hidden_states = hidden_states + attn_output
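Review note on the `FluxTransformerBlock` hunk above: the removed lines accept either two or three values from `self.attn` (the third being an IP-Adapter output), while the added lines unpack exactly two. A minimal sketch of what that means for a processor that returns three values, using hypothetical helper names and strings in place of tensors; the `else` branch in the flexible variant is an addition for the sketch and is not in the diff:

```python
from typing import Optional, Sequence, Tuple


def unpack_flexible(outputs: Sequence[str]) -> Tuple[str, str, Optional[str]]:
    """Mirrors the removed lines: two or three attention outputs are accepted."""
    ip_attn_output = None
    if len(outputs) == 2:
        attn_output, context_attn_output = outputs
    elif len(outputs) == 3:
        attn_output, context_attn_output, ip_attn_output = outputs
    else:
        # Sketch-only guard; the diff has no branch for other lengths.
        raise ValueError(f"unexpected number of outputs: {len(outputs)}")
    return attn_output, context_attn_output, ip_attn_output


def unpack_strict(outputs: Sequence[str]) -> Tuple[str, str]:
    """Mirrors the added lines: exactly two attention outputs are expected."""
    attn_output, context_attn_output = outputs
    return attn_output, context_attn_output


three = ("attn", "context", "ip")
print(unpack_flexible(three))  # ('attn', 'context', 'ip')

try:
    unpack_strict(three)
except ValueError as err:
    # Python raises ValueError when a 3-item sequence is unpacked into 2 names.
    print("strict unpack failed:", err)
```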
@@ -200,8 +195,6 @@ class FluxTransformerBlock(nn.Module):
        ff_output = gate_mlp.unsqueeze(1) * ff_output

        hidden_states = hidden_states + ff_output
-        if len(attention_outputs) == 3:
-            hidden_states = hidden_states + ip_attn_output

        # Process attention outputs for the `encoder_hidden_states`.

@@ -219,9 +212,7 @@ class FluxTransformerBlock(nn.Module):
        return encoder_hidden_states, hidden_states


-class FluxTransformer2DModel(
-    ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin, FluxTransformer2DLoadersMixin
-):
+class FluxTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    """
    The Transformer model introduced in Flux.

@@ -491,11 +482,6 @@ class FluxTransformer2DModel(
        ids = torch.cat((txt_ids, img_ids), dim=0)
        image_rotary_emb = self.pos_embed(ids)

-        if joint_attention_kwargs is not None and "ip_adapter_image_embeds" in joint_attention_kwargs:
-            ip_adapter_image_embeds = joint_attention_kwargs.pop("ip_adapter_image_embeds")
-            ip_hidden_states = self.encoder_hid_proj(ip_adapter_image_embeds)
-            joint_attention_kwargs.update({"ip_hidden_states": ip_hidden_states})
-
        for index_block, block in enumerate(self.transformer_blocks):
            if torch.is_grad_enabled() and self.gradient_checkpointing:

@@ -18,11 +18,8 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F

-from diffusers.loaders import FromOriginalModelMixin
-
 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import PeftAdapterMixin
-from ...utils import USE_PEFT_BACKEND, is_torch_version, logging, scale_lora_layers, unscale_lora_layers
+from ...utils import is_torch_version
 from ..attention import FeedForward
 from ..attention_processor import Attention, AttentionProcessor
 from ..embeddings import (
@@ -35,9 +32,6 @@ from ..modeling_utils import ModelMixin
 from ..normalization import AdaLayerNormContinuous, AdaLayerNormZero, AdaLayerNormZeroSingle


-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
 class HunyuanVideoAttnProcessor2_0:
    def __init__(self):
        if not hasattr(F, "scaled_dot_product_attention"):
@@ -502,47 +496,7 @@ class HunyuanVideoTransformerBlock(nn.Module):
        return hidden_states, encoder_hidden_states


-class HunyuanVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
-    r"""
-    A Transformer model for video-like data used in [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo).
-
-    Args:
-        in_channels (`int`, defaults to `16`):
-            The number of channels in the input.
-        out_channels (`int`, defaults to `16`):
-            The number of channels in the output.
-        num_attention_heads (`int`, defaults to `24`):
-            The number of heads to use for multi-head attention.
-        attention_head_dim (`int`, defaults to `128`):
-            The number of channels in each head.
-        num_layers (`int`, defaults to `20`):
-            The number of layers of dual-stream blocks to use.
-        num_single_layers (`int`, defaults to `40`):
-            The number of layers of single-stream blocks to use.
-        num_refiner_layers (`int`, defaults to `2`):
-            The number of layers of refiner blocks to use.
-        mlp_ratio (`float`, defaults to `4.0`):
-            The ratio of the hidden layer size to the input size in the feedforward network.
-        patch_size (`int`, defaults to `2`):
-            The size of the spatial patches to use in the patch embedding layer.
-        patch_size_t (`int`, defaults to `1`):
-            The size of the tmeporal patches to use in the patch embedding layer.
-        qk_norm (`str`, defaults to `rms_norm`):
-            The normalization to use for the query and key projections in the attention layers.
-        guidance_embeds (`bool`, defaults to `True`):
-            Whether to use guidance embeddings in the model.
-        text_embed_dim (`int`, defaults to `4096`):
-            Input dimension of text embeddings from the text encoder.
-        pooled_projection_dim (`int`, defaults to `768`):
-            The dimension of the pooled projection of the text embeddings.
-        rope_theta (`float`, defaults to `256.0`):
-            The value of theta to use in the RoPE layer.
-        rope_axes_dim (`Tuple[int]`, defaults to `(16, 56, 56)`):
-            The dimensions of the axes to use in the RoPE layer.
-    """
-
-    _supports_gradient_checkpointing = True
-
+class HunyuanVideoTransformer3DModel(ModelMixin, ConfigMixin):
    @register_to_config
    def __init__(
        self,
@@ -676,24 +630,8 @@ class HunyuanVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin,
        encoder_attention_mask: torch.Tensor,
        pooled_projections: torch.Tensor,
        guidance: torch.Tensor = None,
-        attention_kwargs: Optional[Dict[str, Any]] = None,
        return_dict: bool = True,
    ) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
-        if attention_kwargs is not None:
-            attention_kwargs = attention_kwargs.copy()
-            lora_scale = attention_kwargs.pop("scale", 1.0)
-        else:
-            lora_scale = 1.0
-
-        if USE_PEFT_BACKEND:
-            # weight the lora layers by setting `lora_scale` for each PEFT layer
-            scale_lora_layers(self, lora_scale)
-        else:
-            if attention_kwargs is not None and attention_kwargs.get("scale", None) is not None:
-                logger.warning(
-                    "Passing `scale` via `attention_kwargs` when not using the PEFT backend is ineffective."
-                )
-
        batch_size, num_channels, num_frames, height, width = hidden_states.shape
        p, p_t = self.config.patch_size, self.config.patch_size_t
        post_patch_num_frames = num_frames // p_t
@@ -713,16 +651,14 @@ class HunyuanVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin,
        condition_sequence_length = encoder_hidden_states.shape[1]
        sequence_length = latent_sequence_length + condition_sequence_length
        attention_mask = torch.zeros(
-            batch_size, sequence_length, device=hidden_states.device, dtype=torch.bool
-        )  # [B, N]
+            batch_size, sequence_length, sequence_length, device=hidden_states.device, dtype=torch.bool
+        )  # [B, N, N]

        effective_condition_sequence_length = encoder_attention_mask.sum(dim=1, dtype=torch.int)  # [B,]
        effective_sequence_length = latent_sequence_length + effective_condition_sequence_length

        for i in range(batch_size):
-            attention_mask[i, : effective_sequence_length[i]] = True
-        # [B, 1, 1, N], for broadcasting across attention heads
-        attention_mask = attention_mask.unsqueeze(1).unsqueeze(1)
+            attention_mask[i, : effective_sequence_length[i], : effective_sequence_length[i]] = True

        # 4. Transformer blocks
        if torch.is_grad_enabled() and self.gradient_checkpointing:
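Review note on the attention-mask hunk above: the removed lines build a `[B, N]` mask and broadcast it to `[B, 1, 1, N]`, which masks padded keys only; the added lines build a `[B, N, N]` mask in which padded query rows are masked as well. A minimal sketch of the two shapes, assuming plain nested Python lists in place of `torch` tensors and hypothetical helper names:

```python
from typing import List


def key_padding_mask(seq_len: int, effective: List[int]) -> List[List[bool]]:
    """[B, N] mask as in the removed lines (before the two unsqueeze calls)."""
    return [[j < n for j in range(seq_len)] for n in effective]


def square_mask(seq_len: int, effective: List[int]) -> List[List[List[bool]]]:
    """[B, N, N] mask as in the added lines: True only in the top-left n x n block."""
    return [
        [[i < n and j < n for j in range(seq_len)] for i in range(seq_len)]
        for n in effective
    ]


# Two samples of length 4; the second has only 2 effective (non-padded) tokens.
m1 = key_padding_mask(4, [4, 2])
m2 = square_mask(4, [4, 2])

# Broadcasting m1 over queries gives every query row, padded or not, this pattern:
print(m1[1])     # [True, True, False, False]
# In the square mask, a valid query row looks the same...
print(m2[1][0])  # [True, True, False, False]
# ...but a padded query row (index 3) is masked out entirely.
print(m2[1][3])  # [False, False, False, False]
```

The square form costs `N` times more memory per sample than the broadcast form, which may matter for long video sequences.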
@@ -781,10 +717,6 @@ class HunyuanVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin,
        hidden_states = hidden_states.permute(0, 4, 1, 5, 2, 6, 3, 7)
        hidden_states = hidden_states.flatten(6, 7).flatten(4, 5).flatten(2, 3)

-        if USE_PEFT_BACKEND:
-            # remove `lora_scale` from each PEFT layer
-            unscale_lora_layers(self, lora_scale)
-
        if not return_dict:
            return (hidden_states,)

@@ -35,7 +35,7 @@ from ..normalization import AdaLayerNormSingle, RMSNorm
 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


-class LTXVideoAttentionProcessor2_0:
+class LTXAttentionProcessor2_0:
    r"""
    Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0). This is
    used in the LTX model. It applies a normalization layer and rotary embedding on the query and key vector.
@@ -44,7 +44,7 @@ class LTXVideoAttentionProcessor2_0:
    def __init__(self):
        if not hasattr(F, "scaled_dot_product_attention"):
            raise ImportError(
-                "LTXVideoAttentionProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
+                "LTXAttentionProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
            )

    def __call__(
@@ -92,7 +92,7 @@ class LTXVideoAttentionProcessor2_0:
        return hidden_states


-class LTXVideoRotaryPosEmbed(nn.Module):
+class LTXRotaryPosEmbed(nn.Module):
    def __init__(
        self,
        dim: int,
@@ -164,7 +164,7 @@ class LTXVideoRotaryPosEmbed(nn.Module):


@maybe_allow_in_graph
-class LTXVideoTransformerBlock(nn.Module):
+class LTXTransformerBlock(nn.Module):
    r"""
    Transformer block used in [LTX](https://huggingface.co/Lightricks/LTX-Video).

@@ -208,7 +208,7 @@ class LTXVideoTransformerBlock(nn.Module):
            cross_attention_dim=None,
            out_bias=attention_out_bias,
            qk_norm=qk_norm,
-            processor=LTXVideoAttentionProcessor2_0(),
+            processor=LTXAttentionProcessor2_0(),
        )

        self.norm2 = RMSNorm(dim, eps=eps, elementwise_affine=elementwise_affine)
@@ -221,7 +221,7 @@ class LTXVideoTransformerBlock(nn.Module):
            bias=attention_bias,
            out_bias=attention_out_bias,
            qk_norm=qk_norm,
-            processor=LTXVideoAttentionProcessor2_0(),
+            processor=LTXAttentionProcessor2_0(),
        )

        self.ff = FeedForward(dim, activation_fn=activation_fn)
@@ -327,7 +327,7 @@ class LTXVideoTransformer3DModel(ModelMixin, ConfigMixin, FromOriginalModelMixin

        self.caption_projection = PixArtAlphaTextProjection(in_features=caption_channels, hidden_size=inner_dim)

-        self.rope = LTXVideoRotaryPosEmbed(
+        self.rope = LTXRotaryPosEmbed(
            dim=inner_dim,
            base_num_frames=20,
            base_height=2048,
@@ -339,7 +339,7 @@ class LTXVideoTransformer3DModel(ModelMixin, ConfigMixin, FromOriginalModelMixin

        self.transformer_blocks = nn.ModuleList(
            [
-                LTXVideoTransformerBlock(
+                LTXTransformerBlock(
                    dim=inner_dim,
                    num_attention_heads=num_attention_heads,
                    attention_head_dim=attention_head_dim,
@@ -20,7 +20,6 @@ import torch.nn as nn

 from ...configuration_utils import ConfigMixin, register_to_config
 from ...loaders import PeftAdapterMixin
-from ...loaders.single_file_model import FromOriginalModelMixin
 from ...utils import USE_PEFT_BACKEND, is_torch_version, logging, scale_lora_layers, unscale_lora_layers
 from ...utils.torch_utils import maybe_allow_in_graph
 from ..attention import FeedForward
@@ -305,7 +304,7 @@ class MochiRoPE(nn.Module):


@maybe_allow_in_graph
-class MochiTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
+class MochiTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
    r"""
    A Transformer model for video-like data introduced in [Mochi](https://huggingface.co/genmo/mochi-1-preview).

@@ -335,7 +334,6 @@ class MochiTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOri
    """

    _supports_gradient_checkpointing = True
-    _no_split_modules = ["MochiTransformerBlock"]

    @register_to_config
    def __init__(
@@ -18,7 +18,7 @@ import torch.nn as nn
 import torch.nn.functional as F

 from ...configuration_utils import ConfigMixin, register_to_config
-from ...loaders import FromOriginalModelMixin, PeftAdapterMixin, SD3Transformer2DLoadersMixin
+from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
 from ...models.attention import FeedForward, JointTransformerBlock
 from ...models.attention_processor import (
    Attention,
@@ -103,9 +103,7 @@ class SD3SingleTransformerBlock(nn.Module):
        return hidden_states


-class SD3Transformer2DModel(
-    ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin, SD3Transformer2DLoadersMixin
-):
+class SD3Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    """
    The Transformer model introduced in Stable Diffusion 3.

@@ -351,8 +349,8 @@ class SD3Transformer2DModel(
                Input `hidden_states`.
            encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
-            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`):
-                Embeddings projected from the embeddings of input conditions.
+            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`): Embeddings projected
+                from the embeddings of input conditions.
            timestep (`torch.LongTensor`):
                Used to indicate denoising step.
            block_controlnet_hidden_states (`list` of `torch.Tensor`):
@@ -392,12 +390,6 @@ class SD3Transformer2DModel(
        temb = self.time_text_embed(timestep, pooled_projections)
        encoder_hidden_states = self.context_embedder(encoder_hidden_states)

-        if joint_attention_kwargs is not None and "ip_adapter_image_embeds" in joint_attention_kwargs:
-            ip_adapter_image_embeds = joint_attention_kwargs.pop("ip_adapter_image_embeds")
-            ip_hidden_states, ip_temb = self.image_proj(ip_adapter_image_embeds, timestep)
-
-            joint_attention_kwargs.update(ip_hidden_states=ip_hidden_states, temb=ip_temb)
-
        for index_block, block in enumerate(self.transformer_blocks):
            # Skip specified layers
            is_skip = True if skip_layers is not None and index_block in skip_layers else False
@@ -419,15 +411,11 @@ class SD3Transformer2DModel(
                    hidden_states,
                    encoder_hidden_states,
                    temb,
-                    joint_attention_kwargs,
                    **ckpt_kwargs,
                )
            elif not is_skip:
                encoder_hidden_states, hidden_states = block(
-                    hidden_states=hidden_states,
-                    encoder_hidden_states=encoder_hidden_states,
-                    temb=temb,
-                    joint_attention_kwargs=joint_attention_kwargs,
+                    hidden_states=hidden_states, encoder_hidden_states=encoder_hidden_states, temb=temb
                )

            # controlnet residual
@@ -89,8 +89,6 @@ class UNet2DModel(ModelMixin, ConfigMixin):
            conditioning with `class_embed_type` equal to `None`.
    """

-    _supports_gradient_checkpointing = True
-
    @register_to_config
    def __init__(
        self,
@@ -99,7 +97,6 @@ class UNet2DModel(ModelMixin, ConfigMixin):
        out_channels: int = 3,
        center_input_sample: bool = False,
        time_embedding_type: str = "positional",
-        time_embedding_dim: Optional[int] = None,
        freq_shift: int = 0,
        flip_sin_to_cos: bool = True,
        down_block_types: Tuple[str, ...] = ("DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D"),
@@ -125,7 +122,7 @@ class UNet2DModel(ModelMixin, ConfigMixin):
        super().__init__()

        self.sample_size = sample_size
-        time_embed_dim = time_embedding_dim or block_out_channels[0] * 4
+        time_embed_dim = block_out_channels[0] * 4

        # Check inputs
        if len(down_block_types) != len(up_block_types):
@@ -243,10 +240,6 @@ class UNet2DModel(ModelMixin, ConfigMixin):
        self.conv_act = nn.SiLU()
        self.conv_out = nn.Conv2d(block_out_channels[0], out_channels, kernel_size=3, padding=1)

-    def _set_gradient_checkpointing(self, module, value=False):
-        if hasattr(module, "gradient_checkpointing"):
-            module.gradient_checkpointing = value
-
    def forward(
        self,
        sample: torch.Tensor,
Author	SHA1	Message	Date
Dhruv Nair	91d92efab9	Update docs/source/en/quantization/gguf.md Co-authored-by: Aryan <aryan@huggingface.co>	2024-12-18 17:36:27 +05:30
DN6	da61e8f536	update	2024-12-18 10:48:20 +05:30