update

Merge branch 'main' into custom-code-updates
2025-07-28 05:33:00 +02:00 · 2025-07-28 08:32:01 +05:30 · 2025-07-23 10:23:35 -10:00 · 2025-07-23 06:19:40 -10:00 · 2025-07-23 19:42:46 +05:30 · 2025-07-23 17:49:38 +05:30
44 changed files with 2266 additions and 1350 deletions
@@ -1,141 +0,0 @@
-name: Fast PR tests for Modular
-
-on:
-  pull_request:
-    branches: [main]
-    paths:
-      - "src/diffusers/modular_pipelines/**.py"
-      - "src/diffusers/models/modeling_utils.py"
-      - "src/diffusers/models/model_loading_utils.py"
-      - "src/diffusers/pipelines/pipeline_utils.py"
-      - "src/diffusers/pipeline_loading_utils.py"
-      - "src/diffusers/loaders/lora_base.py"
-      - "src/diffusers/loaders/lora_pipeline.py"
-      - "src/diffusers/loaders/peft.py"
-      - "tests/modular_pipelines/**.py"
-      - ".github/**.yml"
-      - "utils/**.py"
-      - "setup.py"
-  push:
-    branches:
-      - ci-*
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
-  cancel-in-progress: true
-
-env:
-  DIFFUSERS_IS_CI: yes
-  HF_HUB_ENABLE_HF_TRANSFER: 1
-  OMP_NUM_THREADS: 4
-  MKL_NUM_THREADS: 4
-  PYTEST_TIMEOUT: 60
-
-jobs:
-  check_code_quality:
-    runs-on: ubuntu-22.04
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.10"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install .[quality]
-      - name: Check quality
-        run: make quality
-      - name: Check if failure
-        if: ${{ failure() }}
-        run: |
-          echo "Quality check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY
-
-  check_repository_consistency:
-    needs: check_code_quality
-    runs-on: ubuntu-22.04
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: "3.10"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install .[quality]
-      - name: Check repo consistency
-        run: |
-          python utils/check_copies.py
-          python utils/check_dummies.py
-          python utils/check_support_list.py
-          make deps_table_check_updated
-      - name: Check if failure
-        if: ${{ failure() }}
-        run: |
-          echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY
-
-  run_fast_tests:
-    needs: [check_code_quality, check_repository_consistency]
-    strategy:
-      fail-fast: false
-      matrix:
-        config:
-          - name: Fast PyTorch Modular Pipeline CPU tests
-            framework: pytorch_pipelines
-            runner: aws-highmemory-32-plus
-            image: diffusers/diffusers-pytorch-cpu
-            report: torch_cpu_modular_pipelines
-
-    name: ${{ matrix.config.name }}
-
-    runs-on:
-      group: ${{ matrix.config.runner }}
-
-    container:
-      image: ${{ matrix.config.image }}
-      options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/
-
-    defaults:
-      run:
-        shell: bash
-
-    steps:
-    - name: Checkout diffusers
-      uses: actions/checkout@v3
-      with:
-        fetch-depth: 2
-
-    - name: Install dependencies
-      run: |
-        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
-        python -m uv pip install -e [quality,test]
-        pip uninstall transformers -y && python -m uv pip install -U transformers@git+https://github.com/huggingface/transformers.git --no-deps
-        pip uninstall accelerate -y && python -m uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
-
-    - name: Environment
-      run: |
-        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
-        python utils/print_env.py
-
-    - name: Run fast PyTorch Pipeline CPU tests
-      if: ${{ matrix.config.framework == 'pytorch_pipelines' }}
-      run: |
-        python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
-        python -m pytest -n 8 --max-worker-restart=0 --dist=loadfile \
-          -s -v -k "not Flax and not Onnx" \
-          --make-reports=tests_${{ matrix.config.report }} \
-          tests/modular_pipelines
-
-    - name: Failure short reports
-      if: ${{ failure() }}
-      run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt
-
-    - name: Test suite reports artifacts
-      if: ${{ always() }}
-      uses: actions/upload-artifact@v4
-      with:
-        name: pr_${{ matrix.config.framework }}_${{ matrix.config.report }}_test_reports
-        path: reports
-
-
@@ -971,6 +971,7 @@ class DreamBoothDataset(Dataset):

    def __init__(
        self,
+        args,
        instance_data_root,
        instance_prompt,
        class_prompt,
@@ -980,10 +981,8 @@ class DreamBoothDataset(Dataset):
        class_num=None,
        size=1024,
        repeats=1,
-        center_crop=False,
    ):
        self.size = size
-        self.center_crop = center_crop

        self.instance_prompt = instance_prompt
        self.custom_instance_prompts = None
@@ -1058,7 +1057,7 @@ class DreamBoothDataset(Dataset):
        if interpolation is None:
            raise ValueError(f"Unsupported interpolation mode {interpolation=}.")
        train_resize = transforms.Resize(size, interpolation=interpolation)
-        train_crop = transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size)
+        train_crop = transforms.CenterCrop(size) if args.center_crop else transforms.RandomCrop(size)
        train_flip = transforms.RandomHorizontalFlip(p=1.0)
        train_transforms = transforms.Compose(
            [
@@ -1075,11 +1074,11 @@ class DreamBoothDataset(Dataset):
                # flip
                image = train_flip(image)
            if args.center_crop:
-                y1 = max(0, int(round((image.height - args.resolution) / 2.0)))
-                x1 = max(0, int(round((image.width - args.resolution) / 2.0)))
+                y1 = max(0, int(round((image.height - self.size) / 2.0)))
+                x1 = max(0, int(round((image.width - self.size) / 2.0)))
                image = train_crop(image)
            else:
-                y1, x1, h, w = train_crop.get_params(image, (args.resolution, args.resolution))
+                y1, x1, h, w = train_crop.get_params(image, (self.size, self.size))
                image = crop(image, y1, x1, h, w)
            image = train_transforms(image)
            self.pixel_values.append(image)
@@ -1102,7 +1101,7 @@ class DreamBoothDataset(Dataset):
        self.image_transforms = transforms.Compose(
            [
                transforms.Resize(size, interpolation=interpolation),
-                transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size),
+                transforms.CenterCrop(size) if args.center_crop else transforms.RandomCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
            ]
@@ -1827,6 +1826,7 @@ def main(args):

    # Dataset and DataLoaders creation:
    train_dataset = DreamBoothDataset(
+        args=args,
        instance_data_root=args.instance_data_dir,
        instance_prompt=args.instance_prompt,
        train_text_encoder_ti=args.train_text_encoder_ti,
@@ -1836,7 +1836,6 @@ def main(args):
        class_num=args.num_class_images,
        size=args.resolution,
        repeats=args.repeats,
-        center_crop=args.center_crop,
    )

    train_dataloader = torch.utils.data.DataLoader(
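Reviewer note: the hunks above make the crop offsets depend on `self.size` instead of `args.resolution`. The arithmetic can be sketched without torchvision as below. `crop_top_left` is a hypothetical helper, not part of the script, and `random.Random` stands in for the sampling done by `RandomCrop.get_params`.

```python
import random


def crop_top_left(height, width, size, center_crop, rng=None):
    """Return (y1, x1), the top-left corner of a size x size crop window."""
    if center_crop:
        # Same arithmetic as the diff: centre the window, clamp at 0.
        y1 = max(0, int(round((height - size) / 2.0)))
        x1 = max(0, int(round((width - size) / 2.0)))
    else:
        # Assumption: uniform sampling over all valid offsets (inclusive).
        rng = rng or random.Random()
        y1 = rng.randint(0, max(0, height - size))
        x1 = rng.randint(0, max(0, width - size))
    return y1, x1


print(crop_top_left(1200, 1024, 1024, center_crop=True))  # (88, 0)
```

Because the offsets and the crop transform now both read the same `size`, they cannot drift apart when the dataset is constructed with a size other than `args.resolution`.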
@@ -366,6 +366,8 @@ else:
        [
            "StableDiffusionXLAutoBlocks",
            "StableDiffusionXLModularPipeline",
+            "WanAutoBlocks",
+            "WanModularPipeline",
        ]
    )
    _import_structure["pipelines"].extend(
@@ -999,6 +1001,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
        from .modular_pipelines import (
            StableDiffusionXLAutoBlocks,
            StableDiffusionXLModularPipeline,
+            WanAutoBlocks,
+            WanModularPipeline,
        )
        from .pipelines import (
            AllegroPipeline,
@@ -107,6 +107,7 @@ class TransformerBlockRegistry:
 def _register_attention_processors_metadata():
    from ..models.attention_processor import AttnProcessor2_0
    from ..models.transformers.transformer_cogview4 import CogView4AttnProcessor
+    from ..models.transformers.transformer_wan import WanAttnProcessor2_0

    # AttnProcessor2_0
    AttentionProcessorRegistry.register(
@@ -124,6 +125,14 @@ def _register_attention_processors_metadata():
        ),
    )

+    # WanAttnProcessor2_0
+    AttentionProcessorRegistry.register(
+        model_class=WanAttnProcessor2_0,
+        metadata=AttentionProcessorMetadata(
+            skip_processor_output_fn=_skip_proc_output_fn_Attention_WanAttnProcessor2_0,
+        ),
+    )
+

 def _register_transformer_blocks_metadata():
    from ..models.attention import BasicTransformerBlock
@@ -261,4 +270,5 @@ def _skip_attention___ret___hidden_states___encoder_hidden_states(self, *args, *

 _skip_proc_output_fn_Attention_AttnProcessor2_0 = _skip_attention___ret___hidden_states
 _skip_proc_output_fn_Attention_CogView4AttnProcessor = _skip_attention___ret___hidden_states___encoder_hidden_states
+_skip_proc_output_fn_Attention_WanAttnProcessor2_0 = _skip_attention___ret___hidden_states
 # fmt: on
@@ -91,10 +91,19 @@ class AttentionScoreSkipFunctionMode(torch.overrides.TorchFunctionMode):
        if kwargs is None:
            kwargs = {}
        if func is torch.nn.functional.scaled_dot_product_attention:
+            query = kwargs.get("query", None)
+            key = kwargs.get("key", None)
            value = kwargs.get("value", None)
-            if value is None:
-                value = args[2]
-            return value
+            query = query if query is not None else args[0]
+            key = key if key is not None else args[1]
+            value = value if value is not None else args[2]
+            # If the Q sequence length does not match KV sequence length, methods like
+            # Perturbed Attention Guidance cannot be used (because the caller expects
+            # the same sequence length as Q, but if we return V here, it will not match).
+            # When Q.shape[2] != V.shape[2], PAG will essentially not be applied and
+            # the overall effect would be that of normal CFG with a scale of (guidance_scale + perturbed_guidance_scale).
+            if query.shape[2] == value.shape[2]:
+                return value
        return func(*args, **kwargs)


@@ -38,18 +38,29 @@ from ..utils import (
 from ..utils.constants import DIFFUSERS_ATTN_BACKEND, DIFFUSERS_ATTN_CHECKS


-logger = get_logger(__name__)  # pylint: disable=invalid-name
+_REQUIRED_FLASH_VERSION = "2.6.3"
+_REQUIRED_SAGE_VERSION = "2.1.1"
+_REQUIRED_FLEX_VERSION = "2.5.0"
+_REQUIRED_XLA_VERSION = "2.2"
+_REQUIRED_XFORMERS_VERSION = "0.0.29"
+
+_CAN_USE_FLASH_ATTN = is_flash_attn_available() and is_flash_attn_version(">=", _REQUIRED_FLASH_VERSION)
+_CAN_USE_FLASH_ATTN_3 = is_flash_attn_3_available()
+_CAN_USE_SAGE_ATTN = is_sageattention_available() and is_sageattention_version(">=", _REQUIRED_SAGE_VERSION)
+_CAN_USE_FLEX_ATTN = is_torch_version(">=", _REQUIRED_FLEX_VERSION)
+_CAN_USE_NPU_ATTN = is_torch_npu_available()
+_CAN_USE_XLA_ATTN = is_torch_xla_available() and is_torch_xla_version(">=", _REQUIRED_XLA_VERSION)
+_CAN_USE_XFORMERS_ATTN = is_xformers_available() and is_xformers_version(">=", _REQUIRED_XFORMERS_VERSION)


-if is_flash_attn_available() and is_flash_attn_version(">=", "2.6.3"):
+if _CAN_USE_FLASH_ATTN:
    from flash_attn import flash_attn_func, flash_attn_varlen_func
 else:
-    logger.warning("`flash-attn` is not available or the version is too old. Please install `flash-attn>=2.6.3`.")
    flash_attn_func = None
    flash_attn_varlen_func = None


-if is_flash_attn_3_available():
+if _CAN_USE_FLASH_ATTN_3:
    from flash_attn_interface import flash_attn_func as flash_attn_3_func
    from flash_attn_interface import flash_attn_varlen_func as flash_attn_3_varlen_func
 else:
@@ -57,7 +68,7 @@ else:
    flash_attn_3_varlen_func = None


-if is_sageattention_available() and is_sageattention_version(">=", "2.1.1"):
+if _CAN_USE_SAGE_ATTN:
    from sageattention import (
        sageattn,
        sageattn_qk_int8_pv_fp8_cuda,
@@ -67,9 +78,6 @@ if is_sageattention_available() and is_sageattention_version(">=", "2.1.1"):
        sageattn_varlen,
    )
 else:
-    logger.warning(
-        "`sageattention` is not available or the version is too old. Please install `sageattention>=2.1.1`."
-    )
    sageattn = None
    sageattn_qk_int8_pv_fp16_cuda = None
    sageattn_qk_int8_pv_fp16_triton = None
@@ -78,39 +86,39 @@ else:
    sageattn_varlen = None


-if is_torch_version(">=", "2.5.0"):
+if _CAN_USE_FLEX_ATTN:
    # We cannot import the flex_attention function from the package directly because it is expected (from the
    # pytorch documentation) that the user may compile it. If we import directly, we will not have access to the
    # compiled function.
    import torch.nn.attention.flex_attention as flex_attention


-if is_torch_npu_available():
+if _CAN_USE_NPU_ATTN:
    from torch_npu import npu_fusion_attention
 else:
    npu_fusion_attention = None


-if is_torch_xla_available() and is_torch_xla_version(">", "2.2"):
+if _CAN_USE_XLA_ATTN:
    from torch_xla.experimental.custom_kernel import flash_attention as xla_flash_attention
 else:
    xla_flash_attention = None


-if is_xformers_available() and is_xformers_version(">=", "0.0.29"):
+if _CAN_USE_XFORMERS_ATTN:
    import xformers.ops as xops
 else:
-    logger.warning("`xformers` is not available or the version is too old. Please install `xformers>=0.0.29`.")
    xops = None


+logger = get_logger(__name__)  # pylint: disable=invalid-name
+
 # TODO(aryan): Add support for the following:
 # - Sage Attention++
 # - block sparse, radial and other attention methods
 # - CP with sage attention, flex, xformers, other missing backends
 # - Add support for normal and CP training with backends that don't support it yet

-
 _SAGE_ATTENTION_PV_ACCUM_DTYPE = Literal["fp32", "fp32+fp32"]
 _SAGE_ATTENTION_QK_QUANT_GRAN = Literal["per_thread", "per_warp"]
 _SAGE_ATTENTION_QUANTIZATION_BACKEND = Literal["cuda", "triton"]
@@ -179,13 +187,16 @@ class _AttentionBackendRegistry:


@contextlib.contextmanager
-def attention_backend(backend: AttentionBackendName = AttentionBackendName.NATIVE):
+def attention_backend(backend: Union[str, AttentionBackendName] = AttentionBackendName.NATIVE):
    """
    Context manager to set the active attention backend.
    """
    if backend not in _AttentionBackendRegistry._backends:
        raise ValueError(f"Backend {backend} is not registered.")

+    backend = AttentionBackendName(backend)
+    _check_attention_backend_requirements(backend)
+
    old_backend = _AttentionBackendRegistry._active_backend
    _AttentionBackendRegistry._active_backend = backend

@@ -226,9 +237,10 @@ def dispatch_attention_fn(
        "dropout_p": dropout_p,
        "is_causal": is_causal,
        "scale": scale,
-        "enable_gqa": enable_gqa,
        **attention_kwargs,
    }
+    if is_torch_version(">=", "2.5.0"):
+        kwargs["enable_gqa"] = enable_gqa

    if _AttentionBackendRegistry._checks_enabled:
        removed_kwargs = set(kwargs) - set(_AttentionBackendRegistry._supported_arg_names[backend_name])
@@ -305,6 +317,57 @@ def _check_shape(
 # ===== Helper functions =====


+def _check_attention_backend_requirements(backend: AttentionBackendName) -> None:
+    if backend in [AttentionBackendName.FLASH, AttentionBackendName.FLASH_VARLEN]:
+        if not _CAN_USE_FLASH_ATTN:
+            raise RuntimeError(
+                f"Flash Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `flash-attn>={_REQUIRED_FLASH_VERSION}`."
+            )
+
+    elif backend in [AttentionBackendName._FLASH_3, AttentionBackendName._FLASH_VARLEN_3]:
+        if not _CAN_USE_FLASH_ATTN_3:
+            raise RuntimeError(
+                f"Flash Attention 3 backend '{backend.value}' is not usable because of missing package or the version is too old. Please build FA3 beta release from source."
+            )
+
+    elif backend in [
+        AttentionBackendName.SAGE,
+        AttentionBackendName.SAGE_VARLEN,
+        AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA,
+        AttentionBackendName._SAGE_QK_INT8_PV_FP8_CUDA_SM90,
+        AttentionBackendName._SAGE_QK_INT8_PV_FP16_CUDA,
+        AttentionBackendName._SAGE_QK_INT8_PV_FP16_TRITON,
+    ]:
+        if not _CAN_USE_SAGE_ATTN:
+            raise RuntimeError(
+                f"Sage Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `sageattention>={_REQUIRED_SAGE_VERSION}`."
+            )
+
+    elif backend == AttentionBackendName.FLEX:
+        if not _CAN_USE_FLEX_ATTN:
+            raise RuntimeError(
+                f"Flex Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `torch>={_REQUIRED_FLEX_VERSION}`."
+            )
+
+    elif backend == AttentionBackendName._NATIVE_NPU:
+        if not _CAN_USE_NPU_ATTN:
+            raise RuntimeError(
+                f"NPU Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `torch_npu`."
+            )
+
+    elif backend == AttentionBackendName._NATIVE_XLA:
+        if not _CAN_USE_XLA_ATTN:
+            raise RuntimeError(
+                f"XLA Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `torch_xla>={_REQUIRED_XLA_VERSION}`."
+            )
+
+    elif backend == AttentionBackendName.XFORMERS:
+        if not _CAN_USE_XFORMERS_ATTN:
+            raise RuntimeError(
+                f"Xformers Attention backend '{backend.value}' is not usable because of missing package or the version is too old. Please install `xformers>={_REQUIRED_XFORMERS_VERSION}`."
+            )
+
+
@functools.lru_cache(maxsize=128)
 def _prepare_for_flash_attn_or_sage_varlen_without_mask(
    batch_size: int,
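Reviewer note: the change above replaces import-time warnings with availability flags that are computed once and checked only when a backend is actually selected. A minimal sketch of that pattern follows. The `Backend` enum, the package name `fancy_attention_pkg_that_does_not_exist`, and both helper functions are hypothetical and are not the diffusers API.

```python
import importlib.util
from enum import Enum


class Backend(str, Enum):  # hypothetical stand-in for AttentionBackendName
    NATIVE = "native"
    FANCY = "fancy"


# Computed once at import time; no warning is logged if the package is absent.
# The package name is made up so that the flag is False in this sketch.
_CAN_USE_FANCY = importlib.util.find_spec("fancy_attention_pkg_that_does_not_exist") is not None


def check_backend_requirements(backend):
    """Raise only when the caller asks for a backend that cannot be used."""
    if backend == Backend.FANCY and not _CAN_USE_FANCY:
        raise RuntimeError(f"Backend '{backend.value}' needs an optional package that is not installed.")


def select_backend(name):
    backend = Backend(name)  # raises ValueError for unknown names
    check_backend_requirements(backend)
    return backend


print(select_backend("native"))
```

Users who never request an optional backend see no noise at import, while users who do get an error at the point of use.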
@@ -622,19 +622,21 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
                attention as backend.
        """
        from .attention import AttentionModuleMixin
-        from .attention_dispatch import AttentionBackendName
+        from .attention_dispatch import AttentionBackendName, _check_attention_backend_requirements

        # TODO: the following will not be required when everything is refactored to AttentionModuleMixin
        from .attention_processor import Attention, MochiAttention

+        logger.warning("Attention backends are an experimental feature and the API may be subject to change.")
+
        backend = backend.lower()
        available_backends = {x.value for x in AttentionBackendName.__members__.values()}
        if backend not in available_backends:
            raise ValueError(f"`{backend=}` must be one of the following: " + ", ".join(available_backends))
-
        backend = AttentionBackendName(backend)
-        attention_classes = (Attention, MochiAttention, AttentionModuleMixin)
+        _check_attention_backend_requirements(backend)

+        attention_classes = (Attention, MochiAttention, AttentionModuleMixin)
        for module in self.modules():
            if not isinstance(module, attention_classes):
                continue
@@ -651,6 +653,8 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
        from .attention import AttentionModuleMixin
        from .attention_processor import Attention, MochiAttention

+        logger.warning("Attention backends are an experimental feature and the API may be subject to change.")
+
        attention_classes = (Attention, MochiAttention, AttentionModuleMixin)
        for module in self.modules():
            if not isinstance(module, attention_classes):
@@ -165,7 +165,7 @@ class UNet2DConditionModel(
    """

    _supports_gradient_checkpointing = True
-    _no_split_modules = ["BasicTransformerBlock", "ResnetBlock2D", "CrossAttnUpBlock2D"]
+    _no_split_modules = ["BasicTransformerBlock", "ResnetBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"]
    _skip_layerwise_casting_patterns = ["norm"]
    _repeated_blocks = ["BasicTransformerBlock"]

@@ -40,6 +40,7 @@ else:
        "InsertableDict",
    ]
    _import_structure["stable_diffusion_xl"] = ["StableDiffusionXLAutoBlocks", "StableDiffusionXLModularPipeline"]
+    _import_structure["wan"] = ["WanAutoBlocks", "WanModularPipeline"]
    _import_structure["components_manager"] = ["ComponentsManager"]

 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
@@ -71,6 +72,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
            StableDiffusionXLAutoBlocks,
            StableDiffusionXLModularPipeline,
        )
+        from .wan import WanAutoBlocks, WanModularPipeline
 else:
    import sys

@@ -386,6 +386,7 @@ class ComponentsManager:
                 id(component) is Python's built-in unique identifier for the object
        """
        component_id = f"{name}_{id(component)}"
+        is_new_component = True

        # check for duplicated components
        for comp_id, comp in self.components.items():
@@ -394,6 +395,7 @@ class ComponentsManager:
                if comp_name == name:
                    logger.warning(f"ComponentsManager: component '{name}' already exists as '{comp_id}'")
                    component_id = comp_id
+                    is_new_component = False
                    break
                else:
                    logger.warning(
@@ -426,7 +428,9 @@ class ComponentsManager:
                    logger.warning(
                        f"ComponentsManager: removing existing {name} from collection '{collection}': {comp_id}"
                    )
-                    self.remove(comp_id)
+                    # Remove the existing component from this collection only; if no other collection references it, it is also removed from the ComponentsManager.
+                    self.remove_from_collection(comp_id, collection)
+
                self.collections[collection].add(component_id)
                logger.info(
                    f"ComponentsManager: added component '{name}' in collection '{collection}': {component_id}"
@@ -434,11 +438,29 @@ class ComponentsManager:
        else:
            logger.info(f"ComponentsManager: added component '{name}' as '{component_id}'")

-        if self._auto_offload_enabled:
+        if self._auto_offload_enabled and is_new_component:
            self.enable_auto_cpu_offload(self._auto_offload_device)

        return component_id

+    def remove_from_collection(self, component_id: str, collection: str):
+        """
+        Remove a component from a collection.
+        """
+        if collection not in self.collections:
+            logger.warning(f"Collection '{collection}' not found in ComponentsManager")
+            return
+        if component_id not in self.collections[collection]:
+            logger.warning(f"Component '{component_id}' not found in collection '{collection}'")
+            return
+        # remove from the collection
+        self.collections[collection].remove(component_id)
+        # check if this component is in any other collection
+        comp_colls = [coll for coll, comps in self.collections.items() if component_id in comps]
+        if not comp_colls:  # only if no other collection contains this component, remove it
+            logger.warning(f"ComponentsManager: removing component '{component_id}' from ComponentsManager")
+            self.remove(component_id)
+
    def remove(self, component_id: str = None):
        """
        Remove a component from the ComponentsManager.
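Reviewer note: `remove_from_collection` drops a component from the manager only when no collection references it any more. The membership logic can be sketched as below. `TinyManager` is a hypothetical stand-in that omits logging, offloading, and everything else the real `ComponentsManager` does.

```python
class TinyManager:
    """Hypothetical, stripped-down model of collection membership."""

    def __init__(self):
        self.components = {}   # component_id -> object
        self.collections = {}  # collection name -> set of component_ids

    def add(self, component_id, component, collection):
        self.components[component_id] = component
        self.collections.setdefault(collection, set()).add(component_id)

    def remove_from_collection(self, component_id, collection):
        members = self.collections.get(collection)
        if members is None or component_id not in members:
            return  # nothing to do; the real method logs a warning here
        members.remove(component_id)
        # Drop the component entirely only if no collection still holds it.
        if not any(component_id in ids for ids in self.collections.values()):
            self.components.pop(component_id, None)


manager = TinyManager()
manager.add("unet_1", object(), "a")
manager.add("unet_1", manager.components["unet_1"], "b")
manager.remove_from_collection("unet_1", "a")
print("unet_1" in manager.components)  # True: collection "b" still holds it
```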
@@ -45,6 +45,8 @@ from .modular_pipeline_utils import (
    OutputParam,
    format_components,
    format_configs,
+    format_inputs_short,
+    format_intermediates_short,
    make_doc_string,
 )

@@ -58,12 +60,14 @@ logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
 MODULAR_PIPELINE_MAPPING = OrderedDict(
    [
        ("stable-diffusion-xl", "StableDiffusionXLModularPipeline"),
+        ("wan", "WanModularPipeline"),
    ]
 )

 MODULAR_PIPELINE_BLOCKS_MAPPING = OrderedDict(
    [
        ("StableDiffusionXLModularPipeline", "StableDiffusionXLAutoBlocks"),
+        ("WanModularPipeline", "WanAutoBlocks"),
    ]
 )

@@ -74,59 +78,139 @@ class PipelineState:
    [`PipelineState`] stores the state of a pipeline. It is used to pass data between pipeline blocks.
    """

-    values: Dict[str, Any] = field(default_factory=dict)
-    kwargs_mapping: Dict[str, List[str]] = field(default_factory=dict)
+    inputs: Dict[str, Any] = field(default_factory=dict)
+    intermediates: Dict[str, Any] = field(default_factory=dict)
+    input_kwargs: Dict[str, List[str]] = field(default_factory=dict)
+    intermediate_kwargs: Dict[str, List[str]] = field(default_factory=dict)

-    def set(self, key: str, value: Any, kwargs_type: str = None):
+    def set_input(self, key: str, value: Any, kwargs_type: str = None):
        """
-        Add a value to the pipeline state.
+        Add an input to the immutable pipeline state, i.e., pipeline_state.inputs.
+
+        The kwargs_type parameter allows you to associate inputs with specific input types. For example, if you call
+        set_input("prompt_embeds", ..., kwargs_type="guider_kwargs"), this input will be automatically fetched when a
+        pipeline block has "guider_kwargs" in its expected_inputs list.

        Args:
-            key (str): The key for the value
-            value (Any): The value to store
-            kwargs_type (str): The kwargs_type with which the value is associated
+            key (str): The key for the input
+            value (Any): The input value
+            kwargs_type (str): The kwargs_type with which the input is associated
        """
-        self.values[key] = value
-
+        self.inputs[key] = value
        if kwargs_type is not None:
-            if kwargs_type not in self.kwargs_mapping:
-                self.kwargs_mapping[kwargs_type] = [key]
+            if kwargs_type not in self.input_kwargs:
+                self.input_kwargs[kwargs_type] = [key]
            else:
-                self.kwargs_mapping[kwargs_type].append(key)
+                self.input_kwargs[kwargs_type].append(key)

-    def get(self, keys: Union[str, List[str]], default: Any = None) -> Union[Any, Dict[str, Any]]:
+    def set_intermediate(self, key: str, value: Any, kwargs_type: str = None):
        """
-        Get one or multiple values from the pipeline state.
+        Add an intermediate value to the mutable pipeline state, i.e., pipeline_state.intermediates.
+
+        The kwargs_type parameter allows you to associate intermediate values with specific input types. For example,
+        if you call set_intermediate("latents", ..., kwargs_type="latents_kwargs"), this intermediate value will be
+        automatically fetched when a pipeline block has "latents_kwargs" in its expected_intermediate_inputs list.

        Args:
-            keys (Union[str, List[str]]): Key or list of keys for the values
-            default (Any): The default value to return if not found
+            key (str): The key for the intermediate value
+            value (Any): The intermediate value
+            kwargs_type (str): The kwargs_type with which the intermediate value is associated
+        """
+        self.intermediates[key] = value
+        if kwargs_type is not None:
+            if kwargs_type not in self.intermediate_kwargs:
+                self.intermediate_kwargs[kwargs_type] = [key]
+            else:
+                self.intermediate_kwargs[kwargs_type].append(key)
+
+    def get_input(self, key: str, default: Any = None) -> Any:
+        """
+        Get an input from the pipeline state.
+
+        Args:
+            key (str): The key for the input
+            default (Any): The default value to return if the input is not found

        Returns:
-            Union[Any, Dict[str, Any]]: Single value if keys is str, dictionary of values if keys is list
+            Any: The input value
        """
-        if isinstance(keys, str):
-            return self.values.get(keys, default)
-        return {key: self.values.get(key, default) for key in keys}
+        value = self.inputs.get(key, default)
+        if value is not None:
+            return deepcopy(value)

-    def get_by_kwargs(self, kwargs_type: str) -> Dict[str, Any]:
+    def get_inputs(self, keys: List[str], default: Any = None) -> Dict[str, Any]:
        """
-        Get all values with matching kwargs_type.
+        Get multiple inputs from the pipeline state.
+
+        Args:
+            keys (List[str]): The keys for the inputs
+            default (Any): The default value to return if the input is not found
+
+        Returns:
+            Dict[str, Any]: Dictionary of inputs with matching keys
+        """
+        return {key: self.inputs.get(key, default) for key in keys}
+
+    def get_inputs_kwargs(self, kwargs_type: str) -> Dict[str, Any]:
+        """
+        Get all inputs with matching kwargs_type.

        Args:
            kwargs_type (str): The kwargs_type to filter by

        Returns:
-            Dict[str, Any]: Dictionary of values with matching kwargs_type
+            Dict[str, Any]: Dictionary of inputs with matching kwargs_type
        """
-        value_names = self.kwargs_mapping.get(kwargs_type, [])
-        return self.get(value_names)
+        input_names = self.input_kwargs.get(kwargs_type, [])
+        return self.get_inputs(input_names)
+
+    def get_intermediate_kwargs(self, kwargs_type: str) -> Dict[str, Any]:
+        """
+        Get all intermediates with matching kwargs_type.
+
+        Args:
+            kwargs_type (str): The kwargs_type to filter by
+
+        Returns:
+            Dict[str, Any]: Dictionary of intermediates with matching kwargs_type
+        """
+        intermediate_names = self.intermediate_kwargs.get(kwargs_type, [])
+        return self.get_intermediates(intermediate_names)
+
+    def get_intermediate(self, key: str, default: Any = None) -> Any:
+        """
+        Get an intermediate value from the pipeline state.
+
+        Args:
+            key (str): The key for the intermediate value
+            default (Any): The default value to return if the intermediate value is not found
+
+        Returns:
+            Any: The intermediate value
+        """
+        return self.intermediates.get(key, default)
+
+    def get_intermediates(self, keys: List[str], default: Any = None) -> Dict[str, Any]:
+        """
+        Get multiple intermediate values from the pipeline state.
+
+        Args:
+            keys (List[str]): The keys for the intermediate values
+            default (Any): The default value to return if the intermediate value is not found
+
+        Returns:
+            Dict[str, Any]: Dictionary of intermediate values with matching keys
+        """
+        return {key: self.intermediates.get(key, default) for key in keys}

    def to_dict(self) -> Dict[str, Any]:
        """
        Convert PipelineState to a dictionary.
+
+        Returns:
+            Dict[str, Any]: Dictionary containing all attributes of the PipelineState
        """
-        return {**self.__dict__}
+        return {**self.__dict__, "inputs": self.inputs, "intermediates": self.intermediates}

    def __repr__(self):
        def format_value(v):
@@ -137,10 +221,21 @@ class PipelineState:
            else:
                return repr(v)

-        values_str = "\n".join(f"    {k}: {format_value(v)}" for k, v in self.values.items())
-        kwargs_mapping_str = "\n".join(f"    {k}: {v}" for k, v in self.kwargs_mapping.items())
+        inputs = "\n".join(f"    {k}: {format_value(v)}" for k, v in self.inputs.items())
+        intermediates = "\n".join(f"    {k}: {format_value(v)}" for k, v in self.intermediates.items())

-        return f"PipelineState(\n  values={{\n{values_str}\n  }},\n  kwargs_mapping={{\n{kwargs_mapping_str}\n  }}\n)"
+        # Format input_kwargs and intermediate_kwargs
+        input_kwargs_str = "\n".join(f"    {k}: {v}" for k, v in self.input_kwargs.items())
+        intermediate_kwargs_str = "\n".join(f"    {k}: {v}" for k, v in self.intermediate_kwargs.items())
+
+        return (
+            f"PipelineState(\n"
+            f"  inputs={{\n{inputs}\n  }},\n"
+            f"  intermediates={{\n{intermediates}\n  }},\n"
+            f"  input_kwargs={{\n{input_kwargs_str}\n  }},\n"
+            f"  intermediate_kwargs={{\n{intermediate_kwargs_str}\n  }}\n"
+            f")"
+        )


@dataclass
@@ -232,6 +327,9 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
    config_name = "modular_config.json"
    model_name = None

+    def __init__(self):
+        self.sub_blocks = InsertableDict()
+
    @classmethod
    def _get_signature_keys(cls, obj):
        parameters = inspect.signature(obj.__init__).parameters
@@ -241,14 +339,6 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):

        return expected_modules, optional_parameters

-    def __init__(self):
-        self.sub_blocks = InsertableDict()
-
-    @property
-    def description(self) -> str:
-        """Description of the block. Must be implemented by subclasses."""
-        return ""
-
    @property
    def expected_components(self) -> List[ComponentSpec]:
        return []
@@ -257,23 +347,11 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
    def expected_configs(self) -> List[ConfigSpec]:
        return []

-    @property
-    def inputs(self) -> List[InputParam]:
-        """List of input parameters. Must be implemented by subclasses."""
-        return []
-
    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        """List of intermediate output parameters. Must be implemented by subclasses."""
        return []

-    def _get_outputs(self):
-        return self.intermediate_outputs
-
-    @property
-    def outputs(self) -> List[OutputParam]:
-        return self._get_outputs()
-
    @classmethod
    def from_pretrained(
        cls,
@@ -358,12 +436,12 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
    def get_block_state(self, state: PipelineState) -> dict:
        """Get all inputs and intermediates in one dictionary"""
        data = {}
-        state_inputs = self.inputs
+        state_inputs = self.inputs + self.intermediate_inputs

        # Check inputs
        for input_param in state_inputs:
            if input_param.name:
-                value = state.get(input_param.name)
+                value = state.get_input(input_param.name) or state.get_intermediate(input_param.name)
                if input_param.required and value is None:
                    raise ValueError(f"Required input '{input_param.name}' is missing")
                elif value is not None or (value is None and input_param.name not in data):
@@ -373,7 +451,9 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
                # if kwargs_type is provided, get all inputs with matching kwargs_type
                if input_param.kwargs_type not in data:
                    data[input_param.kwargs_type] = {}
-                inputs_kwargs = state.get_by_kwargs(input_param.kwargs_type)
+                inputs_kwargs = state.get_inputs_kwargs(input_param.kwargs_type) or state.get_intermediate_kwargs(
+                    input_param.kwargs_type
+                )
                if inputs_kwargs:
                    for k, v in inputs_kwargs.items():
                        if v is not None:
@@ -387,30 +467,25 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
            if not hasattr(block_state, output_param.name):
                raise ValueError(f"Intermediate output '{output_param.name}' is missing in block state")
            param = getattr(block_state, output_param.name)
-            state.set(output_param.name, param, output_param.kwargs_type)
+            state.set_intermediate(output_param.name, param, output_param.kwargs_type)

-        for input_param in self.inputs:
+        for input_param in self.intermediate_inputs:
            if input_param.name and hasattr(block_state, input_param.name):
                param = getattr(block_state, input_param.name)
                # Only add if the value is different from what's in the state
-                current_value = state.get(input_param.name)
+                current_value = state.get_intermediate(input_param.name)
                if current_value is not param:  # Using identity comparison to check if object was modified
-                    state.set(input_param.name, param, input_param.kwargs_type)
-
+                    state.set_intermediate(input_param.name, param, input_param.kwargs_type)
            elif input_param.kwargs_type:
                # if it is a kwargs type, e.g. "guider_input_fields", it is likely to be a list of parameters
                # we need to first find out which inputs are and loop through them.
-                intermediate_kwargs = state.get_by_kwargs(input_param.kwargs_type)
+                intermediate_kwargs = state.get_intermediate_kwargs(input_param.kwargs_type)
                for param_name, current_value in intermediate_kwargs.items():
-                    if param_name is None:
-                        continue
-
                    if not hasattr(block_state, param_name):
                        continue
-
                    param = getattr(block_state, param_name)
                    if current_value is not param:  # Using identity comparison to check if object was modified
-                        state.set(param_name, param, input_param.kwargs_type)
+                        state.set_intermediate(param_name, param, input_param.kwargs_type)

    @staticmethod
    def combine_inputs(*named_input_lists: List[Tuple[str, List[InputParam]]]) -> List[InputParam]:
@@ -478,17 +553,199 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):

        return list(combined_dict.values())

-    @property
-    def input_names(self) -> List[str]:
-        return [input_param.name for input_param in self.inputs]
+
+class PipelineBlock(ModularPipelineBlocks):
+    """
+    A Pipeline Block is the basic building block of a Modular Pipeline.
+
+    This class inherits from [`ModularPipelineBlocks`]. Check the superclass documentation for the generic methods the
+    library implements for all the pipeline blocks (such as loading or saving).
+
+    <Tip warning={true}>
+
+        This is an experimental feature and is likely to change in the future.
+
+    </Tip>
+
+    Args:
+        description (str, optional): A description of the block, defaults to None. Define as a property in subclasses.
+        expected_components (List[ComponentSpec], optional):
+            A list of components that are expected to be used in the block, defaults to []. To override, define as a
+            property in subclasses.
+        expected_configs (List[ConfigSpec], optional):
+            A list of configs that are expected to be used in the block, defaults to []. To override, define as a
+            property in subclasses.
+        inputs (List[InputParam], optional):
+            A list of inputs that are expected to be used in the block, defaults to []. To override, define as a
+            property in subclasses.
+        intermediate_inputs (List[InputParam], optional):
+            A list of intermediate inputs that are expected to be used in the block, defaults to []. To override,
+            define as a property in subclasses.
+        intermediate_outputs (List[OutputParam], optional):
+            A list of intermediate outputs that are expected to be used in the block, defaults to []. To override,
+            define as a property in subclasses.
+        outputs (List[OutputParam], optional):
+            A list of outputs that are expected to be used in the block, defaults to []. To override, define as a
+            property in subclasses.
+        required_inputs (List[str], optional):
+            A list of required inputs that are expected to be used in the block, defaults to []. To override, define as
+            a property in subclasses.
+        required_intermediate_inputs (List[str], optional):
+            A list of required intermediate inputs that are expected to be used in the block, defaults to []. To
+            override, define as a property in subclasses.
+        required_intermediate_outputs (List[str], optional):
+            A list of required intermediate outputs that are expected to be used in the block, defaults to []. To
+            override, define as a property in subclasses.
+    """
+
+    model_name = None
+
+    def __init__(self):
+        self.sub_blocks = InsertableDict()

    @property
-    def intermediate_output_names(self) -> List[str]:
-        return [output_param.name for output_param in self.intermediate_outputs]
+    def description(self) -> str:
+        """Description of the block. Must be implemented by subclasses."""
+        # raise NotImplementedError("description method must be implemented in subclasses")
+        return "TODO: add a description"

    @property
-    def output_names(self) -> List[str]:
-        return [output_param.name for output_param in self.outputs]
+    def expected_components(self) -> List[ComponentSpec]:
+        return []
+
+    @property
+    def expected_configs(self) -> List[ConfigSpec]:
+        return []
+
+    @property
+    def inputs(self) -> List[InputParam]:
+        """List of input parameters. Must be implemented by subclasses."""
+        return []
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        """List of intermediate input parameters. Must be implemented by subclasses."""
+        return []
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        """List of intermediate output parameters. Must be implemented by subclasses."""
+        return []
+
+    def _get_outputs(self):
+        return self.intermediate_outputs
+
+    # YiYi TODO: is it too easy for user to unintentionally override these properties?
+    # Adding outputs attributes here for consistency between PipelineBlock/AutoPipelineBlocks/SequentialPipelineBlocks
+    @property
+    def outputs(self) -> List[OutputParam]:
+        return self._get_outputs()
+
+    def _get_required_inputs(self):
+        input_names = []
+        for input_param in self.inputs:
+            if input_param.required:
+                input_names.append(input_param.name)
+        return input_names
+
+    @property
+    def required_inputs(self) -> List[str]:
+        return self._get_required_inputs()
+
+    def _get_required_intermediate_inputs(self):
+        input_names = []
+        for input_param in self.intermediate_inputs:
+            if input_param.required:
+                input_names.append(input_param.name)
+        return input_names
+
+    # YiYi TODO: maybe we do not need this, it is only used in docstring,
+    # intermediate_inputs is by default required, unless you manually handle it inside the block
+    @property
+    def required_intermediate_inputs(self) -> List[str]:
+        return self._get_required_intermediate_inputs()
+
+    def __call__(self, pipeline, state: PipelineState) -> PipelineState:
+        raise NotImplementedError("__call__ method must be implemented in subclasses")
+
+    def __repr__(self):
+        class_name = self.__class__.__name__
+        base_class = self.__class__.__bases__[0].__name__
+
+        # Format description with proper indentation
+        desc_lines = self.description.split("\n")
+        desc = []
+        # First line with "Description:" label
+        desc.append(f"  Description: {desc_lines[0]}")
+        # Subsequent lines with proper indentation
+        if len(desc_lines) > 1:
+            desc.extend(f"      {line}" for line in desc_lines[1:])
+        desc = "\n".join(desc) + "\n"
+
+        # Components section - use format_components with add_empty_lines=False
+        expected_components = getattr(self, "expected_components", [])
+        components_str = format_components(expected_components, indent_level=2, add_empty_lines=False)
+        components = "  " + components_str.replace("\n", "\n  ")
+
+        # Configs section - use format_configs with add_empty_lines=False
+        expected_configs = getattr(self, "expected_configs", [])
+        configs_str = format_configs(expected_configs, indent_level=2, add_empty_lines=False)
+        configs = "  " + configs_str.replace("\n", "\n  ")
+
+        # Inputs section
+        inputs_str = format_inputs_short(self.inputs)
+        inputs = "Inputs:\n    " + inputs_str
+
+        # Intermediates section
+        intermediates_str = format_intermediates_short(
+            self.intermediate_inputs, self.required_intermediate_inputs, self.intermediate_outputs
+        )
+        intermediates = f"Intermediates:\n{intermediates_str}"
+
+        return f"{class_name}(\n  Class: {base_class}\n{desc}{components}\n{configs}\n  {inputs}\n  {intermediates}\n)"
+
+    @property
+    def doc(self):
+        return make_doc_string(
+            self.inputs,
+            self.intermediate_inputs,
+            self.outputs,
+            self.description,
+            class_name=self.__class__.__name__,
+            expected_components=self.expected_components,
+            expected_configs=self.expected_configs,
+        )
+
+    def set_block_state(self, state: PipelineState, block_state: BlockState):
+        for output_param in self.intermediate_outputs:
+            if not hasattr(block_state, output_param.name):
+                raise ValueError(f"Intermediate output '{output_param.name}' is missing in block state")
+            param = getattr(block_state, output_param.name)
+            state.set_intermediate(output_param.name, param, output_param.kwargs_type)
+
+        for input_param in self.intermediate_inputs:
+            if hasattr(block_state, input_param.name):
+                param = getattr(block_state, input_param.name)
+                # Only add if the value is different from what's in the state
+                current_value = state.get_intermediate(input_param.name)
+                if current_value is not param:  # Using identity comparison to check if object was modified
+                    state.set_intermediate(input_param.name, param, input_param.kwargs_type)
+
+        for input_param in self.intermediate_inputs:
+            if input_param.name and hasattr(block_state, input_param.name):
+                param = getattr(block_state, input_param.name)
+                # Only add if the value is different from what's in the state
+                current_value = state.get_intermediate(input_param.name)
+                if current_value is not param:  # Using identity comparison to check if object was modified
+                    state.set_intermediate(input_param.name, param, input_param.kwargs_type)
+            elif input_param.kwargs_type:
+                # if it is a kwargs type, e.g. "guider_input_fields", it likely covers several parameters, so
+                # first look up which intermediates belong to it, then loop through them.
+                intermediate_kwargs = state.get_intermediate_kwargs(input_param.kwargs_type)
+                for param_name, current_value in intermediate_kwargs.items():
+                    param = getattr(block_state, param_name, current_value)  # keep the stored value if unset
+                    if current_value is not param:  # Using identity comparison to check if object was modified
+                        state.set_intermediate(param_name, param, input_param.kwargs_type)


 class AutoPipelineBlocks(ModularPipelineBlocks):
@@ -579,6 +836,22 @@ class AutoPipelineBlocks(ModularPipelineBlocks):

        return list(required_by_all)

+    # YiYi TODO: maybe we do not need this, it is only used in docstring,
+    # intermediate_inputs is by default required, unless you manually handle it inside the block
+    @property
+    def required_intermediate_inputs(self) -> List[str]:
+        if None not in self.block_trigger_inputs:
+            return []
+        first_block = next(iter(self.sub_blocks.values()))
+        required_by_all = set(getattr(first_block, "required_intermediate_inputs", set()))
+
+        # Intersect with required inputs from all other blocks
+        for block in list(self.sub_blocks.values())[1:]:
+            block_required = set(getattr(block, "required_intermediate_inputs", set()))
+            required_by_all.intersection_update(block_required)
+
+        return list(required_by_all)
+
    # YiYi TODO: add test for this
    @property
    def inputs(self) -> List[Tuple[str, Any]]:
@@ -592,6 +865,18 @@ class AutoPipelineBlocks(ModularPipelineBlocks):
                input_param.required = False
        return combined_inputs

+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        named_inputs = [(name, block.intermediate_inputs) for name, block in self.sub_blocks.items()]
+        combined_inputs = self.combine_inputs(*named_inputs)
+        # mark Required inputs only if that input is required by all the blocks
+        for input_param in combined_inputs:
+            if input_param.name in self.required_intermediate_inputs:
+                input_param.required = True
+            else:
+                input_param.required = False
+        return combined_inputs
+
    @property
    def intermediate_outputs(self) -> List[str]:
        named_outputs = [(name, block.intermediate_outputs) for name, block in self.sub_blocks.items()]
@@ -610,10 +895,10 @@ class AutoPipelineBlocks(ModularPipelineBlocks):

        block = self.trigger_to_block_map.get(None)
        for input_name in self.block_trigger_inputs:
-            if input_name is not None and state.get(input_name) is not None:
+            if input_name is not None and state.get_input(input_name) is not None:
                block = self.trigger_to_block_map[input_name]
                break
-            elif input_name is not None and state.get(input_name) is not None:
+            elif input_name is not None and state.get_intermediate(input_name) is not None:
                block = self.trigger_to_block_map[input_name]
                break

@@ -832,34 +1117,6 @@ class SequentialPipelineBlocks(ModularPipelineBlocks):
            sub_blocks[block_name] = block_cls()
        self.sub_blocks = sub_blocks

-    def _get_inputs(self):
-        inputs = []
-        outputs = set()
-
-        # Go through all blocks in order
-        for block in self.sub_blocks.values():
-            # Add inputs that aren't in outputs yet
-            for inp in block.inputs:
-                if inp.name not in outputs and inp.name not in {input.name for input in inputs}:
-                    inputs.append(inp)
-
-            # Only add outputs if the block cannot be skipped
-            should_add_outputs = True
-            if hasattr(block, "block_trigger_inputs") and None not in block.block_trigger_inputs:
-                should_add_outputs = False
-
-            if should_add_outputs:
-                # Add this block's outputs
-                block_intermediate_outputs = [out.name for out in block.intermediate_outputs]
-                outputs.update(block_intermediate_outputs)
-
-        return inputs
-
-    # YiYi TODO: add test for this
-    @property
-    def inputs(self) -> List[Tuple[str, Any]]:
-        return self._get_inputs()
-
    @property
    def required_inputs(self) -> List[str]:
        # Get the first block from the dictionary
@@ -873,11 +1130,65 @@ class SequentialPipelineBlocks(ModularPipelineBlocks):

        return list(required_by_any)

+    # YiYi TODO: maybe we do not need this, it is only used in docstring,
+    # intermediate_inputs is by default required, unless you manually handle it inside the block
+    @property
+    def required_intermediate_inputs(self) -> List[str]:
+        required_intermediate_inputs = []
+        for input_param in self.intermediate_inputs:
+            if input_param.required:
+                required_intermediate_inputs.append(input_param.name)
+        return required_intermediate_inputs
+
+    # YiYi TODO: add test for this
+    @property
+    def inputs(self) -> List[InputParam]:
+        return self.get_inputs()
+
+    def get_inputs(self):
+        named_inputs = [(name, block.inputs) for name, block in self.sub_blocks.items()]
+        combined_inputs = self.combine_inputs(*named_inputs)
+        # mark an input as required if it is required by any of the blocks
+        for input_param in combined_inputs:
+            if input_param.name in self.required_inputs:
+                input_param.required = True
+            else:
+                input_param.required = False
+        return combined_inputs
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return self.get_intermediate_inputs()
+
+    def get_intermediate_inputs(self):
+        inputs = []
+        outputs = set()
+        added_inputs = set()
+
+        # Go through all blocks in order
+        for block in self.sub_blocks.values():
+            # Add inputs that aren't in outputs yet
+            for inp in block.intermediate_inputs:
+                if inp.name not in outputs and inp.name not in added_inputs:
+                    inputs.append(inp)
+                    added_inputs.add(inp.name)
+
+            # Only add outputs if the block cannot be skipped
+            should_add_outputs = True
+            if hasattr(block, "block_trigger_inputs") and None not in block.block_trigger_inputs:
+                should_add_outputs = False
+
+            if should_add_outputs:
+                # Add this block's outputs
+                block_intermediate_outputs = [out.name for out in block.intermediate_outputs]
+                outputs.update(block_intermediate_outputs)
+        return inputs
+
    @property
    def intermediate_outputs(self) -> List[str]:
        named_outputs = []
        for name, block in self.sub_blocks.items():
-            inp_names = {inp.name for inp in block.inputs}
+            inp_names = {inp.name for inp in block.intermediate_inputs}
            # so we only need to list new variables as intermediate_outputs, but if user wants to list these they modified it's still fine (a.k.a we don't enforce)
            # filter out them here so they do not end up as intermediate_outputs
            if name not in inp_names:
@@ -1095,6 +1406,7 @@ class SequentialPipelineBlocks(ModularPipelineBlocks):
    def doc(self):
        return make_doc_string(
            self.inputs,
+            self.intermediate_inputs,
            self.outputs,
            self.description,
            class_name=self.__class__.__name__,
@@ -1144,6 +1456,11 @@ class LoopSequentialPipelineBlocks(ModularPipelineBlocks):
        """List of input parameters. Must be implemented by subclasses."""
        return []

+    @property
+    def loop_intermediate_outputs(self) -> List[OutputParam]:
+        """List of intermediate output parameters. Must be implemented by subclasses."""
+        return []
+
    @property
    def loop_required_inputs(self) -> List[str]:
        input_names = []
@@ -1152,11 +1469,6 @@ class LoopSequentialPipelineBlocks(ModularPipelineBlocks):
                input_names.append(input_param.name)
        return input_names

-    @property
-    def loop_intermediate_outputs(self) -> List[OutputParam]:
-        """List of intermediate output parameters. Must be implemented by subclasses."""
-        return []
-
    # modified from SequentialPipelineBlocks to include loop_expected_components
    @property
    def expected_components(self):
@@ -1183,16 +1495,43 @@ class LoopSequentialPipelineBlocks(ModularPipelineBlocks):
                expected_configs.append(config)
        return expected_configs

-    def _get_inputs(self):
+    # modified from SequentialPipelineBlocks to include loop_inputs
+    def get_inputs(self):
+        named_inputs = [(name, block.inputs) for name, block in self.sub_blocks.items()]
+        named_inputs.append(("loop", self.loop_inputs))
+        combined_inputs = self.combine_inputs(*named_inputs)
+        # mark an input as required if it is required by any of the blocks
+        for input_param in combined_inputs:
+            if input_param.name in self.required_inputs:
+                input_param.required = True
+            else:
+                input_param.required = False
+        return combined_inputs
+
+    @property
+    # Copied from diffusers.modular_pipelines.modular_pipeline.SequentialPipelineBlocks.inputs
+    def inputs(self):
+        return self.get_inputs()
+
+    # modified from SequentialPipelineBlocks to include loop_intermediate_inputs
+    @property
+    def intermediate_inputs(self):
+        intermediates = self.get_intermediate_inputs()
+        intermediate_names = [inp.name for inp in intermediates]
+        for loop_intermediate_input in self.loop_intermediate_inputs:
+            if loop_intermediate_input.name not in intermediate_names:
+                intermediates.append(loop_intermediate_input)
+        return intermediates
+
+    # modified from SequentialPipelineBlocks
+    def get_intermediate_inputs(self):
        inputs = []
-        inputs.extend(self.loop_inputs)
        outputs = set()

-        for name, block in self.sub_blocks.items():
+        # Go through all blocks in order
+        for block in self.sub_blocks.values():
            # Add inputs that aren't in outputs yet
-            for inp in block.inputs:
-                if inp.name not in outputs and inp not in inputs:
-                    inputs.append(inp)
+            inputs.extend(inp for inp in block.intermediate_inputs if inp.name not in outputs)

            # Only add outputs if the block cannot be skipped
            should_add_outputs = True
@@ -1203,20 +1542,8 @@ class LoopSequentialPipelineBlocks(ModularPipelineBlocks):
                # Add this block's outputs
                block_intermediate_outputs = [out.name for out in block.intermediate_outputs]
                outputs.update(block_intermediate_outputs)
-
-        for input_param in inputs:
-            if input_param.name in self.required_inputs:
-                input_param.required = True
-            else:
-                input_param.required = False
-
        return inputs

-    @property
-    # Copied from diffusers.modular_pipelines.modular_pipeline.SequentialPipelineBlocks.inputs
-    def inputs(self):
-        return self._get_inputs()
-
    # modified from SequentialPipelineBlocks, if any additionan input required by the loop is required by the block
    @property
    def required_inputs(self) -> List[str]:
@@ -1234,6 +1561,19 @@ class LoopSequentialPipelineBlocks(ModularPipelineBlocks):

        return list(required_by_any)

+    # YiYi TODO: maybe we do not need this, it is only used in docstring,
+    # intermediate_inputs is by default required, unless you manually handle it inside the block
+    @property
+    def required_intermediate_inputs(self) -> List[str]:
+        required_intermediate_inputs = []
+        for input_param in self.intermediate_inputs:
+            if input_param.required:
+                required_intermediate_inputs.append(input_param.name)
+        for input_param in self.loop_intermediate_inputs:
+            if input_param.required:
+                required_intermediate_inputs.append(input_param.name)
+        return required_intermediate_inputs
+
    # YiYi TODO: this need to be thought about more
    # modified from SequentialPipelineBlocks to include loop_intermediate_outputs
    @property
@@ -1523,6 +1863,96 @@ class ModularPipeline(ConfigMixin, PushToHubMixin):
            params[input_param.name] = input_param.default
        return params

+    def __call__(self, state: PipelineState = None, output: Union[str, List[str]] = None, **kwargs):
+        """
+        Execute the pipeline by running the pipeline blocks with the given inputs.
+
+        Args:
+            state (`PipelineState`, optional):
+                A `PipelineState` instance containing inputs and intermediate values. If None, a new `PipelineState`
+                is created from the user inputs and the pipeline blocks' requirements.
+            output (`str` or `List[str]`, optional):
+                Optional specification of what to return:
+                   - None: Returns the complete `PipelineState` with all inputs and intermediates (default)
+                   - str: Returns a specific intermediate value from the state (e.g. `output="image"`)
+                   - List[str]: Returns a dictionary of specific intermediate values (e.g. `output=["image",
+                     "latents"]`)
+
+
+        Examples:
+            ```python
+            # Get complete pipeline state
+            state = pipeline(prompt="A beautiful sunset", num_inference_steps=20)
+            print(state.intermediates)  # All intermediate outputs
+
+            # Get specific output
+            image = pipeline(prompt="A beautiful sunset", output="image")
+
+            # Get multiple specific outputs
+            results = pipeline(prompt="A beautiful sunset", output=["image", "latents"])
+            image, latents = results["image"], results["latents"]
+
+            # Continue from previous state
+            state = pipeline(prompt="A beautiful sunset")
+            new_state = pipeline(state=state, output="image")  # Continue processing
+            ```
+
+        Returns:
+            - If `output` is None: Complete `PipelineState` containing all inputs and intermediates
+            - If `output` is str: The specific intermediate value from the state (e.g. `output="image"`)
+            - If `output` is List[str]: Dictionary mapping output names to their values from the state (e.g.
+              `output=["image", "latents"]`)
+        """
+        if state is None:
+            state = PipelineState()
+
+        # Make a copy of the input kwargs
+        passed_kwargs = kwargs.copy()
+
+        # Add inputs to the state, using defaults for any that are in neither the kwargs nor the state.
+        # If an input is already in the state, a value passed in the kwargs overrides it.
+        intermediate_inputs = [inp.name for inp in self.blocks.intermediate_inputs]
+        for expected_input_param in self.blocks.inputs:
+            name = expected_input_param.name
+            default = expected_input_param.default
+            kwargs_type = expected_input_param.kwargs_type
+            if name in passed_kwargs:
+                if name not in intermediate_inputs:
+                    state.set_input(name, passed_kwargs.pop(name), kwargs_type)
+                else:
+                    state.set_input(name, passed_kwargs[name], kwargs_type)
+            elif name not in state.inputs:
+                state.set_input(name, default, kwargs_type)
+
+        for expected_intermediate_param in self.blocks.intermediate_inputs:
+            name = expected_intermediate_param.name
+            kwargs_type = expected_intermediate_param.kwargs_type
+            if name in passed_kwargs:
+                state.set_intermediate(name, passed_kwargs.pop(name), kwargs_type)
+
+        # Warn about unexpected inputs
+        if len(passed_kwargs) > 0:
+            warnings.warn(f"Unexpected inputs {list(passed_kwargs.keys())} provided. These inputs will be ignored.")
+        # Run the pipeline
+        with torch.no_grad():
+            try:
+                _, state = self.blocks(self, state)
+            except Exception:
+                error_msg = f"Error in block: ({self.blocks.__class__.__name__}):\n"
+                logger.error(error_msg)
+                raise
+
+        if output is None:
+            return state
+
+        elif isinstance(output, str):
+            return state.get_intermediate(output)
+
+        elif isinstance(output, (list, tuple)):
+            return state.get_intermediates(output)
+        else:
+            raise ValueError(f"Output '{output}' is not a valid output type")
+
    def load_default_components(self, **kwargs):
        """
        Load from_pretrained components using the loading specs in the config dict.
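Review note: a minimal sketch of how the updated `__call__` routes keyword arguments between user inputs and intermediate inputs. `Param`, `State`, and `route_kwargs` are hypothetical stand-ins for `InputParam`, `PipelineState`, and the body of `__call__`; they are not part of diffusers, and `kwargs_type` handling is left out.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Param:  # hypothetical stand-in for InputParam
    name: str
    default: Any = None


@dataclass
class State:  # hypothetical stand-in for PipelineState
    inputs: Dict[str, Any] = field(default_factory=dict)
    intermediates: Dict[str, Any] = field(default_factory=dict)


def route_kwargs(state: State, inputs: List[Param], intermediate_inputs: List[Param], **kwargs) -> State:
    passed = dict(kwargs)
    intermediate_names = {p.name for p in intermediate_inputs}

    for p in inputs:
        if p.name in passed:
            # Leave the kwarg in place when it is also declared as an intermediate input,
            # so the second loop can store it as an intermediate as well.
            value = passed[p.name] if p.name in intermediate_names else passed.pop(p.name)
            state.inputs[p.name] = value
        elif p.name not in state.inputs:
            state.inputs[p.name] = p.default

    for p in intermediate_inputs:
        if p.name in passed:
            state.intermediates[p.name] = passed.pop(p.name)

    if passed:
        warnings.warn(f"Unexpected inputs {sorted(passed)} will be ignored.")
    return state


state = route_kwargs(
    State(),
    inputs=[Param("num_images_per_prompt", default=1), Param("latents")],
    intermediate_inputs=[Param("latents"), Param("prompt_embeds")],
    latents="L",
    prompt_embeds="E",
)
```

In this sketch `latents` ends up in both `state.inputs` and `state.intermediates`, while `prompt_embeds` is stored only as an intermediate.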
@@ -2341,92 +2771,3 @@ class ModularPipeline(ConfigMixin, PushToHubMixin):
            type_hint=type_hint,
            **spec_dict,
        )
-
-    def set_progress_bar_config(self, **kwargs):
-        for sub_block_name, sub_block in self.blocks.sub_blocks.items():
-            if hasattr(sub_block, "set_progress_bar_config"):
-                sub_block.set_progress_bar_config(**kwargs)
-
-    def __call__(self, state: PipelineState = None, output: Union[str, List[str]] = None, **kwargs):
-        """
-        Execute the pipeline by running the pipeline blocks with the given inputs.
-
-        Args:
-            state (`PipelineState`, optional):
-                PipelineState instance contains inputs and intermediate values. If None, a new `PipelineState` will be
-                created based on the user inputs and the pipeline blocks's requirement.
-            output (`str` or `List[str]`, optional):
-                Optional specification of what to return:
-                   - None: Returns the complete `PipelineState` with all inputs and intermediates (default)
-                   - str: Returns a specific intermediate value from the state (e.g. `output="image"`)
-                   - List[str]: Returns a dictionary of specific intermediate values (e.g. `output=["image",
-                     "latents"]`)
-
-
-        Examples:
-            ```python
-            # Get complete pipeline state
-            state = pipeline(prompt="A beautiful sunset", num_inference_steps=20)
-            print(state.intermediates)  # All intermediate outputs
-
-            # Get specific output
-            image = pipeline(prompt="A beautiful sunset", output="image")
-
-            # Get multiple specific outputs
-            results = pipeline(prompt="A beautiful sunset", output=["image", "latents"])
-            image, latents = results["image"], results["latents"]
-
-            # Continue from previous state
-            state = pipeline(prompt="A beautiful sunset")
-            new_state = pipeline(state=state, output="image")  # Continue processing
-            ```
-
-        Returns:
-            - If `output` is None: Complete `PipelineState` containing all inputs and intermediates
-            - If `output` is str: The specific intermediate value from the state (e.g. `output="image"`)
-            - If `output` is List[str]: Dictionary mapping output names to their values from the state (e.g.
-              `output=["image", "latents"]`)
-        """
-        if state is None:
-            state = PipelineState()
-
-        # Make a copy of the input kwargs
-        passed_kwargs = kwargs.copy()
-
-        # Add inputs to state, using defaults if not provided in the kwargs or the state
-        # if same input already in the state, will override it if provided in the kwargs
-        intermediate_inputs = [inp.name for inp in self.blocks.inputs]
-        for expected_input_param in self.blocks.inputs:
-            name = expected_input_param.name
-            default = expected_input_param.default
-            kwargs_type = expected_input_param.kwargs_type
-            if name in passed_kwargs:
-                if name not in intermediate_inputs:
-                    state.set(name, passed_kwargs.pop(name), kwargs_type)
-                else:
-                    state.set(name, passed_kwargs[name], kwargs_type)
-            elif name not in state.values:
-                state.set(name, default, kwargs_type)
-
-        # Warn about unexpected inputs
-        if len(passed_kwargs) > 0:
-            warnings.warn(f"Unexpected input '{passed_kwargs.keys()}' provided. This input will be ignored.")
-        # Run the pipeline
-        with torch.no_grad():
-            try:
-                _, state = self.blocks(self, state)
-            except Exception:
-                error_msg = f"Error in block: ({self.blocks.__class__.__name__}):\n"
-                logger.error(error_msg)
-                raise
-
-        if output is None:
-            return state
-
-        if isinstance(output, str):
-            return state.get(output)
-
-        elif isinstance(output, (list, tuple)):
-            return state.get(output)
-        else:
-            raise ValueError(f"Output '{output}' is not a valid output type")
@@ -185,6 +185,8 @@ class ComponentSpec:
        Unique identifier for this spec's pretrained load, composed of repo|subfolder|variant|revision (no empty
        segments).
        """
+        if self.default_creation_method == "from_config":
+            return "null"
        parts = [getattr(self, k) for k in self.loading_fields()]
        parts = ["null" if p is None else p for p in parts]
        return "|".join(p for p in parts if p)
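Review note: a small sketch of the `load_id` behavior after this change, assuming a trimmed-down `Spec` dataclass (hypothetical) whose loading fields are the four named in the docstring. Specs created `from_config` now short-circuit to `"null"` instead of building an id from their loading fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Spec:  # hypothetical, trimmed-down stand-in for ComponentSpec
    repo: Optional[str] = None
    subfolder: Optional[str] = None
    variant: Optional[str] = None
    revision: Optional[str] = None
    default_creation_method: str = "from_pretrained"

    @property
    def load_id(self) -> str:
        # Components built from a config have nothing to load, so they share one id.
        if self.default_creation_method == "from_config":
            return "null"
        parts = [self.repo, self.subfolder, self.variant, self.revision]
        parts = ["null" if p is None else p for p in parts]
        return "|".join(p for p in parts if p)


pretrained = Spec(repo="org/model", subfolder="unet")
from_config = Spec(repo="org/model", default_creation_method="from_config")
```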
@@ -213,6 +213,11 @@ class StableDiffusionXLInputStep(ModularPipelineBlocks):
    def inputs(self) -> List[InputParam]:
        return [
            InputParam("num_images_per_prompt", default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "prompt_embeds",
                required=True,
@@ -416,6 +421,11 @@ class StableDiffusionXLImg2ImgSetTimestepsStep(ModularPipelineBlocks):
            InputParam("denoising_start"),
            # YiYi TODO: do we need num_images_per_prompt here?
            InputParam("num_images_per_prompt", default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "batch_size",
                required=True,
@@ -630,6 +640,11 @@ class StableDiffusionXLInpaintPrepareLatentsStep(ModularPipelineBlocks):
                "`num_inference_steps`. A value of 1, therefore, essentially ignores `image`. Note that in the case of "
                "`denoising_start` being declared as an integer, the value of `strength` will be ignored.",
            ),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("generator"),
            InputParam(
                "batch_size",
@@ -729,6 +744,8 @@ class StableDiffusionXLInpaintPrepareLatentsStep(ModularPipelineBlocks):
        timestep=None,
        is_strength_max=True,
        add_noise=True,
+        return_noise=False,
+        return_image_latents=False,
    ):
        shape = (
            batch_size,
@@ -751,7 +768,7 @@ class StableDiffusionXLInpaintPrepareLatentsStep(ModularPipelineBlocks):
        if image.shape[1] == 4:
            image_latents = image.to(device=device, dtype=dtype)
            image_latents = image_latents.repeat(batch_size // image_latents.shape[0], 1, 1, 1)
-        elif latents is None and not is_strength_max:
+        elif return_image_latents or (latents is None and not is_strength_max):
            image = image.to(device=device, dtype=dtype)
            image_latents = self._encode_vae_image(components, image=image, generator=generator)
            image_latents = image_latents.repeat(batch_size // image_latents.shape[0], 1, 1, 1)
@@ -769,7 +786,13 @@ class StableDiffusionXLInpaintPrepareLatentsStep(ModularPipelineBlocks):
            noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
            latents = image_latents.to(device)

-        outputs = (latents, noise, image_latents)
+        outputs = (latents,)
+
+        if return_noise:
+            outputs += (noise,)
+
+        if return_image_latents:
+            outputs += (image_latents,)

        return outputs
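Review note: the return value now grows with the flags, so callers must unpack exactly as many values as they request. A minimal sketch of the pattern, using a hypothetical `prepare` function with string placeholders instead of tensors:

```python
from typing import Tuple


def prepare(return_noise: bool = False, return_image_latents: bool = False) -> Tuple[str, ...]:
    # Placeholders standing in for the real tensors.
    latents, noise, image_latents = "latents", "noise", "image_latents"

    outputs = (latents,)
    if return_noise:
        outputs += (noise,)
    if return_image_latents:
        outputs += (image_latents,)
    return outputs


# Matches the call site in this diff: return_noise=True, return_image_latents=False.
latents, noise = prepare(return_noise=True)
```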

@@ -841,7 +864,7 @@ class StableDiffusionXLInpaintPrepareLatentsStep(ModularPipelineBlocks):
        block_state.height = block_state.image_latents.shape[-2] * components.vae_scale_factor
        block_state.width = block_state.image_latents.shape[-1] * components.vae_scale_factor

-        block_state.latents, block_state.noise, block_state.image_latents = self.prepare_latents_inpaint(
+        block_state.latents, block_state.noise = self.prepare_latents_inpaint(
            components,
            block_state.batch_size * block_state.num_images_per_prompt,
            components.num_channels_latents,
@@ -855,6 +878,8 @@ class StableDiffusionXLInpaintPrepareLatentsStep(ModularPipelineBlocks):
            timestep=block_state.latent_timestep,
            is_strength_max=block_state.is_strength_max,
            add_noise=block_state.add_noise,
+            return_noise=True,
+            return_image_latents=False,
        )

        # 7. Prepare mask latent variables
@@ -895,6 +920,11 @@ class StableDiffusionXLImg2ImgPrepareLatentsStep(ModularPipelineBlocks):
            InputParam("latents"),
            InputParam("num_images_per_prompt", default=1),
            InputParam("denoising_start"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("generator"),
            InputParam(
                "latent_timestep",
@@ -972,6 +1002,11 @@ class StableDiffusionXLPrepareLatentsStep(ModularPipelineBlocks):
            InputParam("width"),
            InputParam("latents"),
            InputParam("num_images_per_prompt", default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("generator"),
            InputParam(
                "batch_size",
@@ -1094,6 +1129,11 @@ class StableDiffusionXLImg2ImgPrepareAdditionalConditioningStep(ModularPipelineB
            InputParam("num_images_per_prompt", default=1),
            InputParam("aesthetic_score", default=6.0),
            InputParam("negative_aesthetic_score", default=2.0),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "latents",
                required=True,
@@ -1305,6 +1345,11 @@ class StableDiffusionXLPrepareAdditionalConditioningStep(ModularPipelineBlocks):
            InputParam("crops_coords_top_left", default=(0, 0)),
            InputParam("negative_crops_coords_top_left", default=(0, 0)),
            InputParam("num_images_per_prompt", default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "latents",
                required=True,
@@ -1482,6 +1527,11 @@ class StableDiffusionXLControlNetInputStep(ModularPipelineBlocks):
            InputParam("controlnet_conditioning_scale", default=1.0),
            InputParam("guess_mode", default=False),
            InputParam("num_images_per_prompt", default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "latents",
                required=True,
@@ -23,10 +23,7 @@ from ...image_processor import VaeImageProcessor
 from ...models import AutoencoderKL
 from ...models.attention_processor import AttnProcessor2_0, XFormersAttnProcessor
 from ...utils import logging
-from ..modular_pipeline import (
-    ModularPipelineBlocks,
-    PipelineState,
-)
+from ..modular_pipeline import ModularPipelineBlocks, PipelineState
 from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam


@@ -56,12 +53,17 @@ class StableDiffusionXLDecodeStep(ModularPipelineBlocks):
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
            InputParam("output_type", default="pil"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "latents",
                required=True,
                type_hint=torch.Tensor,
                description="The denoised latents from the denoising step",
-            ),
+            )
        ]

    @property
@@ -91,7 +91,7 @@ class StableDiffusionXLInpaintLoopBeforeDenoiser(ModularPipelineBlocks):
        )

    @property
-    def inputs(self) -> List[str]:
+    def intermediate_inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "latents",
@@ -171,6 +171,11 @@ class StableDiffusionXLLoopDenoiser(ModularPipelineBlocks):
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
            InputParam("cross_attention_kwargs"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "num_inference_steps",
                required=True,
@@ -272,6 +277,11 @@ class StableDiffusionXLControlNetLoopDenoiser(ModularPipelineBlocks):
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
            InputParam("cross_attention_kwargs"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam(
                "controlnet_cond",
                required=True,
@@ -460,6 +470,11 @@ class StableDiffusionXLLoopAfterDenoiser(ModularPipelineBlocks):
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
            InputParam("eta", default=0.0),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("generator"),
        ]

@@ -527,6 +542,11 @@ class StableDiffusionXLInpaintLoopAfterDenoiser(ModularPipelineBlocks):
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
            InputParam("eta", default=0.0),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("generator"),
            InputParam(
                "timesteps",
@@ -601,6 +601,11 @@ class StableDiffusionXLVaeEncoderStep(ModularPipelineBlocks):
            InputParam("image", required=True),
            InputParam("height"),
            InputParam("width"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("generator"),
            InputParam("dtype", type_hint=torch.dtype, description="Data type of model tensor inputs"),
            InputParam(
@@ -721,6 +726,11 @@ class StableDiffusionXLInpaintVaeEncoderStep(ModularPipelineBlocks):
            InputParam("image", required=True),
            InputParam("mask_image", required=True),
            InputParam("padding_mask_crop"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
            InputParam("dtype", type_hint=torch.dtype, description="The dtype of the model inputs"),
            InputParam("generator"),
        ]
@@ -247,6 +247,10 @@ SDXL_INPUTS_SCHEMA = {
    "control_mode": InputParam(
        "control_mode", type_hint=List[int], required=True, description="Control mode for union controlnet"
    ),
+}
+
+
+SDXL_INTERMEDIATE_INPUTS_SCHEMA = {
    "prompt_embeds": InputParam(
        "prompt_embeds",
        type_hint=torch.Tensor,
@@ -267,6 +271,13 @@ SDXL_INPUTS_SCHEMA = {
    "preprocess_kwargs": InputParam(
        "preprocess_kwargs", type_hint=Optional[dict], description="Kwargs for ImageProcessor"
    ),
+    "latents": InputParam(
+        "latents", type_hint=torch.Tensor, required=True, description="Initial latents for denoising process"
+    ),
+    "timesteps": InputParam("timesteps", type_hint=torch.Tensor, required=True, description="Timesteps for inference"),
+    "num_inference_steps": InputParam(
+        "num_inference_steps", type_hint=int, required=True, description="Number of denoising steps"
+    ),
    "latent_timestep": InputParam(
        "latent_timestep", type_hint=torch.Tensor, required=True, description="Initial noise level timestep"
    ),
@@ -0,0 +1,66 @@
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    DIFFUSERS_SLOW_IMPORT,
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    get_objects_from_module,
+    is_torch_available,
+    is_transformers_available,
+)
+
+
+_dummy_objects = {}
+_import_structure = {}
+
+try:
+    if not (is_transformers_available() and is_torch_available()):
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    from ...utils import dummy_torch_and_transformers_objects  # noqa F403
+
+    _dummy_objects.update(get_objects_from_module(dummy_torch_and_transformers_objects))
+else:
+    _import_structure["encoders"] = ["WanTextEncoderStep"]
+    _import_structure["modular_blocks"] = [
+        "ALL_BLOCKS",
+        "AUTO_BLOCKS",
+        "TEXT2VIDEO_BLOCKS",
+        "WanAutoBeforeDenoiseStep",
+        "WanAutoBlocks",
+        "WanAutoBlocks",
+        "WanAutoDecodeStep",
+        "WanAutoDenoiseStep",
+    ]
+    _import_structure["modular_pipeline"] = ["WanModularPipeline"]
+
+if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
+    try:
+        if not (is_transformers_available() and is_torch_available()):
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        from ...utils.dummy_torch_and_transformers_objects import *  # noqa F403
+    else:
+        from .encoders import WanTextEncoderStep
+        from .modular_blocks import (
+            ALL_BLOCKS,
+            AUTO_BLOCKS,
+            TEXT2VIDEO_BLOCKS,
+            WanAutoBeforeDenoiseStep,
+            WanAutoBlocks,
+            WanAutoDecodeStep,
+            WanAutoDenoiseStep,
+        )
+        from .modular_pipeline import WanModularPipeline
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(
+        __name__,
+        globals()["__file__"],
+        _import_structure,
+        module_spec=__spec__,
+    )
+
+    for name, value in _dummy_objects.items():
+        setattr(sys.modules[__name__], name, value)
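Review note: the `_import_structure` dict above maps submodules to the names they export, and `_LazyModule` resolves those names on first access. The sketch below illustrates the general idea only; `LazyModule` is a hypothetical simplification (it uses absolute stdlib module names rather than relative submodules) and is not the diffusers implementation.

```python
import importlib
import types
from typing import Dict, List


class LazyModule(types.ModuleType):  # hypothetical simplification of the lazy-import idea
    def __init__(self, name: str, import_structure: Dict[str, List[str]]):
        super().__init__(name)
        # Invert {module: [attrs]} into {attr: module} for lookup on access.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr: str):
        # Only called when normal attribute lookup fails, i.e. before the first import.
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


lazy = LazyModule("demo", {"math": ["sqrt"], "json": ["dumps"]})
root = lazy.sqrt(9.0)  # `math` is imported here, on first access
```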
@@ -0,0 +1,365 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+from typing import List, Optional, Union
+
+import torch
+
+from ...schedulers import UniPCMultistepScheduler
+from ...utils import logging
+from ...utils.torch_utils import randn_tensor
+from ..modular_pipeline import ModularPipelineBlocks, PipelineState
+from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
+from .modular_pipeline import WanModularPipeline
+
+
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+
+
+# TODO(yiyi, aryan): We need another step before text encoder to set the `num_inference_steps` attribute for guider so that
+# things like when to do guidance and how many conditions to be prepared can be determined. Currently, this is done by
+# always assuming you want to do guidance in the Guiders. So, negative embeddings are prepared regardless of what the
+# configuration of guider is.
+
+
+# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
+def retrieve_timesteps(
+    scheduler,
+    num_inference_steps: Optional[int] = None,
+    device: Optional[Union[str, torch.device]] = None,
+    timesteps: Optional[List[int]] = None,
+    sigmas: Optional[List[float]] = None,
+    **kwargs,
+):
+    r"""
+    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
+    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
+
+    Args:
+        scheduler (`SchedulerMixin`):
+            The scheduler to get timesteps from.
+        num_inference_steps (`int`):
+            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
+            must be `None`.
+        device (`str` or `torch.device`, *optional*):
+            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
+        timesteps (`List[int]`, *optional*):
+            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
+            `num_inference_steps` and `sigmas` must be `None`.
+        sigmas (`List[float]`, *optional*):
+            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
+            `num_inference_steps` and `timesteps` must be `None`.
+
+    Returns:
+        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
+        second element is the number of inference steps.
+    """
+    if timesteps is not None and sigmas is not None:
+        raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
+    if timesteps is not None:
+        accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
+        if not accepts_timesteps:
+            raise ValueError(
+                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
+                f" timestep schedules. Please check whether you are using the correct scheduler."
+            )
+        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
+        timesteps = scheduler.timesteps
+        num_inference_steps = len(timesteps)
+    elif sigmas is not None:
+        accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
+        if not accept_sigmas:
+            raise ValueError(
+                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
+                f" sigmas schedules. Please check whether you are using the correct scheduler."
+            )
+        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
+        timesteps = scheduler.timesteps
+        num_inference_steps = len(timesteps)
+    else:
+        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
+        timesteps = scheduler.timesteps
+    return timesteps, num_inference_steps
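Review note: `retrieve_timesteps` decides whether custom `timesteps` or `sigmas` are allowed by inspecting the signature of the scheduler's `set_timesteps`. A minimal sketch of that check, using two hypothetical scheduler classes defined only for illustration:

```python
import inspect


class StepsOnlyScheduler:  # hypothetical scheduler without custom-timestep support
    def set_timesteps(self, num_inference_steps, device=None):
        self.timesteps = list(range(num_inference_steps - 1, -1, -1))


class CustomTimestepScheduler:  # hypothetical scheduler that accepts custom timesteps
    def set_timesteps(self, num_inference_steps=None, device=None, timesteps=None):
        if timesteps is not None:
            self.timesteps = list(timesteps)
        else:
            self.timesteps = list(range(num_inference_steps - 1, -1, -1))


def supports(scheduler, argument: str) -> bool:
    # Same test as in retrieve_timesteps: is the name a parameter of set_timesteps?
    return argument in inspect.signature(scheduler.set_timesteps).parameters
```

With these definitions, passing `timesteps` alongside `StepsOnlyScheduler` would take the `ValueError` branch, while `CustomTimestepScheduler` would accept it.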
+
+
+class WanInputStep(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def description(self) -> str:
+        return (
+            "Input processing step that:\n"
+            "  1. Determines `batch_size` and `dtype` based on `prompt_embeds`\n"
+            "  2. Adjusts input tensor shapes based on `batch_size` (number of prompts) and `num_videos_per_prompt`\n\n"
+            "All input tensors are expected to have either batch_size=1 or match the batch_size\n"
+            "of prompt_embeds. The tensors will be duplicated across the batch dimension to\n"
+            "have a final batch_size of batch_size * num_videos_per_prompt."
+        )
+
+    @property
+    def inputs(self) -> List[InputParam]:
+        return [
+            InputParam("num_videos_per_prompt", default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
+            InputParam(
+                "prompt_embeds",
+                required=True,
+                type_hint=torch.Tensor,
+                description="Pre-generated text embeddings. Can be generated from text_encoder step.",
+            ),
+            InputParam(
+                "negative_prompt_embeds",
+                type_hint=torch.Tensor,
+                description="Pre-generated negative text embeddings. Can be generated from text_encoder step.",
+            ),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [
+            OutputParam(
+                "batch_size",
+                type_hint=int,
+                description="Number of prompts, the final batch size of model inputs should be batch_size * num_videos_per_prompt",
+            ),
+            OutputParam(
+                "dtype",
+                type_hint=torch.dtype,
+                description="Data type of model tensor inputs (determined by `prompt_embeds`)",
+            ),
+            OutputParam(
+                "prompt_embeds",
+                type_hint=torch.Tensor,
+                kwargs_type="guider_input_fields",  # already in intermediates state but declare here again for guider_input_fields
+                description="text embeddings used to guide the video generation",
+            ),
+            OutputParam(
+                "negative_prompt_embeds",
+                type_hint=torch.Tensor,
+                kwargs_type="guider_input_fields",  # already in intermediates state but declare here again for guider_input_fields
+                description="negative text embeddings used to guide the video generation",
+            ),
+        ]
+
+    def check_inputs(self, components, block_state):
+        if block_state.prompt_embeds is not None and block_state.negative_prompt_embeds is not None:
+            if block_state.prompt_embeds.shape != block_state.negative_prompt_embeds.shape:
+                raise ValueError(
+                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
+                    f" got: `prompt_embeds` {block_state.prompt_embeds.shape} != `negative_prompt_embeds`"
+                    f" {block_state.negative_prompt_embeds.shape}."
+                )
+
+    @torch.no_grad()
+    def __call__(self, components: WanModularPipeline, state: PipelineState) -> PipelineState:
+        block_state = self.get_block_state(state)
+        self.check_inputs(components, block_state)
+
+        block_state.batch_size = block_state.prompt_embeds.shape[0]
+        block_state.dtype = block_state.prompt_embeds.dtype
+
+        _, seq_len, _ = block_state.prompt_embeds.shape
+        block_state.prompt_embeds = block_state.prompt_embeds.repeat(1, block_state.num_videos_per_prompt, 1)
+        block_state.prompt_embeds = block_state.prompt_embeds.view(
+            block_state.batch_size * block_state.num_videos_per_prompt, seq_len, -1
+        )
+
+        if block_state.negative_prompt_embeds is not None:
+            _, seq_len, _ = block_state.negative_prompt_embeds.shape
+            block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.repeat(
+                1, block_state.num_videos_per_prompt, 1
+            )
+            block_state.negative_prompt_embeds = block_state.negative_prompt_embeds.view(
+                block_state.batch_size * block_state.num_videos_per_prompt, seq_len, -1
+            )
+
+        self.set_block_state(state, block_state)
+
+        return components, state
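Review note: the `repeat` followed by `view` in `WanInputStep.__call__` duplicates each prompt's embeddings so that copies of the same prompt sit next to each other along the batch dimension. A small sketch with a toy tensor; `expand_per_prompt` is a hypothetical helper name, and the comparison with `repeat_interleave` is only there to show the resulting ordering.

```python
import torch


def expand_per_prompt(embeds: torch.Tensor, num_per_prompt: int) -> torch.Tensor:
    batch_size, seq_len, _ = embeds.shape
    # [B, S, D] -> [B, S * n, D]: the copies are laid out along the sequence axis first.
    embeds = embeds.repeat(1, num_per_prompt, 1)
    # [B, S * n, D] -> [B * n, S, D]: each prompt's copies become consecutive batch entries.
    return embeds.view(batch_size * num_per_prompt, seq_len, -1)


embeds = torch.arange(2 * 3 * 4, dtype=torch.float32).reshape(2, 3, 4)
expanded = expand_per_prompt(embeds, num_per_prompt=2)
```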
+
+
+class WanSetTimestepsStep(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("scheduler", UniPCMultistepScheduler),
+        ]
+
+    @property
+    def description(self) -> str:
+        return "Step that sets the scheduler's timesteps for inference"
+
+    @property
+    def inputs(self) -> List[InputParam]:
+        return [
+            InputParam("num_inference_steps", default=50),
+            InputParam("timesteps"),
+            InputParam("sigmas"),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [
+            OutputParam("timesteps", type_hint=torch.Tensor, description="The timesteps to use for inference"),
+            OutputParam(
+                "num_inference_steps",
+                type_hint=int,
+                description="The number of denoising steps to perform at inference time",
+            ),
+        ]
+
+    @torch.no_grad()
+    def __call__(self, components: WanModularPipeline, state: PipelineState) -> PipelineState:
+        block_state = self.get_block_state(state)
+        block_state.device = components._execution_device
+
+        block_state.timesteps, block_state.num_inference_steps = retrieve_timesteps(
+            components.scheduler,
+            block_state.num_inference_steps,
+            block_state.device,
+            block_state.timesteps,
+            block_state.sigmas,
+        )
+
+        self.set_block_state(state, block_state)
+        return components, state
+
+
+class WanPrepareLatentsStep(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return []
+
+    @property
+    def description(self) -> str:
+        return "Prepare latents step that prepares the latents for the text-to-video generation process"
+
+    @property
+    def inputs(self) -> List[InputParam]:
+        return [
+            InputParam("height", type_hint=int),
+            InputParam("width", type_hint=int),
+            InputParam("num_frames", type_hint=int),
+            InputParam("latents", type_hint=Optional[torch.Tensor]),
+            InputParam("num_videos_per_prompt", type_hint=int, default=1),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
+            InputParam("generator"),
+            InputParam(
+                "batch_size",
+                required=True,
+                type_hint=int,
+                description="Number of prompts, the final batch size of model inputs should be `batch_size * num_videos_per_prompt`. Can be generated in input step.",
+            ),
+            InputParam("dtype", type_hint=torch.dtype, description="The dtype of the model inputs"),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [
+            OutputParam(
+                "latents", type_hint=torch.Tensor, description="The initial latents to use for the denoising process"
+            )
+        ]
+
+    @staticmethod
+    def check_inputs(components, block_state):
+        if (block_state.height is not None and block_state.height % components.vae_scale_factor_spatial != 0) or (
+            block_state.width is not None and block_state.width % components.vae_scale_factor_spatial != 0
+        ):
+            raise ValueError(
+                f"`height` and `width` have to be divisible by {components.vae_scale_factor_spatial} but are {block_state.height} and {block_state.width}."
+            )
+        if block_state.num_frames is not None and (
+            block_state.num_frames < 1 or (block_state.num_frames - 1) % components.vae_scale_factor_temporal != 0
+        ):
+            raise ValueError(
+                f"`num_frames` has to be greater than 0, and (num_frames - 1) must be divisible by {components.vae_scale_factor_temporal}, but got {block_state.num_frames}."
+            )
+
+    @staticmethod
+    # Copied from diffusers.pipelines.wan.pipeline_wan.WanPipeline.prepare_latents with self->comp
+    def prepare_latents(
+        comp,
+        batch_size: int,
+        num_channels_latents: int = 16,
+        height: int = 480,
+        width: int = 832,
+        num_frames: int = 81,
+        dtype: Optional[torch.dtype] = None,
+        device: Optional[torch.device] = None,
+        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+        latents: Optional[torch.Tensor] = None,
+    ) -> torch.Tensor:
+        if latents is not None:
+            return latents.to(device=device, dtype=dtype)
+
+        num_latent_frames = (num_frames - 1) // comp.vae_scale_factor_temporal + 1
+        shape = (
+            batch_size,
+            num_channels_latents,
+            num_latent_frames,
+            int(height) // comp.vae_scale_factor_spatial,
+            int(width) // comp.vae_scale_factor_spatial,
+        )
+        if isinstance(generator, list) and len(generator) != batch_size:
+            raise ValueError(
+                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
+                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
+            )
+
+        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+        return latents
+
+    @torch.no_grad()
+    def __call__(self, components: WanModularPipeline, state: PipelineState) -> PipelineState:
+        block_state = self.get_block_state(state)
+
+        block_state.height = block_state.height or components.default_height
+        block_state.width = block_state.width or components.default_width
+        block_state.num_frames = block_state.num_frames or components.default_num_frames
+        block_state.device = components._execution_device
+        block_state.dtype = torch.float32  # Wan latents should be torch.float32 for best quality
+        block_state.num_channels_latents = components.num_channels_latents
+
+        self.check_inputs(components, block_state)
+
+        block_state.latents = self.prepare_latents(
+            components,
+            block_state.batch_size * block_state.num_videos_per_prompt,
+            block_state.num_channels_latents,
+            block_state.height,
+            block_state.width,
+            block_state.num_frames,
+            block_state.dtype,
+            block_state.device,
+            block_state.generator,
+            block_state.latents,
+        )
+
+        self.set_block_state(state, block_state)
+
+        return components, state
@@ -0,0 +1,105 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Any, List, Tuple, Union
+
+import numpy as np
+import PIL
+import torch
+
+from ...configuration_utils import FrozenDict
+from ...models import AutoencoderKLWan
+from ...utils import logging
+from ...video_processor import VideoProcessor
+from ..modular_pipeline import ModularPipelineBlocks, PipelineState
+from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
+
+
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+
+
+class WanDecodeStep(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("vae", AutoencoderKLWan),
+            ComponentSpec(
+                "video_processor",
+                VideoProcessor,
+                config=FrozenDict({"vae_scale_factor": 8}),
+                default_creation_method="from_config",
+            ),
+        ]
+
+    @property
+    def description(self) -> str:
+        return "Step that decodes the denoised latents into videos"
+
+    @property
+    def inputs(self) -> List[Tuple[str, Any]]:
+        return [
+            InputParam("output_type", default="pil"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
+            InputParam(
+                "latents",
+                required=True,
+                type_hint=torch.Tensor,
+                description="The denoised latents from the denoising step",
+            )
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [
+            OutputParam(
+                "videos",
+                type_hint=Union[List[List[PIL.Image.Image]], List[torch.Tensor], List[np.ndarray]],
+                description="The generated videos, as lists of PIL images, torch tensors, or numpy arrays",
+            )
+        ]
+
+    @torch.no_grad()
+    def __call__(self, components, state: PipelineState) -> PipelineState:
+        block_state = self.get_block_state(state)
+        vae_dtype = components.vae.dtype
+
+        if block_state.output_type != "latent":
+            latents = block_state.latents
+            latents_mean = (
+                torch.tensor(components.vae.config.latents_mean)
+                .view(1, components.vae.config.z_dim, 1, 1, 1)
+                .to(latents.device, latents.dtype)
+            )
+            latents_std = 1.0 / torch.tensor(components.vae.config.latents_std).view(
+                1, components.vae.config.z_dim, 1, 1, 1
+            ).to(latents.device, latents.dtype)
+            latents = latents / latents_std + latents_mean
+            latents = latents.to(vae_dtype)
+            videos = components.vae.decode(latents, return_dict=False)[0]
+            # Post-process only decoded videos; `postprocess_video` accepts "np", "pt", or "pil", not "latent"
+            block_state.videos = components.video_processor.postprocess_video(
+                videos, output_type=block_state.output_type
+            )
+        else:
+            block_state.videos = block_state.latents
+
+        self.set_block_state(state, block_state)
+
+        return components, state
@@ -0,0 +1,261 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Any, List, Tuple
+
+import torch
+
+from ...configuration_utils import FrozenDict
+from ...guiders import ClassifierFreeGuidance
+from ...models import WanTransformer3DModel
+from ...schedulers import UniPCMultistepScheduler
+from ...utils import logging
+from ..modular_pipeline import (
+    BlockState,
+    LoopSequentialPipelineBlocks,
+    ModularPipelineBlocks,
+    PipelineState,
+)
+from ..modular_pipeline_utils import ComponentSpec, InputParam, OutputParam
+from .modular_pipeline import WanModularPipeline
+
+
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+
+
+class WanLoopDenoiser(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec(
+                "guider",
+                ClassifierFreeGuidance,
+                config=FrozenDict({"guidance_scale": 5.0}),
+                default_creation_method="from_config",
+            ),
+            ComponentSpec("transformer", WanTransformer3DModel),
+        ]
+
+    @property
+    def description(self) -> str:
+        return (
+            "Step within the denoising loop that denoises the latents with guidance. "
+            "This block should be used to compose the `sub_blocks` attribute of a `LoopSequentialPipelineBlocks` "
+            "object (e.g. `WanDenoiseLoopWrapper`)"
+        )
+
+    @property
+    def inputs(self) -> List[Tuple[str, Any]]:
+        return [
+            InputParam("attention_kwargs"),
+        ]
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
+            InputParam(
+                "latents",
+                required=True,
+                type_hint=torch.Tensor,
+                description="The initial latents to use for the denoising process. Can be generated in prepare_latents step.",
+            ),
+            InputParam(
+                "num_inference_steps",
+                required=True,
+                type_hint=int,
+                description="The number of inference steps to use for the denoising process. Can be generated in set_timesteps step.",
+            ),
+            InputParam(
+                kwargs_type="guider_input_fields",
+                description=(
+                    "All conditional model inputs that need to be prepared with guider. "
+                    "It should contain prompt_embeds/negative_prompt_embeds. "
+                    "Please add `kwargs_type=guider_input_fields` to their parameter spec (`OutputParam`) when they are created and added to the pipeline state"
+                ),
+            ),
+        ]
+
+    @torch.no_grad()
+    def __call__(
+        self, components: WanModularPipeline, block_state: BlockState, i: int, t: torch.Tensor
+    ) -> PipelineState:
+        #  Map the keys we'll see on each `guider_state_batch` (e.g. guider_state_batch.prompt_embeds)
+        #  to the corresponding (cond, uncond) fields on block_state. (e.g. block_state.prompt_embeds, block_state.negative_prompt_embeds)
+        guider_input_fields = {
+            "prompt_embeds": ("prompt_embeds", "negative_prompt_embeds"),
+        }
+        transformer_dtype = components.transformer.dtype
+
+        components.guider.set_state(step=i, num_inference_steps=block_state.num_inference_steps, timestep=t)
+
+        # Prepare mini-batches according to the guidance method and `guider_input_fields`.
+        # Each guider_state_batch will have a `.prompt_embeds` field.
+        # e.g. for CFG, we prepare two batches: one for cond, one for uncond
+        # for the first batch, guider_state_batch.prompt_embeds corresponds to block_state.prompt_embeds
+        # for the second batch, guider_state_batch.prompt_embeds corresponds to block_state.negative_prompt_embeds
+        guider_state = components.guider.prepare_inputs(block_state, guider_input_fields)
+
+        # run the denoiser for each guidance batch
+        for guider_state_batch in guider_state:
+            components.guider.prepare_models(components.transformer)
+            cond_kwargs = guider_state_batch.as_dict()
+            cond_kwargs = {k: v for k, v in cond_kwargs.items() if k in guider_input_fields}
+            prompt_embeds = cond_kwargs.pop("prompt_embeds")
+
+            # Predict the noise residual
+            # store the noise_pred in guider_state_batch so that we can apply guidance across all batches
+            guider_state_batch.noise_pred = components.transformer(
+                hidden_states=block_state.latents.to(transformer_dtype),
+                timestep=t.flatten(),
+                encoder_hidden_states=prompt_embeds,
+                attention_kwargs=block_state.attention_kwargs,
+                return_dict=False,
+            )[0]
+            components.guider.cleanup_models(components.transformer)
+
+        # Perform guidance
+        block_state.noise_pred, block_state.scheduler_step_kwargs = components.guider(guider_state)
+
+        return components, block_state
+
+
+class WanLoopAfterDenoiser(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("scheduler", UniPCMultistepScheduler),
+        ]
+
+    @property
+    def description(self) -> str:
+        return (
+            "Step within the denoising loop that updates the latents. "
+            "This block should be used to compose the `sub_blocks` attribute of a `LoopSequentialPipelineBlocks` "
+            "object (e.g. `WanDenoiseLoopWrapper`)"
+        )
+
+    @property
+    def inputs(self) -> List[Tuple[str, Any]]:
+        return []
+
+    @property
+    def intermediate_inputs(self) -> List[InputParam]:
+        return [
+            InputParam("generator"),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [OutputParam("latents", type_hint=torch.Tensor, description="The denoised latents")]
+
+    @torch.no_grad()
+    def __call__(self, components: WanModularPipeline, block_state: BlockState, i: int, t: torch.Tensor):
+        # Perform scheduler step using the predicted output
+        latents_dtype = block_state.latents.dtype
+        block_state.latents = components.scheduler.step(
+            block_state.noise_pred.float(),
+            t,
+            block_state.latents.float(),
+            **block_state.scheduler_step_kwargs,
+            return_dict=False,
+        )[0]
+
+        if block_state.latents.dtype != latents_dtype:
+            block_state.latents = block_state.latents.to(latents_dtype)
+
+        return components, block_state
+
+
+class WanDenoiseLoopWrapper(LoopSequentialPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def description(self) -> str:
+        return (
+            "Pipeline block that iteratively denoises the latents over `timesteps`. "
+            "The specific steps within each iteration can be customized with the `sub_blocks` attribute"
+        )
+
+    @property
+    def loop_expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec(
+                "guider",
+                ClassifierFreeGuidance,
+                config=FrozenDict({"guidance_scale": 5.0}),
+                default_creation_method="from_config",
+            ),
+            ComponentSpec("scheduler", UniPCMultistepScheduler),
+            ComponentSpec("transformer", WanTransformer3DModel),
+        ]
+
+    @property
+    def loop_intermediate_inputs(self) -> List[InputParam]:
+        return [
+            InputParam(
+                "timesteps",
+                required=True,
+                type_hint=torch.Tensor,
+                description="The timesteps to use for the denoising process. Can be generated in set_timesteps step.",
+            ),
+            InputParam(
+                "num_inference_steps",
+                required=True,
+                type_hint=int,
+                description="The number of inference steps to use for the denoising process. Can be generated in set_timesteps step.",
+            ),
+        ]
+
+    @torch.no_grad()
+    def __call__(self, components: WanModularPipeline, state: PipelineState) -> PipelineState:
+        block_state = self.get_block_state(state)
+
+        block_state.num_warmup_steps = max(
+            len(block_state.timesteps) - block_state.num_inference_steps * components.scheduler.order, 0
+        )
+
+        with self.progress_bar(total=block_state.num_inference_steps) as progress_bar:
+            for i, t in enumerate(block_state.timesteps):
+                components, block_state = self.loop_step(components, block_state, i=i, t=t)
+                if i == len(block_state.timesteps) - 1 or (
+                    (i + 1) > block_state.num_warmup_steps and (i + 1) % components.scheduler.order == 0
+                ):
+                    progress_bar.update()
+
+        self.set_block_state(state, block_state)
+
+        return components, state
+
+
+class WanDenoiseStep(WanDenoiseLoopWrapper):
+    block_classes = [
+        WanLoopDenoiser,
+        WanLoopAfterDenoiser,
+    ]
+    block_names = ["denoiser", "after_denoiser"]
+
+    @property
+    def description(self) -> str:
+        return (
+            "Denoise step that iteratively denoises the latents. \n"
+            "Its loop logic is defined in the `WanDenoiseLoopWrapper.__call__` method \n"
+            "At each iteration, it runs the blocks defined in `sub_blocks` sequentially:\n"
+            " - `WanLoopDenoiser`\n"
+            " - `WanLoopAfterDenoiser`\n"
+            "This block supports text2vid tasks."
+        )
@@ -0,0 +1,242 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import html
+from typing import List, Optional, Union
+
+import regex as re
+import torch
+from transformers import AutoTokenizer, UMT5EncoderModel
+
+from ...configuration_utils import FrozenDict
+from ...guiders import ClassifierFreeGuidance
+from ...utils import is_ftfy_available, logging
+from ..modular_pipeline import ModularPipelineBlocks, PipelineState
+from ..modular_pipeline_utils import ComponentSpec, ConfigSpec, InputParam, OutputParam
+from .modular_pipeline import WanModularPipeline
+
+
+if is_ftfy_available():
+    import ftfy
+
+
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+
+
+def basic_clean(text):
+    text = ftfy.fix_text(text)
+    text = html.unescape(html.unescape(text))
+    return text.strip()
+
+
+def whitespace_clean(text):
+    text = re.sub(r"\s+", " ", text)
+    text = text.strip()
+    return text
+
+
+def prompt_clean(text):
+    text = whitespace_clean(basic_clean(text))
+    return text
+
+
+class WanTextEncoderStep(ModularPipelineBlocks):
+    model_name = "wan"
+
+    @property
+    def description(self) -> str:
+        return "Text Encoder step that generates text_embeddings to guide the video generation"
+
+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("text_encoder", UMT5EncoderModel),
+            ComponentSpec("tokenizer", AutoTokenizer),
+            ComponentSpec(
+                "guider",
+                ClassifierFreeGuidance,
+                config=FrozenDict({"guidance_scale": 5.0}),
+                default_creation_method="from_config",
+            ),
+        ]
+
+    @property
+    def expected_configs(self) -> List[ConfigSpec]:
+        return []
+
+    @property
+    def inputs(self) -> List[InputParam]:
+        return [
+            InputParam("prompt"),
+            InputParam("negative_prompt"),
+            InputParam("attention_kwargs"),
+        ]
+
+    @property
+    def intermediate_outputs(self) -> List[OutputParam]:
+        return [
+            OutputParam(
+                "prompt_embeds",
+                type_hint=torch.Tensor,
+                kwargs_type="guider_input_fields",
+                description="text embeddings used to guide the video generation",
+            ),
+            OutputParam(
+                "negative_prompt_embeds",
+                type_hint=torch.Tensor,
+                kwargs_type="guider_input_fields",
+                description="negative text embeddings used to guide the video generation",
+            ),
+        ]
+
+    @staticmethod
+    def check_inputs(block_state):
+        if block_state.prompt is not None and (
+            not isinstance(block_state.prompt, str) and not isinstance(block_state.prompt, list)
+        ):
+            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(block_state.prompt)}")
+
+    @staticmethod
+    def _get_t5_prompt_embeds(
+        components,
+        prompt: Union[str, List[str]],
+        max_sequence_length: int,
+        device: torch.device,
+    ):
+        dtype = components.text_encoder.dtype
+        prompt = [prompt] if isinstance(prompt, str) else prompt
+        prompt = [prompt_clean(u) for u in prompt]
+
+        text_inputs = components.tokenizer(
+            prompt,
+            padding="max_length",
+            max_length=max_sequence_length,
+            truncation=True,
+            add_special_tokens=True,
+            return_attention_mask=True,
+            return_tensors="pt",
+        )
+        text_input_ids, mask = text_inputs.input_ids, text_inputs.attention_mask
+        seq_lens = mask.gt(0).sum(dim=1).long()
+        prompt_embeds = components.text_encoder(text_input_ids.to(device), mask.to(device)).last_hidden_state
+        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
+        prompt_embeds = [u[:v] for u, v in zip(prompt_embeds, seq_lens)]
+        prompt_embeds = torch.stack(
+            [torch.cat([u, u.new_zeros(max_sequence_length - u.size(0), u.size(1))]) for u in prompt_embeds], dim=0
+        )
+
+        return prompt_embeds
+
+    @staticmethod
+    def encode_prompt(
+        components,
+        prompt: Union[str, List[str]],
+        device: Optional[torch.device] = None,
+        num_videos_per_prompt: int = 1,
+        prepare_unconditional_embeds: bool = True,
+        negative_prompt: Optional[Union[str, List[str]]] = None,
+        prompt_embeds: Optional[torch.Tensor] = None,
+        negative_prompt_embeds: Optional[torch.Tensor] = None,
+        max_sequence_length: int = 512,
+    ):
+        r"""
+        Encodes the prompt into text encoder hidden states.
+
+        Args:
+            prompt (`str` or `List[str]`, *optional*):
+                prompt to be encoded
+            device: (`torch.device`):
+                torch device
+            num_videos_per_prompt (`int`):
+                number of videos that should be generated per prompt
+            prepare_unconditional_embeds (`bool`):
+                whether to prepare unconditional embeddings or not
+            negative_prompt (`str` or `List[str]`, *optional*):
+                The prompt or prompts not to guide the video generation. If not defined, one has to pass
+                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
+                less than `1`).
+            prompt_embeds (`torch.Tensor`, *optional*):
+                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
+                provided, text embeddings will be generated from `prompt` input argument.
+            negative_prompt_embeds (`torch.Tensor`, *optional*):
+                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
+                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
+                argument.
+            max_sequence_length (`int`, defaults to `512`):
+                The maximum number of text tokens to be used for the generation process.
+        """
+        device = device or components._execution_device
+        prompt = [prompt] if isinstance(prompt, str) else prompt
+        batch_size = len(prompt) if prompt is not None else prompt_embeds.shape[0]
+
+        if prompt_embeds is None:
+            prompt_embeds = WanTextEncoderStep._get_t5_prompt_embeds(components, prompt, max_sequence_length, device)
+
+        if prepare_unconditional_embeds and negative_prompt_embeds is None:
+            negative_prompt = negative_prompt or ""
+            negative_prompt = batch_size * [negative_prompt] if isinstance(negative_prompt, str) else negative_prompt
+
+            if prompt is not None and type(prompt) is not type(negative_prompt):
+                raise TypeError(
+                    f"`negative_prompt` should be the same type as `prompt`, but got {type(negative_prompt)} !="
+                    f" {type(prompt)}."
+                )
+            elif batch_size != len(negative_prompt):
+                raise ValueError(
+                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
+                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
+                    " the batch size of `prompt`."
+                )
+
+            negative_prompt_embeds = WanTextEncoderStep._get_t5_prompt_embeds(
+                components, negative_prompt, max_sequence_length, device
+            )
+
+        bs_embed, seq_len, _ = prompt_embeds.shape
+        prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
+        prompt_embeds = prompt_embeds.view(bs_embed * num_videos_per_prompt, seq_len, -1)
+
+        if prepare_unconditional_embeds:
+            negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_videos_per_prompt, 1)
+            negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_videos_per_prompt, seq_len, -1)
+
+        return prompt_embeds, negative_prompt_embeds
+
+    @torch.no_grad()
+    def __call__(self, components: WanModularPipeline, state: PipelineState) -> PipelineState:
+        # Get inputs and intermediates
+        block_state = self.get_block_state(state)
+        self.check_inputs(block_state)
+
+        block_state.prepare_unconditional_embeds = components.guider.num_conditions > 1
+        block_state.device = components._execution_device
+
+        # Encode input prompt
+        (
+            block_state.prompt_embeds,
+            block_state.negative_prompt_embeds,
+        ) = self.encode_prompt(
+            components,
+            block_state.prompt,
+            block_state.device,
+            1,
+            block_state.prepare_unconditional_embeds,
+            block_state.negative_prompt,
+            prompt_embeds=None,
+            negative_prompt_embeds=None,
+        )
+
+        # Add outputs
+        self.set_block_state(state, block_state)
+        return components, state
@@ -0,0 +1,144 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from ...utils import logging
+from ..modular_pipeline import AutoPipelineBlocks, SequentialPipelineBlocks
+from ..modular_pipeline_utils import InsertableDict
+from .before_denoise import (
+    WanInputStep,
+    WanPrepareLatentsStep,
+    WanSetTimestepsStep,
+)
+from .decoders import WanDecodeStep
+from .denoise import WanDenoiseStep
+from .encoders import WanTextEncoderStep
+
+
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+
+
+# before_denoise: text2vid
+class WanBeforeDenoiseStep(SequentialPipelineBlocks):
+    block_classes = [
+        WanInputStep,
+        WanSetTimestepsStep,
+        WanPrepareLatentsStep,
+    ]
+    block_names = ["input", "set_timesteps", "prepare_latents"]
+
+    @property
+    def description(self):
+        return (
+            "Before denoise step that prepares the inputs for the denoise step.\n"
+            + "This is a sequential pipeline block:\n"
+            + " - `WanInputStep` is used to adjust the batch size of the model inputs\n"
+            + " - `WanSetTimestepsStep` is used to set the timesteps\n"
+            + " - `WanPrepareLatentsStep` is used to prepare the latents\n"
+        )
+
+
+# before_denoise: all tasks (text2vid)
+class WanAutoBeforeDenoiseStep(AutoPipelineBlocks):
+    block_classes = [
+        WanBeforeDenoiseStep,
+    ]
+    block_names = ["text2vid"]
+    block_trigger_inputs = [None]
+
+    @property
+    def description(self):
+        return (
+            "Before denoise step that prepares the inputs for the denoise step.\n"
+            + "This is an auto pipeline block that works for text2vid.\n"
+            + " - `WanBeforeDenoiseStep` (text2vid) is used.\n"
+        )
+
+
+# denoise: text2vid
+class WanAutoDenoiseStep(AutoPipelineBlocks):
+    block_classes = [
+        WanDenoiseStep,
+    ]
+    block_names = ["denoise"]
+    block_trigger_inputs = [None]
+
+    @property
+    def description(self) -> str:
+        return (
+            "Denoise step that iteratively denoises the latents.\n"
+            "This is an auto pipeline block that works for text2vid tasks.\n"
+            " - `WanDenoiseStep` (denoise) for text2vid tasks."
+        )
+
+
+# decode: all tasks (text2vid)
+class WanAutoDecodeStep(AutoPipelineBlocks):
+    block_classes = [WanDecodeStep]
+    block_names = ["non-inpaint"]
+    block_trigger_inputs = [None]
+
+    @property
+    def description(self):
+        return "Decode step that decodes the denoised latents into video outputs.\n - `WanDecodeStep`"
+
+
+# text2vid
+class WanAutoBlocks(SequentialPipelineBlocks):
+    block_classes = [
+        WanTextEncoderStep,
+        WanAutoBeforeDenoiseStep,
+        WanAutoDenoiseStep,
+        WanAutoDecodeStep,
+    ]
+    block_names = [
+        "text_encoder",
+        "before_denoise",
+        "denoise",
+        "decoder",
+    ]
+
+    @property
+    def description(self):
+        return (
+            "Auto Modular pipeline for text-to-video using Wan.\n"
+            + "- for text-to-video generation, all you need to provide is `prompt`"
+        )
+
+
+TEXT2VIDEO_BLOCKS = InsertableDict(
+    [
+        ("text_encoder", WanTextEncoderStep),
+        ("input", WanInputStep),
+        ("set_timesteps", WanSetTimestepsStep),
+        ("prepare_latents", WanPrepareLatentsStep),
+        ("denoise", WanDenoiseStep),
+        ("decode", WanDecodeStep),
+    ]
+)
+
+
+AUTO_BLOCKS = InsertableDict(
+    [
+        ("text_encoder", WanTextEncoderStep),
+        ("before_denoise", WanAutoBeforeDenoiseStep),
+        ("denoise", WanAutoDenoiseStep),
+        ("decode", WanAutoDecodeStep),
+    ]
+)
+
+
+ALL_BLOCKS = {
+    "text2video": TEXT2VIDEO_BLOCKS,
+    "auto": AUTO_BLOCKS,
+}
@@ -0,0 +1,90 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from ...loaders import WanLoraLoaderMixin
+from ...pipelines.pipeline_utils import StableDiffusionMixin
+from ...utils import logging
+from ..modular_pipeline import ModularPipeline
+
+
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+
+
+class WanModularPipeline(
+    ModularPipeline,
+    StableDiffusionMixin,
+    WanLoraLoaderMixin,
+):
+    """
+    A ModularPipeline for Wan.
+
+    <Tip warning={true}>
+
+        This is an experimental feature and is likely to change in the future.
+
+    </Tip>
+    """
+
+    @property
+    def default_height(self):
+        return self.default_sample_height * self.vae_scale_factor_spatial
+
+    @property
+    def default_width(self):
+        return self.default_sample_width * self.vae_scale_factor_spatial
+
+    @property
+    def default_num_frames(self):
+        return (self.default_sample_num_frames - 1) * self.vae_scale_factor_temporal + 1
+
+    @property
+    def default_sample_height(self):
+        return 60
+
+    @property
+    def default_sample_width(self):
+        return 104
+
+    @property
+    def default_sample_num_frames(self):
+        return 21
+
+    @property
+    def vae_scale_factor_spatial(self):
+        vae_scale_factor = 8
+        if hasattr(self, "vae") and self.vae is not None:
+            vae_scale_factor = 2 ** len(self.vae.temperal_downsample)
+        return vae_scale_factor
+
+    @property
+    def vae_scale_factor_temporal(self):
+        vae_scale_factor = 4
+        if hasattr(self, "vae") and self.vae is not None:
+            vae_scale_factor = 2 ** sum(self.vae.temperal_downsample)
+        return vae_scale_factor
+
+    @property
+    def num_channels_transformer(self):
+        num_channels_transformer = 16
+        if hasattr(self, "transformer") and self.transformer is not None:
+            num_channels_transformer = self.transformer.config.in_channels
+        return num_channels_transformer
+
+    @property
+    def num_channels_latents(self):
+        num_channels_latents = 16
+        if hasattr(self, "vae") and self.vae is not None:
+            num_channels_latents = self.vae.config.z_dim
+        return num_channels_latents
@@ -663,11 +663,11 @@ class ChromaPipeline(
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 3.5):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -725,11 +725,11 @@ class ChromaImg2ImgPipeline(
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 5.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            strength (`float, *optional*, defaults to 0.9):
                Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image will
                be used as a starting point, adding more noise to it the larger the strength. The number of denoising
@@ -674,7 +674,8 @@ class FluxPipeline(
                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
                `text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
            true_cfg_scale (`float`, *optional*, defaults to 1.0):
-                When > 1.0 and a provided `negative_prompt`, enables true classifier-free guidance.
+                True classifier-free guidance (guidance scale) is enabled when `true_cfg_scale` > 1 and
+                `negative_prompt` is provided.
            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
                The height in pixels of the generated image. This is set to 1024 by default for the best results.
            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
@@ -687,11 +688,11 @@ class FluxPipeline(
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 3.5):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -661,11 +661,11 @@ class FluxControlPipeline(
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 3.5):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -795,11 +795,11 @@ class FluxKontextPipeline(
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 3.5):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -989,7 +989,8 @@ class FluxKontextInpaintPipeline(
                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
                `text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
            true_cfg_scale (`float`, *optional*, defaults to 1.0):
-                When > 1.0 and a provided `negative_prompt`, enables true classifier-free guidance.
+                True classifier-free guidance (guidance scale) is enabled when `true_cfg_scale` > 1 and
+                `negative_prompt` is provided.
            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
                The height in pixels of the generated image. This is set to 1024 by default for the best results.
            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
@@ -1015,11 +1016,11 @@ class FluxKontextInpaintPipeline(
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 3.5):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -763,11 +763,11 @@ class HiDreamImagePipeline(DiffusionPipeline, HiDreamImageLoraLoaderMixin):
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 3.5):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `true_cfg_scale` is
@@ -529,15 +529,14 @@ class HunyuanVideoPipeline(DiffusionPipeline, HunyuanVideoLoraLoaderMixin):
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            true_cfg_scale (`float`, *optional*, defaults to 1.0):
-                When > 1.0 and a provided `negative_prompt`, enables true classifier-free guidance.
+                True classifier-free guidance (guidance scale) is enabled when `true_cfg_scale` > 1 and
+                `negative_prompt` is provided.
            guidance_scale (`float`, defaults to `6.0`):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality. Note that the only available
-                HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and
-                conditional latent is not applied.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate videos more aligned with `prompt` at the expense of lower video quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_videos_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -643,11 +643,11 @@ class SanaSprintPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
                passed will be used. Must be in descending order.
            guidance_scale (`float`, *optional*, defaults to 4.5):
-                Guidance scale as defined in [Classifier-Free Diffusion
-                Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` is defined as `w` of equation 2.
-                of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
-                `guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
-                the text `prompt`, usually at the expense of lower image quality.
+                Embedded guidance scale is enabled by setting `guidance_scale` > 1. Higher `guidance_scale` encourages
+                a model to generate images more aligned with `prompt` at the expense of lower image quality.
+
+                Guidance-distilled models approximate true classifier-free guidance for `guidance_scale` > 1. Refer to
+                the [paper](https://huggingface.co/papers/2210.03142) to learn more.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            height (`int`, *optional*, defaults to self.unet.config.sample_size):
@@ -32,6 +32,36 @@ class StableDiffusionXLModularPipeline(metaclass=DummyObject):
        requires_backends(cls, ["torch", "transformers"])


+class WanAutoBlocks(metaclass=DummyObject):
+    _backends = ["torch", "transformers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+
+class WanModularPipeline(metaclass=DummyObject):
+    _backends = ["torch", "transformers"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch", "transformers"])
+
+    @classmethod
+    def from_config(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):
+        requires_backends(cls, ["torch", "transformers"])
+
+
 class AllegroPipeline(metaclass=DummyObject):
    _backends = ["torch", "transformers"]

@@ -75,7 +75,6 @@ from diffusers.utils.testing_utils import (
    require_torch_2,
    require_torch_accelerator,
    require_torch_accelerator_with_training,
-    require_torch_gpu,
    require_torch_multi_accelerator,
    require_torch_version_greater,
    run_test_in_subprocess,
@@ -1829,8 +1828,8 @@ class ModelTesterMixin:

        assert msg_substring in str(err_ctx.exception)

-    @parameterized.expand([0, "cuda", torch.device("cuda")])
-    @require_torch_gpu
+    @parameterized.expand([0, torch_device, torch.device(torch_device)])
+    @require_torch_accelerator
    def test_passing_non_dict_device_map_works(self, device_map):
        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
        model = self.model_class(**init_dict).eval()
@@ -1839,8 +1838,8 @@ class ModelTesterMixin:
            loaded_model = self.model_class.from_pretrained(tmpdir, device_map=device_map)
            _ = loaded_model(**inputs_dict)

-    @parameterized.expand([("", "cuda"), ("", torch.device("cuda"))])
-    @require_torch_gpu
+    @parameterized.expand([("", torch_device), ("", torch.device(torch_device))])
+    @require_torch_accelerator
    def test_passing_dict_device_map_works(self, name, device):
        # There are other valid dict-based `device_map` values too. It's best to refer to
        # the docs for those: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference#the-devicemap.
@@ -1945,10 +1944,11 @@ class ModelPushToHubTester(unittest.TestCase):
        delete_repo(self.repo_id, token=TOKEN)


-@require_torch_gpu
+@require_torch_accelerator
@require_torch_2
@is_torch_compile
@slow
+@require_torch_version_greater("2.7.1")
 class TorchCompileTesterMixin:
    different_shapes_for_compilation = None

@@ -2013,7 +2013,7 @@ class TorchCompileTesterMixin:
        model.eval()
        # TODO: Can test for other group offloading kwargs later if needed.
        group_offload_kwargs = {
-            "onload_device": "cuda",
+            "onload_device": torch_device,
            "offload_device": "cpu",
            "offload_type": "block_level",
            "num_blocks_per_group": 1,
@@ -2047,6 +2047,7 @@ class TorchCompileTesterMixin:
@require_torch_accelerator
@require_peft_backend
@require_peft_version_greater("0.14.0")
+@require_torch_version_greater("2.7.1")
@is_torch_compile
 class LoraHotSwappingForModelTesterMixin:
    """Test that hotswapping does not result in recompilation on the model directly.
@@ -358,7 +358,7 @@ class UNet2DConditionModelTests(ModelTesterMixin, UNetTesterMixin, unittest.Test
    model_class = UNet2DConditionModel
    main_input_name = "sample"
    # We override the items here because the unet under consideration is small.
-    model_split_percents = [0.5, 0.3, 0.4]
+    model_split_percents = [0.5, 0.34, 0.4]

    @property
    def dummy_input(self):
@@ -1,488 +0,0 @@
-# coding=utf-8
-# Copyright 2025 HuggingFace Inc.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import random
-import tempfile
-import unittest
-from typing import Any, Dict
-
-import numpy as np
-import torch
-from PIL import Image
-
-from diffusers import (
-    ClassifierFreeGuidance,
-    ModularPipeline,
-    StableDiffusionXLAutoBlocks,
-    StableDiffusionXLModularPipeline,
-)
-from diffusers.loaders import ModularIPAdapterMixin
-from diffusers.utils.testing_utils import (
-    enable_full_determinism,
-    floats_tensor,
-    torch_device,
-)
-
-from ...models.unets.test_models_unet_2d_condition import (
-    create_ip_adapter_state_dict,
-)
-from ..test_modular_pipelines_common import (
-    ModularPipelineTesterMixin,
-)
-
-
-enable_full_determinism()
-
-
-class SDXLModularTests:
-    """
-    This mixin defines method to create pipeline, base input and base test across all SDXL modular tests.
-    """
-
-    pipeline_class = StableDiffusionXLModularPipeline
-    pipeline_blocks_class = StableDiffusionXLAutoBlocks
-    repo = "hf-internal-testing/tiny-sdxl-modular"
-    params = frozenset(
-        [
-            "prompt",
-            "height",
-            "width",
-            "negative_prompt",
-            "cross_attention_kwargs",
-            "image",
-            "mask_image",
-        ]
-    )
-    batch_params = frozenset(["prompt", "negative_prompt", "image", "mask_image"])
-
-    def get_pipeline(self, components_manager=None, torch_dtype=torch.float32):
-        pipeline = self.pipeline_blocks_class().init_pipeline(self.repo, components_manager=components_manager)
-        pipeline.load_default_components(torch_dtype=torch_dtype)
-        return pipeline
-
-    def get_dummy_inputs(self, device, seed=0):
-        if str(device).startswith("mps"):
-            generator = torch.manual_seed(seed)
-        else:
-            generator = torch.Generator(device=device).manual_seed(seed)
-        inputs = {
-            "prompt": "A painting of a squirrel eating a burger",
-            "generator": generator,
-            "num_inference_steps": 2,
-            "output_type": "np",
-        }
-        return inputs
-
-    def _test_stable_diffusion_xl_euler(self, expected_image_shape, expected_slice, expected_max_diff=1e-2):
-        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
-        sd_pipe = self.get_pipeline()
-        sd_pipe = sd_pipe.to(device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        inputs = self.get_dummy_inputs(device)
-        image = sd_pipe(**inputs, output="images")
-        image_slice = image[0, -3:, -3:, -1]
-
-        assert image.shape == expected_image_shape
-
-        assert np.abs(image_slice.flatten() - expected_slice).max() < expected_max_diff, (
-            "Image Slice does not match expected slice"
-        )
-
-
-class SDXLModularIPAdapterTests:
-    """
-    This mixin is designed to test IP Adapter.
-    """
-
-    def test_pipeline_inputs_and_blocks(self):
-        blocks = self.pipeline_blocks_class()
-        parameters = blocks.input_names
-
-        assert issubclass(self.pipeline_class, ModularIPAdapterMixin)
-        assert "ip_adapter_image" in parameters, (
-            "`ip_adapter_image` argument must be supported by the `__call__` method"
-        )
-        assert "ip_adapter" in blocks.sub_blocks, "pipeline must contain an IPAdapter block"
-
-        _ = blocks.sub_blocks.pop("ip_adapter")
-        parameters = blocks.input_names
-        assert "ip_adapter_image" not in parameters, (
-            "`ip_adapter_image` argument must be removed from the `__call__` method"
-        )
-        assert "ip_adapter_image_embeds" not in parameters, (
-            "`ip_adapter_image_embeds` argument must be supported by the `__call__` method"
-        )
-
-    def _get_dummy_image_embeds(self, cross_attention_dim: int = 32):
-        return torch.randn((1, 1, cross_attention_dim), device=torch_device)
-
-    def _get_dummy_faceid_image_embeds(self, cross_attention_dim: int = 32):
-        return torch.randn((1, 1, 1, cross_attention_dim), device=torch_device)
-
-    def _get_dummy_masks(self, input_size: int = 64):
-        _masks = torch.zeros((1, 1, input_size, input_size), device=torch_device)
-        _masks[0, :, :, : int(input_size / 2)] = 1
-        return _masks
-
-    def _modify_inputs_for_ip_adapter_test(self, inputs: Dict[str, Any]):
-        blocks = self.pipeline_blocks_class()
-        _ = blocks.sub_blocks.pop("ip_adapter")
-        parameters = blocks.input_names
-        if "image" in parameters and "strength" in parameters:
-            inputs["num_inference_steps"] = 4
-
-        inputs["output_type"] = "np"
-        return inputs
-
-    def test_ip_adapter(self, expected_max_diff: float = 1e-4, expected_pipe_slice=None):
-        r"""Tests for IP-Adapter.
-
-        The following scenarios are tested:
-          - Single IP-Adapter with scale=0 should produce same output as no IP-Adapter.
-          - Multi IP-Adapter with scale=0 should produce same output as no IP-Adapter.
-          - Single IP-Adapter with scale!=0 should produce different output compared to no IP-Adapter.
-          - Multi IP-Adapter with scale!=0 should produce different output compared to no IP-Adapter.
-        """
-        # Raising the tolerance for this test when it's run on a CPU because we
-        # compare against static slices and that can be shaky (with a VVVV low probability).
-        expected_max_diff = 9e-4 if torch_device == "cpu" else expected_max_diff
-
-        blocks = self.pipeline_blocks_class()
-        _ = blocks.sub_blocks.pop("ip_adapter")
-        pipe = blocks.init_pipeline(self.repo)
-        pipe.load_default_components(torch_dtype=torch.float32)
-        pipe = pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-        cross_attention_dim = pipe.unet.config.get("cross_attention_dim")
-
-        # forward pass without ip adapter
-        inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
-        if expected_pipe_slice is None:
-            output_without_adapter = pipe(**inputs, output="images")
-        else:
-            output_without_adapter = expected_pipe_slice
-
-        # 1. Single IP-Adapter test cases
-        adapter_state_dict = create_ip_adapter_state_dict(pipe.unet)
-        pipe.unet._load_ip_adapter_weights(adapter_state_dict)
-
-        # forward pass with single ip adapter, but scale=0 which should have no effect
-        inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
-        inputs["ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
-        inputs["negative_ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
-        pipe.set_ip_adapter_scale(0.0)
-        output_without_adapter_scale = pipe(**inputs, output="images")
-        if expected_pipe_slice is not None:
-            output_without_adapter_scale = output_without_adapter_scale[0, -3:, -3:, -1].flatten()
-
-        # forward pass with single ip adapter, but with scale of adapter weights
-        inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
-        inputs["ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
-        inputs["negative_ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)]
-        pipe.set_ip_adapter_scale(42.0)
-        output_with_adapter_scale = pipe(**inputs, output="images")
-        if expected_pipe_slice is not None:
-            output_with_adapter_scale = output_with_adapter_scale[0, -3:, -3:, -1].flatten()
-
-        max_diff_without_adapter_scale = np.abs(output_without_adapter_scale - output_without_adapter).max()
-        max_diff_with_adapter_scale = np.abs(output_with_adapter_scale - output_without_adapter).max()
-
-        assert max_diff_without_adapter_scale < expected_max_diff, (
-            "Output without ip-adapter must be same as normal inference"
-        )
-        assert max_diff_with_adapter_scale > 1e-2, "Output with ip-adapter must be different from normal inference"
-
-        # 2. Multi IP-Adapter test cases
-        adapter_state_dict_1 = create_ip_adapter_state_dict(pipe.unet)
-        adapter_state_dict_2 = create_ip_adapter_state_dict(pipe.unet)
-        pipe.unet._load_ip_adapter_weights([adapter_state_dict_1, adapter_state_dict_2])
-
-        # forward pass with multi ip adapter, but scale=0 which should have no effect
-        inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
-        inputs["ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)] * 2
-        inputs["negative_ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)] * 2
-        pipe.set_ip_adapter_scale([0.0, 0.0])
-        output_without_multi_adapter_scale = pipe(**inputs, output="images")
-        if expected_pipe_slice is not None:
-            output_without_multi_adapter_scale = output_without_multi_adapter_scale[0, -3:, -3:, -1].flatten()
-
-        # forward pass with multi ip adapter, but with scale of adapter weights
-        inputs = self._modify_inputs_for_ip_adapter_test(self.get_dummy_inputs(torch_device))
-        inputs["ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)] * 2
-        inputs["negative_ip_adapter_embeds"] = [self._get_dummy_image_embeds(cross_attention_dim)] * 2
-        pipe.set_ip_adapter_scale([42.0, 42.0])
-        output_with_multi_adapter_scale = pipe(**inputs, output="images")
-        if expected_pipe_slice is not None:
-            output_with_multi_adapter_scale = output_with_multi_adapter_scale[0, -3:, -3:, -1].flatten()
-
-        max_diff_without_multi_adapter_scale = np.abs(
-            output_without_multi_adapter_scale - output_without_adapter
-        ).max()
-        max_diff_with_multi_adapter_scale = np.abs(output_with_multi_adapter_scale - output_without_adapter).max()
-        assert max_diff_without_multi_adapter_scale < expected_max_diff, (
-            "Output without multi-ip-adapter must be same as normal inference"
-        )
-        assert max_diff_with_multi_adapter_scale > 1e-2, (
-            "Output with multi-ip-adapter scale must be different from normal inference"
-        )
-
-
-class SDXLModularControlNetTests:
-    """
-    This mixin is designed to test ControlNet.
-    """
-
-    def test_pipeline_inputs(self):
-        blocks = self.pipeline_blocks_class()
-        parameters = blocks.input_names
-
-        assert "control_image" in parameters, "`control_image` argument must be supported by the `__call__` method"
-        assert "controlnet_conditioning_scale" in parameters, (
-            "`controlnet_conditioning_scale` argument must be supported by the `__call__` method"
-        )
-
-    def _modify_inputs_for_controlnet_test(self, inputs: Dict[str, Any]):
-        controlnet_embedder_scale_factor = 2
-        image = torch.randn(
-            (1, 3, 32 * controlnet_embedder_scale_factor, 32 * controlnet_embedder_scale_factor),
-            device=torch_device,
-        )
-        inputs["control_image"] = image
-        return inputs
-
-    def test_controlnet(self, expected_max_diff: float = 1e-4, expected_pipe_slice=None):
-        r"""Tests for ControlNet.
-
-        The following scenarios are tested:
-          - Single ControlNet with scale=0 should produce same output as no ControlNet.
-          - Single ControlNet with scale!=0 should produce different output compared to no ControlNet.
-        """
-        # Raising the tolerance for this test when it's run on a CPU because we
-        # compare against static slices and that can be shaky (with a very low probability).
-        expected_max_diff = 9e-4 if torch_device == "cpu" else expected_max_diff
-
-        pipe = self.get_pipeline()
-        pipe = pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-
-        # forward pass without controlnet
-        inputs = self.get_dummy_inputs(torch_device)
-        output_without_controlnet = pipe(**inputs, output="images")
-        output_without_controlnet = output_without_controlnet[0, -3:, -3:, -1].flatten()
-
-        # forward pass with single controlnet, but scale=0 which should have no effect
-        inputs = self._modify_inputs_for_controlnet_test(self.get_dummy_inputs(torch_device))
-        inputs["controlnet_conditioning_scale"] = 0.0
-        output_without_controlnet_scale = pipe(**inputs, output="images")
-        output_without_controlnet_scale = output_without_controlnet_scale[0, -3:, -3:, -1].flatten()
-
-        # forward pass with single controlnet, with a non-zero conditioning scale
-        inputs = self._modify_inputs_for_controlnet_test(self.get_dummy_inputs(torch_device))
-        inputs["controlnet_conditioning_scale"] = 42.0
-        output_with_controlnet_scale = pipe(**inputs, output="images")
-        output_with_controlnet_scale = output_with_controlnet_scale[0, -3:, -3:, -1].flatten()
-
-        max_diff_without_controlnet_scale = np.abs(output_without_controlnet_scale - output_without_controlnet).max()
-        max_diff_with_controlnet_scale = np.abs(output_with_controlnet_scale - output_without_controlnet).max()
-
-        assert max_diff_without_controlnet_scale < expected_max_diff, (
-            "Output without controlnet must be same as normal inference"
-        )
-        assert max_diff_with_controlnet_scale > 1e-2, "Output with controlnet must be different from normal inference"
-
-    def test_controlnet_cfg(self):
-        pipe = self.get_pipeline()
-        pipe = pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-
-        # forward pass with CFG not applied
-        guider = ClassifierFreeGuidance(guidance_scale=1.0)
-        pipe.update_components(guider=guider)
-
-        inputs = self._modify_inputs_for_controlnet_test(self.get_dummy_inputs(torch_device))
-        out_no_cfg = pipe(**inputs, output="images")
-
-        # forward pass with CFG applied
-        guider = ClassifierFreeGuidance(guidance_scale=7.5)
-        pipe.update_components(guider=guider)
-        inputs = self._modify_inputs_for_controlnet_test(self.get_dummy_inputs(torch_device))
-        out_cfg = pipe(**inputs, output="images")
-
-        assert out_cfg.shape == out_no_cfg.shape
-        max_diff = np.abs(out_cfg - out_no_cfg).max()
-        assert max_diff > 1e-2, "Output with CFG must be different from normal inference"
-
-
-class SDXLModularGuiderTests:
-    def test_guider_cfg(self):
-        pipe = self.get_pipeline()
-        pipe = pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-
-        # forward pass with CFG not applied
-        guider = ClassifierFreeGuidance(guidance_scale=1.0)
-        pipe.update_components(guider=guider)
-
-        inputs = self.get_dummy_inputs(torch_device)
-        out_no_cfg = pipe(**inputs, output="images")
-
-        # forward pass with CFG applied
-        guider = ClassifierFreeGuidance(guidance_scale=7.5)
-        pipe.update_components(guider=guider)
-        inputs = self.get_dummy_inputs(torch_device)
-        out_cfg = pipe(**inputs, output="images")
-
-        assert out_cfg.shape == out_no_cfg.shape
-        max_diff = np.abs(out_cfg - out_no_cfg).max()
-        assert max_diff > 1e-2, "Output with CFG must be different from normal inference"
-
-
-class SDXLModularPipelineFastTests(
-    SDXLModularTests,
-    SDXLModularIPAdapterTests,
-    SDXLModularControlNetTests,
-    SDXLModularGuiderTests,
-    ModularPipelineTesterMixin,
-    unittest.TestCase,
-):
-    """Test cases for Stable Diffusion XL modular pipeline fast tests."""
-
-    def test_stable_diffusion_xl_euler(self):
-        self._test_stable_diffusion_xl_euler(
-            expected_image_shape=(1, 64, 64, 3),
-            expected_slice=[
-                0.5966781,
-                0.62939394,
-                0.48465094,
-                0.51573336,
-                0.57593524,
-                0.47035995,
-                0.53410417,
-                0.51436996,
-                0.47313565,
-            ],
-            expected_max_diff=1e-2,
-        )
-
-    def test_inference_batch_single_identical(self):
-        super().test_inference_batch_single_identical(expected_max_diff=3e-3)
-
-    def test_stable_diffusion_xl_save_from_pretrained(self):
-        pipes = []
-        sd_pipe = self.get_pipeline().to(torch_device)
-        pipes.append(sd_pipe)
-
-        with tempfile.TemporaryDirectory() as tmpdirname:
-            sd_pipe.save_pretrained(tmpdirname)
-            sd_pipe = ModularPipeline.from_pretrained(tmpdirname).to(torch_device)
-            sd_pipe.load_default_components(torch_dtype=torch.float32)
-            sd_pipe.to(torch_device)
-        pipes.append(sd_pipe)
-
-        image_slices = []
-        for pipe in pipes:
-            inputs = self.get_dummy_inputs(torch_device)
-            image = pipe(**inputs, output="images")
-
-            image_slices.append(image[0, -3:, -3:, -1].flatten())
-
-        assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
-
-
-class SDXLImg2ImgModularPipelineFastTests(
-    SDXLModularTests,
-    SDXLModularIPAdapterTests,
-    SDXLModularControlNetTests,
-    SDXLModularGuiderTests,
-    ModularPipelineTesterMixin,
-    unittest.TestCase,
-):
-    """Test cases for Stable Diffusion XL image-to-image modular pipeline fast tests."""
-
-    def get_dummy_inputs(self, device, seed=0):
-        inputs = super().get_dummy_inputs(device, seed)
-        image = floats_tensor((1, 3, 64, 64), rng=random.Random(seed)).to(device)
-        image = image / 2 + 0.5
-        inputs["image"] = image
-        inputs["strength"] = 0.8
-
-        return inputs
-
-    def test_stable_diffusion_xl_euler(self):
-        self._test_stable_diffusion_xl_euler(
-            expected_image_shape=(1, 64, 64, 3),
-            expected_slice=[
-                0.56943184,
-                0.4702148,
-                0.48048905,
-                0.6235963,
-                0.551138,
-                0.49629188,
-                0.60031277,
-                0.5688907,
-                0.43996853,
-            ],
-            expected_max_diff=1e-2,
-        )
-
-    def test_inference_batch_single_identical(self):
-        super().test_inference_batch_single_identical(expected_max_diff=3e-3)
-
-
-class SDXLInpaintingModularPipelineFastTests(
-    SDXLModularTests,
-    SDXLModularIPAdapterTests,
-    SDXLModularControlNetTests,
-    SDXLModularGuiderTests,
-    ModularPipelineTesterMixin,
-    unittest.TestCase,
-):
-    """Test cases for Stable Diffusion XL inpainting modular pipeline fast tests."""
-
-    def get_dummy_inputs(self, device, seed=0):
-        inputs = super().get_dummy_inputs(device, seed)
-        image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
-        image = image.cpu().permute(0, 2, 3, 1)[0]
-        init_image = Image.fromarray(np.uint8(image)).convert("RGB").resize((64, 64))
-        # create mask
-        image[8:, 8:, :] = 255
-        mask_image = Image.fromarray(np.uint8(image)).convert("L").resize((64, 64))
-
-        inputs["image"] = init_image
-        inputs["mask_image"] = mask_image
-        inputs["strength"] = 1.0
-
-        return inputs
-
-    def test_stable_diffusion_xl_euler(self):
-        self._test_stable_diffusion_xl_euler(
-            expected_image_shape=(1, 64, 64, 3),
-            expected_slice=[
-                0.40872607,
-                0.38842705,
-                0.34893104,
-                0.47837183,
-                0.43792963,
-                0.5332134,
-                0.3716843,
-                0.47274873,
-                0.45000193,
-            ],
-            expected_max_diff=1e-2,
-        )
-
-    def test_inference_batch_single_identical(self):
-        super().test_inference_batch_single_identical(expected_max_diff=3e-3)
@@ -1,358 +0,0 @@
-import gc
-import tempfile
-import unittest
-from typing import Callable, Union
-
-import numpy as np
-import torch
-
-import diffusers
-from diffusers import ComponentsManager, ModularPipeline, ModularPipelineBlocks
-from diffusers.utils import logging
-from diffusers.utils.testing_utils import (
-    backend_empty_cache,
-    numpy_cosine_similarity_distance,
-    require_accelerator,
-    require_torch,
-    torch_device,
-)
-
-
-def to_np(tensor):
-    if isinstance(tensor, torch.Tensor):
-        tensor = tensor.detach().cpu().numpy()
-
-    return tensor
-
-
-@require_torch
-class ModularPipelineTesterMixin:
-    """
-    This mixin is designed to be used with unittest.TestCase classes.
-    It provides a set of common tests for each modular pipeline,
-    including:
-    - test_pipeline_call_signature: check if the pipeline's __call__ method has all required parameters
-    - test_inference_batch_consistent: check if the pipeline's __call__ method can handle batch inputs
-    - test_inference_batch_single_identical: check if the pipeline's __call__ method can handle single input
-    - test_float16_inference: check if the pipeline's __call__ method can handle float16 inputs
-    - test_to_device: check if the pipeline's __call__ method can handle different devices
-    """
-
-    # Canonical parameters that are passed to `__call__` regardless
-    # of the type of pipeline. They are always optional and have common
-    # sense default values.
-    optional_params = frozenset(
-        [
-            "num_inference_steps",
-            "num_images_per_prompt",
-            "latents",
-            "output_type",
-        ]
-    )
-    # this is modular specific: generator needs to be an intermediate input because it's mutable
-    intermediate_params = frozenset(
-        [
-            "generator",
-        ]
-    )
-
-    def get_generator(self, seed):
-        device = torch_device if torch_device != "mps" else "cpu"
-        generator = torch.Generator(device).manual_seed(seed)
-        return generator
-
-    @property
-    def pipeline_class(self) -> Union[Callable, ModularPipeline]:
-        raise NotImplementedError(
-            "You need to set the attribute `pipeline_class = ClassNameOfPipeline` in the child test class. "
-            "See existing pipeline tests for reference."
-        )
-
-    @property
-    def repo(self) -> str:
-        raise NotImplementedError(
-            "You need to set the attribute `repo` in the child test class. See existing pipeline tests for reference."
-        )
-
-    @property
-    def pipeline_blocks_class(self) -> Union[Callable, ModularPipelineBlocks]:
-        raise NotImplementedError(
-            "You need to set the attribute `pipeline_blocks_class = ClassNameOfPipelineBlocks` in the child test class. "
-            "See existing pipeline tests for reference."
-        )
-
-    def get_pipeline(self):
-        raise NotImplementedError(
-            "You need to implement `get_pipeline(self)` in the child test class. "
-            "See existing pipeline tests for reference."
-        )
-
-    def get_dummy_inputs(self, device, seed=0):
-        raise NotImplementedError(
-            "You need to implement `get_dummy_inputs(self, device, seed)` in the child test class. "
-            "See existing pipeline tests for reference."
-        )
-
-    @property
-    def params(self) -> frozenset:
-        raise NotImplementedError(
-            "You need to set the attribute `params` in the child test class. "
-            "`params` are checked for if all values are present in `__call__`'s signature."
-            " You can set `params` using one of the common set of parameters defined in `pipeline_params.py`"
-            " e.g., `TEXT_TO_IMAGE_PARAMS` defines the common parameters used in text to "
-            "image pipelines, including prompts and prompt embedding overrides. "
-            "If your pipeline's set of arguments has minor changes from one of the common sets of arguments, "
-            "do not make modifications to the existing common sets of arguments. I.e. a text to image pipeline "
-            "with non-configurable height and width arguments should set the attribute as "
-            "`params = TEXT_TO_IMAGE_PARAMS - {'height', 'width'}`. "
-            "See existing pipeline tests for reference."
-        )
-
-    @property
-    def batch_params(self) -> frozenset:
-        raise NotImplementedError(
-            "You need to set the attribute `batch_params` in the child test class. "
-            "`batch_params` are the parameters required to be batched when passed to the pipeline's "
-            "`__call__` method. `pipeline_params.py` provides some common sets of parameters such as "
-            "`TEXT_TO_IMAGE_BATCH_PARAMS`, `IMAGE_VARIATION_BATCH_PARAMS`, etc... If your pipeline's "
-            "set of batch arguments has minor changes from one of the common sets of batch arguments, "
-            "do not make modifications to the existing common sets of batch arguments. I.e. a text to "
-            "image pipeline where `negative_prompt` is not batched should set the attribute as "
-            "`batch_params = TEXT_TO_IMAGE_BATCH_PARAMS - {'negative_prompt'}`. "
-            "See existing pipeline tests for reference."
-        )
-
-    def setUp(self):
-        # clean up the VRAM before each test
-        super().setUp()
-        torch.compiler.reset()
-        gc.collect()
-        backend_empty_cache(torch_device)
-
-    def tearDown(self):
-        # clean up the VRAM after each test in case of CUDA runtime errors
-        super().tearDown()
-        torch.compiler.reset()
-        gc.collect()
-        backend_empty_cache(torch_device)
-
-    def test_pipeline_call_signature(self):
-        pipe = self.get_pipeline()
-        input_parameters = pipe.blocks.input_names
-        optional_parameters = pipe.default_call_parameters
-
-        def _check_for_parameters(parameters, expected_parameters, param_type):
-            remaining_parameters = {param for param in parameters if param not in expected_parameters}
-            assert len(remaining_parameters) == 0, (
-                f"Required {param_type} parameters not present: {remaining_parameters}"
-            )
-
-        _check_for_parameters(self.params, input_parameters, "input")
-        _check_for_parameters(self.optional_params, optional_parameters, "optional")
-
-    def test_inference_batch_consistent(self, batch_sizes=[2], batch_generator=True):
-        pipe = self.get_pipeline()
-        pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-
-        inputs = self.get_dummy_inputs(torch_device)
-        inputs["generator"] = self.get_generator(0)
-
-        logger = logging.get_logger(pipe.__module__)
-        logger.setLevel(level=diffusers.logging.FATAL)
-
-        # prepare batched inputs
-        batched_inputs = []
-        for batch_size in batch_sizes:
-            batched_input = {}
-            batched_input.update(inputs)
-
-            for name in self.batch_params:
-                if name not in inputs:
-                    continue
-
-                value = inputs[name]
-                batched_input[name] = batch_size * [value]
-
-            if batch_generator and "generator" in inputs:
-                batched_input["generator"] = [self.get_generator(i) for i in range(batch_size)]
-
-            if "batch_size" in inputs:
-                batched_input["batch_size"] = batch_size
-
-            batched_inputs.append(batched_input)
-
-        logger.setLevel(level=diffusers.logging.WARNING)
-        for batch_size, batched_input in zip(batch_sizes, batched_inputs):
-            output = pipe(**batched_input, output="images")
-            assert len(output) == batch_size, "Output is different from expected batch size"
-
-    def test_inference_batch_single_identical(
-        self,
-        batch_size=2,
-        expected_max_diff=1e-4,
-    ):
-        pipe = self.get_pipeline()
-        pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-        inputs = self.get_dummy_inputs(torch_device)
-
-        # Reset generator in case it has been used in self.get_dummy_inputs
-        inputs["generator"] = self.get_generator(0)
-
-        logger = logging.get_logger(pipe.__module__)
-        logger.setLevel(level=diffusers.logging.FATAL)
-
-        # batchify inputs
-        batched_inputs = {}
-        batched_inputs.update(inputs)
-
-        for name in self.batch_params:
-            if name not in inputs:
-                continue
-
-            value = inputs[name]
-            batched_inputs[name] = batch_size * [value]
-
-        if "generator" in inputs:
-            batched_inputs["generator"] = [self.get_generator(i) for i in range(batch_size)]
-
-        if "batch_size" in inputs:
-            batched_inputs["batch_size"] = batch_size
-
-        output = pipe(**inputs, output="images")
-        output_batch = pipe(**batched_inputs, output="images")
-
-        assert output_batch.shape[0] == batch_size
-
-        max_diff = np.abs(to_np(output_batch[0]) - to_np(output[0])).max()
-        assert max_diff < expected_max_diff, "Batch inference results different from single inference results"
-
-    @unittest.skipIf(torch_device not in ["cuda", "xpu"], reason="float16 requires CUDA or XPU")
-    @require_accelerator
-    def test_float16_inference(self, expected_max_diff=5e-2):
-        pipe = self.get_pipeline()
-        pipe.to(torch_device, torch.float32)
-        pipe.set_progress_bar_config(disable=None)
-
-        pipe_fp16 = self.get_pipeline()
-        pipe_fp16.to(torch_device, torch.float16)
-        pipe_fp16.set_progress_bar_config(disable=None)
-
-        inputs = self.get_dummy_inputs(torch_device)
-        # Reset generator in case it is used inside dummy inputs
-        if "generator" in inputs:
-            inputs["generator"] = self.get_generator(0)
-        output = pipe(**inputs, output="images")
-
-        fp16_inputs = self.get_dummy_inputs(torch_device)
-        # Reset generator in case it is used inside dummy inputs
-        if "generator" in fp16_inputs:
-            fp16_inputs["generator"] = self.get_generator(0)
-        output_fp16 = pipe_fp16(**fp16_inputs, output="images")
-
-        if isinstance(output, torch.Tensor):
-            output = output.cpu()
-            output_fp16 = output_fp16.cpu()
-
-        max_diff = numpy_cosine_similarity_distance(output.flatten(), output_fp16.flatten())
-        assert max_diff < expected_max_diff, "FP16 inference is different from FP32 inference"
-
-    @require_accelerator
-    def test_to_device(self):
-        pipe = self.get_pipeline()
-        pipe.set_progress_bar_config(disable=None)
-
-        pipe.to("cpu")
-        model_devices = [
-            component.device.type for component in pipe.components.values() if hasattr(component, "device")
-        ]
-        assert all(device == "cpu" for device in model_devices), "All pipeline components are not on CPU"
-
-        pipe.to(torch_device)
-        model_devices = [
-            component.device.type for component in pipe.components.values() if hasattr(component, "device")
-        ]
-        assert all(device == torch_device for device in model_devices), (
-            "All pipeline components are not on accelerator device"
-        )
-
-    def test_inference_is_not_nan_cpu(self):
-        pipe = self.get_pipeline()
-        pipe.set_progress_bar_config(disable=None)
-        pipe.to("cpu")
-
-        output = pipe(**self.get_dummy_inputs("cpu"), output="images")
-        assert np.isnan(to_np(output)).sum() == 0, "CPU Inference returns NaN"
-
-    @require_accelerator
-    def test_inference_is_not_nan(self):
-        pipe = self.get_pipeline()
-        pipe.set_progress_bar_config(disable=None)
-        pipe.to(torch_device)
-
-        output = pipe(**self.get_dummy_inputs(torch_device), output="images")
-        assert np.isnan(to_np(output)).sum() == 0, "Accelerator Inference returns NaN"
-
-    def test_num_images_per_prompt(self):
-        pipe = self.get_pipeline()
-
-        if "num_images_per_prompt" not in pipe.blocks.input_names:
-            return
-
-        pipe = pipe.to(torch_device)
-        pipe.set_progress_bar_config(disable=None)
-
-        batch_sizes = [1, 2]
-        num_images_per_prompts = [1, 2]
-
-        for batch_size in batch_sizes:
-            for num_images_per_prompt in num_images_per_prompts:
-                inputs = self.get_dummy_inputs(torch_device)
-
-                for key in inputs.keys():
-                    if key in self.batch_params:
-                        inputs[key] = batch_size * [inputs[key]]
-
-                images = pipe(**inputs, num_images_per_prompt=num_images_per_prompt, output="images")
-
-                assert images.shape[0] == batch_size * num_images_per_prompt
-
-    @require_accelerator
-    def test_components_auto_cpu_offload_inference_consistent(self):
-        base_pipe = self.get_pipeline().to(torch_device)
-
-        cm = ComponentsManager()
-        cm.enable_auto_cpu_offload(device=torch_device)
-        offload_pipe = self.get_pipeline(components_manager=cm)
-
-        image_slices = []
-        for pipe in [base_pipe, offload_pipe]:
-            inputs = self.get_dummy_inputs(torch_device)
-            image = pipe(**inputs, output="images")
-
-            image_slices.append(image[0, -3:, -3:, -1].flatten())
-
-        assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
-
-    def test_save_from_pretrained(self):
-        pipes = []
-        base_pipe = self.get_pipeline().to(torch_device)
-        pipes.append(base_pipe)
-
-        with tempfile.TemporaryDirectory() as tmpdirname:
-            base_pipe.save_pretrained(tmpdirname)
-            pipe = ModularPipeline.from_pretrained(tmpdirname).to(torch_device)
-            pipe.load_default_components(torch_dtype=torch.float16)
-            pipe.to(torch_device)
-
-        pipes.append(pipe)
-
-        image_slices = []
-        for pipe in pipes:
-            inputs = self.get_dummy_inputs(torch_device)
-            image = pipe(**inputs, output="images")
-
-            image_slices.append(image[0, -3:, -3:, -1].flatten())
-
-        assert np.abs(image_slices[0] - image_slices[1]).max() < 1e-3
@@ -20,6 +20,12 @@ TEXT_TO_IMAGE_PARAMS = frozenset(
    ]
 )

+TEXT_TO_IMAGE_BATCH_PARAMS = frozenset(["prompt", "negative_prompt"])
+
+TEXT_TO_IMAGE_IMAGE_PARAMS = frozenset([])
+
+IMAGE_TO_IMAGE_IMAGE_PARAMS = frozenset(["image"])
+
 IMAGE_VARIATION_PARAMS = frozenset(
    [
        "image",
@@ -29,6 +35,8 @@ IMAGE_VARIATION_PARAMS = frozenset(
    ]
 )

+IMAGE_VARIATION_BATCH_PARAMS = frozenset(["image"])
+
 TEXT_GUIDED_IMAGE_VARIATION_PARAMS = frozenset(
    [
        "prompt",
@@ -42,6 +50,8 @@ TEXT_GUIDED_IMAGE_VARIATION_PARAMS = frozenset(
    ]
 )

+TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS = frozenset(["prompt", "image", "negative_prompt"])
+
 TEXT_GUIDED_IMAGE_INPAINTING_PARAMS = frozenset(
    [
        # Text guided image variation with an image mask
@@ -57,6 +67,8 @@ TEXT_GUIDED_IMAGE_INPAINTING_PARAMS = frozenset(
    ]
 )

+TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["prompt", "image", "mask_image", "negative_prompt"])
+
 IMAGE_INPAINTING_PARAMS = frozenset(
    [
        # image variation with an image mask
@@ -68,6 +80,8 @@ IMAGE_INPAINTING_PARAMS = frozenset(
    ]
 )

+IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["image", "mask_image"])
+
 IMAGE_GUIDED_IMAGE_INPAINTING_PARAMS = frozenset(
    [
        "example_image",
@@ -79,12 +93,20 @@ IMAGE_GUIDED_IMAGE_INPAINTING_PARAMS = frozenset(
    ]
 )

-UNCONDITIONAL_IMAGE_GENERATION_PARAMS = frozenset(["batch_size"])
+IMAGE_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["example_image", "image", "mask_image"])

 CLASS_CONDITIONED_IMAGE_GENERATION_PARAMS = frozenset(["class_labels"])

 CLASS_CONDITIONED_IMAGE_GENERATION_BATCH_PARAMS = frozenset(["class_labels"])

+UNCONDITIONAL_IMAGE_GENERATION_PARAMS = frozenset(["batch_size"])
+
+UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS = frozenset([])
+
+UNCONDITIONAL_AUDIO_GENERATION_PARAMS = frozenset(["batch_size"])
+
+UNCONDITIONAL_AUDIO_GENERATION_BATCH_PARAMS = frozenset([])
+
 TEXT_TO_AUDIO_PARAMS = frozenset(
    [
        "prompt",
@@ -97,38 +119,11 @@ TEXT_TO_AUDIO_PARAMS = frozenset(
    ]
 )

-TOKENS_TO_AUDIO_GENERATION_PARAMS = frozenset(["input_tokens"])
-
-UNCONDITIONAL_AUDIO_GENERATION_PARAMS = frozenset(["batch_size"])
-
-# image params
-TEXT_TO_IMAGE_IMAGE_PARAMS = frozenset([])
-
-IMAGE_TO_IMAGE_IMAGE_PARAMS = frozenset(["image"])
-
-
-# batch params
-TEXT_TO_IMAGE_BATCH_PARAMS = frozenset(["prompt", "negative_prompt"])
-
-IMAGE_VARIATION_BATCH_PARAMS = frozenset(["image"])
-
-TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS = frozenset(["prompt", "image", "negative_prompt"])
-
-TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["prompt", "image", "mask_image", "negative_prompt"])
-
-IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["image", "mask_image"])
-
-IMAGE_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS = frozenset(["example_image", "image", "mask_image"])
-
-UNCONDITIONAL_IMAGE_GENERATION_BATCH_PARAMS = frozenset([])
-
-UNCONDITIONAL_AUDIO_GENERATION_BATCH_PARAMS = frozenset([])
-
 TEXT_TO_AUDIO_BATCH_PARAMS = frozenset(["prompt", "negative_prompt"])
+TOKENS_TO_AUDIO_GENERATION_PARAMS = frozenset(["input_tokens"])

 TOKENS_TO_AUDIO_GENERATION_BATCH_PARAMS = frozenset(["input_tokens"])

-VIDEO_TO_VIDEO_BATCH_PARAMS = frozenset(["prompt", "negative_prompt", "video"])
-
-# callback params
 TEXT_TO_IMAGE_CALLBACK_CFG_PARAMS = frozenset(["prompt_embeds"])
+
+VIDEO_TO_VIDEO_BATCH_PARAMS = frozenset(["prompt", "negative_prompt", "video"])
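The test classes in this diff build their `params` by set difference from these shared frozensets, for example `TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs", "height", "width"}`. A minimal sketch of that pattern follows. The contents of `TEXT_TO_IMAGE_PARAMS` here are a shortened, hypothetical stand-in (the full set is not visible in this hunk), and `missing_params` is an illustrative helper modeled loosely on the `_check_for_parameters` check in the removed mixin, not a function from the repository.

```python
# Hypothetical, shortened stand-in: the real TEXT_TO_IMAGE_PARAMS is only partly shown in this diff.
TEXT_TO_IMAGE_PARAMS = frozenset(
    ["prompt", "height", "width", "guidance_scale", "cross_attention_kwargs"]
)

# Set difference returns a new frozenset, so the shared constant is never mutated.
wan_params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs", "height", "width"}


def missing_params(required, available):
    """Return the required parameter names that are absent from `available`."""
    return {name for name in required if name not in available}


# A pipeline exposing only `prompt` would be reported as missing `guidance_scale`.
print(missing_params(wan_params, ["prompt"]))
```

Because frozensets are immutable, one test class cannot accidentally change the parameter set another class relies on, which is presumably why the shared constants are defined this way.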
@@ -15,7 +15,6 @@
 import gc
 import unittest

-import numpy as np
 import torch
 from transformers import AutoTokenizer, T5EncoderModel

@@ -29,9 +28,7 @@ from diffusers.utils.testing_utils import (
 )

 from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
-from ..test_pipelines_common import (
-    PipelineTesterMixin,
-)
+from ..test_pipelines_common import PipelineTesterMixin


 enable_full_determinism()
@@ -127,11 +124,15 @@ class WanPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
        inputs = self.get_dummy_inputs(device)
        video = pipe(**inputs).frames
        generated_video = video[0]
-
        self.assertEqual(generated_video.shape, (9, 3, 16, 16))
-        expected_video = torch.randn(9, 3, 16, 16)
-        max_diff = np.abs(generated_video - expected_video).max()
-        self.assertLessEqual(max_diff, 1e10)
+
+        # fmt: off
+        expected_slice = torch.tensor([0.4525, 0.452, 0.4485, 0.4534, 0.4524, 0.4529, 0.454, 0.453, 0.5127, 0.5326, 0.5204, 0.5253, 0.5439, 0.5424, 0.5133, 0.5078])
+        # fmt: on
+
+        generated_slice = generated_video.flatten()
+        generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
+        self.assertTrue(torch.allclose(generated_slice, expected_slice, atol=1e-3))

    @unittest.skip("Test not supported")
    def test_attention_slicing_forward_pass(self):
@@ -14,7 +14,6 @@

 import unittest

-import numpy as np
 import torch
 from PIL import Image
 from transformers import (
@@ -147,11 +146,15 @@ class WanImageToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
        inputs = self.get_dummy_inputs(device)
        video = pipe(**inputs).frames
        generated_video = video[0]
-
        self.assertEqual(generated_video.shape, (9, 3, 16, 16))
-        expected_video = torch.randn(9, 3, 16, 16)
-        max_diff = np.abs(generated_video - expected_video).max()
-        self.assertLessEqual(max_diff, 1e10)
+
+        # fmt: off
+        expected_slice = torch.tensor([0.4525, 0.4525, 0.4497, 0.4536, 0.452, 0.4529, 0.454, 0.4535, 0.5072, 0.5527, 0.5165, 0.5244, 0.5481, 0.5282, 0.5208, 0.5214])
+        # fmt: on
+
+        generated_slice = generated_video.flatten()
+        generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
+        self.assertTrue(torch.allclose(generated_slice, expected_slice, atol=1e-3))

    @unittest.skip("Test not supported")
    def test_attention_slicing_forward_pass(self):
@@ -162,7 +165,25 @@ class WanImageToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
        pass


-class WanFLFToVideoPipelineFastTests(WanImageToVideoPipelineFastTests):
+class WanFLFToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
+    pipeline_class = WanImageToVideoPipeline
+    params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs", "height", "width"}
+    batch_params = TEXT_TO_IMAGE_BATCH_PARAMS
+    image_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+    image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
+    required_optional_params = frozenset(
+        [
+            "num_inference_steps",
+            "generator",
+            "latents",
+            "return_dict",
+            "callback_on_step_end",
+            "callback_on_step_end_tensor_inputs",
+        ]
+    )
+    test_xformers_attention = False
+    supports_dduf = False
+
    def get_dummy_components(self):
        torch.manual_seed(0)
        vae = AutoencoderKLWan(
@@ -247,3 +268,32 @@ class WanFLFToVideoPipelineFastTests(WanImageToVideoPipelineFastTests):
            "output_type": "pt",
        }
        return inputs
+
+    def test_inference(self):
+        device = "cpu"
+
+        components = self.get_dummy_components()
+        pipe = self.pipeline_class(**components)
+        pipe.to(device)
+        pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(device)
+        video = pipe(**inputs).frames
+        generated_video = video[0]
+        self.assertEqual(generated_video.shape, (9, 3, 16, 16))
+
+        # fmt: off
+        expected_slice = torch.tensor([0.4531, 0.4527, 0.4498, 0.4542, 0.4526, 0.4527, 0.4534, 0.4534, 0.5061, 0.5185, 0.5283, 0.5181, 0.5309, 0.5365, 0.5113, 0.5244])
+        # fmt: on
+
+        generated_slice = generated_video.flatten()
+        generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
+        self.assertTrue(torch.allclose(generated_slice, expected_slice, atol=1e-3))
+
+    @unittest.skip("Test not supported")
+    def test_attention_slicing_forward_pass(self):
+        pass
+
+    @unittest.skip("TODO: revisit failing as it requires a very high threshold to pass")
+    def test_inference_batch_single_identical(self):
+        pass
@@ -14,7 +14,6 @@

 import unittest

-import numpy as np
 import torch
 from PIL import Image
 from transformers import AutoTokenizer, T5EncoderModel
@@ -123,11 +122,15 @@ class WanVideoToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
        inputs = self.get_dummy_inputs(device)
        video = pipe(**inputs).frames
        generated_video = video[0]
-
        self.assertEqual(generated_video.shape, (17, 3, 16, 16))
-        expected_video = torch.randn(17, 3, 16, 16)
-        max_diff = np.abs(generated_video - expected_video).max()
-        self.assertLessEqual(max_diff, 1e10)
+
+        # fmt: off
+        expected_slice = torch.tensor([0.4522, 0.4534, 0.4532, 0.4553, 0.4526, 0.4538, 0.4533, 0.4547, 0.513, 0.5176, 0.5286, 0.4958, 0.4955, 0.5381, 0.5154, 0.5195])
+        # fmt: on
+
+        generated_slice = generated_video.flatten()
+        generated_slice = torch.cat([generated_slice[:8], generated_slice[-8:]])
+        self.assertTrue(torch.allclose(generated_slice, expected_slice, atol=1e-3))

    @unittest.skip("Test not supported")
    def test_attention_slicing_forward_pass(self):
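
Note on the slice checks in the hunks above: the removed assertion compared the output against `torch.randn(...)` with a bound of `1e10`, so it could not fail in practice. The new tests instead compare the first 8 and last 8 flattened values against stored numbers. The sketch below reproduces that comparison in plain Python, without torch, to make the rule explicit. The helper names `edge_slice` and `all_close` and the sample data are hypothetical and not part of diffusers; the tolerance rule follows the formula documented for `torch.allclose` (with its default `rtol=1e-5`), and returning `False` on a length mismatch is an assumption of this sketch.

```python
def edge_slice(values, n=8):
    """Return the first n and last n elements of a flat sequence."""
    flat = list(values)
    return flat[:n] + flat[-n:]


def all_close(actual, expected, atol=1e-3, rtol=1e-5):
    """Check |a - e| <= atol + rtol * |e| for every pair of elements."""
    if len(actual) != len(expected):
        # Assumption of this sketch: mismatched lengths count as a failure.
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Hypothetical stand-in for a flattened video tensor (40 values).
frames = [i / 100 for i in range(40)]

# Reference values, playing the role of expected_slice in the tests.
expected = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07,
            0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39]

matches = all_close(edge_slice(frames), expected)

# Shifting one reference value by 0.01 exceeds atol=1e-3, so the check fails.
shifted = [expected[0] + 0.01] + expected[1:]
mismatch_detected = not all_close(edge_slice(frames), shifted)
```

Checking only 16 values keeps the reference data short enough to inline in the test, while still catching numerical drift at both ends of the output.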
Author	SHA1	Message	Date
Dhruv Nair	98954fc2e1	update	2025-07-28 05:33:00 +02:00
DN6	1262d19d16	update	2025-07-28 08:32:01 +05:30
YiYi Xu	201da97dd0	Merge branch 'main' into custom-code-updates	2025-07-23 10:23:35 -10:00
Aryan	f36ba9f094	[modular diffusers] Wan (#11913) * update	2025-07-23 06:19:40 -10:00
Sayak Paul	1c50a5f7e0	[tests] enforce torch version in the compilation tests. (#11979) enforce torch version in the compilation tests.	2025-07-23 19:42:46 +05:30
Sayak Paul	7ae6347e33	[docs] update `guidance_scale` docstring for guidance_distilled models. (#11935) * update guidance_scale docstring for guidance_distilled models. * Update pipeline_flux.py * Update pipeline_flux_control.py * Update pipeline_flux_kontext.py * Update pipeline_flux_kontext_inpaint.py * Update pipeline_sana_sprint.py * style * Update pipeline_hidream_image.py * Update pipeline_chroma.py * Update pipeline_chroma_img2img.py * Update pipeline_hunyuan_video.py --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>	2025-07-23 17:49:38 +05:30
Aryan	178d32dedd	[tests] Add test slices for Wan (#11920) * update * fix wan vace test slice * test * fix	2025-07-23 17:23:52 +05:30
YiYi Xu	ef1e628729	fix style (#11975) up	2025-07-22 10:25:40 -10:00
Sam Gao	173e1b147d	[Examples] Uniform notations in train_flux_lora (#10011) [Examples] uniform naming notations since the in parameter `size` represents `args.resolution`, I thus replace the `args.resolution` inside DreamBoothData with `size`. And revise some notations such as `center_crop`. Co-authored-by: Linoy Tsaban <57615435+linoytsaban@users.noreply.github.com>	2025-07-22 09:14:00 -10:00
Aryan	e46e139f95	Remove logger warnings for attention backends and hard error during runtime instead (#11967) * update * update * update	2025-07-22 20:47:44 +05:30
DN6	4423097b23	update	2025-07-22 19:31:22 +05:30
Yao Matrix	14725164be	fix "Expected all tensors to be on the same device, but found at least two devices" error (#11690) * xx * fix Signed-off-by: YAO Matrix <matrix.yao@intel.com> * Update model_loading_utils.py * Update test_models_unet_2d_condition.py * Update test_models_unet_2d_condition.py * fix style Signed-off-by: YAO Matrix <matrix.yao@intel.com> * fix comments Signed-off-by: Matrix Yao <matrix.yao@intel.com> * Update unet_2d_blocks.py * update Signed-off-by: Matrix Yao <matrix.yao@intel.com> --------- Signed-off-by: YAO Matrix <matrix.yao@intel.com> Signed-off-by: Matrix Yao <matrix.yao@intel.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2025-07-22 13:39:24 +02:00
YiYi Xu	638cc035e5	[Modular] update the collection behavior (#11963) * only remove from the collection	2025-07-21 08:47:07 -10:00