update

2025-07-17 21:56:48 +05:30 · 2025-07-17 19:57:45 +05:30 · 2025-07-16 19:41:48 +05:30 · 2025-07-15 10:47:41 +05:30 · 2025-07-15 09:15:57 +05:30 · 2025-07-14 14:54:38 -04:00
63 changed files with 3662 additions and 1851 deletions
@@ -188,7 +188,7 @@ jobs:
        shell: bash
    strategy:
      fail-fast: false
-      max-parallel: 2
+      max-parallel: 4
      matrix:
        module: [models, schedulers, lora, others]
    steps:
@@ -47,6 +47,10 @@ RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
        tensorboard \
        transformers \
        matplotlib \
-        setuptools==69.5.1
+        setuptools==69.5.1 \
+        bitsandbytes \
+        torchao \
+        gguf \
+        optimum-quanto

 CMD ["/bin/bash"]
@@ -94,14 +94,24 @@
    title: API Reference
  title: Hybrid Inference
 - sections:
-  - local: modular_diffusers/getting_started
-    title: Getting Started
+  - local: modular_diffusers/overview
+    title: Overview
+  - local: modular_diffusers/modular_pipeline
+    title: Modular Pipeline
  - local: modular_diffusers/components_manager
    title: Components Manager
-  - local: modular_diffusers/write_own_pipeline_block
-    title: Write your own pipeline block
+  - local: modular_diffusers/modular_diffusers_states
+    title: Modular Diffusers States
+  - local: modular_diffusers/pipeline_block
+    title: Pipeline Block
+  - local: modular_diffusers/sequential_pipeline_blocks
+    title: Sequential Pipeline Blocks
+  - local: modular_diffusers/loop_sequential_pipeline_blocks
+    title: Loop Sequential Pipeline Blocks
+  - local: modular_diffusers/auto_pipeline_blocks
+    title: Auto Pipeline Blocks
  - local: modular_diffusers/end_to_end_guide
-    title: End-to-End Developer Guide
+    title: End-to-End Example
  title: Modular Diffusers
 - sections:
  - local: using-diffusers/consisid
@@ -36,7 +36,7 @@ import torch
 from diffusers import ChromaPipeline

 pipe = ChromaPipeline.from_pretrained("lodestones/Chroma", torch_dtype=torch.bfloat16)
-pipe.enabe_model_cpu_offload()
+pipe.enable_model_cpu_offload()

 prompt = [
    "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done."
@@ -0,0 +1,316 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# AutoPipelineBlocks
+
+<Tip warning={true}>
+
+🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
+
+</Tip>
+
+`AutoPipelineBlocks` is a subclass of `ModularPipelineBlocks`. It is a multi-block that automatically selects which sub-blocks to run based on the inputs provided at runtime, creating conditional workflows that adapt to different scenarios. Its main purpose is convenience and portability: developers can package several workflows into a single one, making them easier to share and use.
+
+In this tutorial, we will show you how to create an `AutoPipelineBlocks` and learn more about how the conditional selection works.
+
+<Tip>
+
+Other types of multi-blocks include [SequentialPipelineBlocks](sequential_pipeline_blocks.md) (for linear workflows) and [LoopSequentialPipelineBlocks](loop_sequential_pipeline_blocks.md) (for iterative workflows). For information on creating individual blocks, see the [PipelineBlock guide](pipeline_block.md).
+
+Additionally, like all `ModularPipelineBlocks`, `AutoPipelineBlocks` are definitions/specifications, not runnable pipelines. You need to convert them into a `ModularPipeline` to actually execute them. For information on creating and running pipelines, see the [Modular Pipeline guide](modular_pipeline.md).
+
+</Tip>
+
+For example, you might want to support text-to-image and image-to-image tasks. Instead of creating two separate pipelines, you can create an `AutoPipelineBlocks` that automatically chooses the workflow based on whether an `image` input is provided.
+
+Let's see an example. We'll use the helper function from the [PipelineBlock guide](./pipeline_block.md) to create our blocks:
+
+**Helper Function**
+
+```py
+from diffusers.modular_pipelines import PipelineBlock, InputParam, OutputParam
+import torch
+
+def make_block(inputs=[], intermediate_inputs=[], intermediate_outputs=[], block_fn=None, description=None):
+    class TestBlock(PipelineBlock):
+        model_name = "test"
+        
+        @property
+        def inputs(self):
+            return inputs
+            
+        @property
+        def intermediate_inputs(self):
+            return intermediate_inputs
+            
+        @property
+        def intermediate_outputs(self):
+            return intermediate_outputs
+            
+        @property
+        def description(self):
+            return description if description is not None else ""
+            
+        def __call__(self, components, state):
+            block_state = self.get_block_state(state)
+            if block_fn is not None:
+                block_state = block_fn(block_state, state)
+            self.set_block_state(state, block_state)
+            return components, state
+    
+    return TestBlock
+```
+
+Now let's create a dummy `AutoPipelineBlocks` that includes dummy text-to-image, image-to-image, and inpaint pipelines.
+
+
+```py
+from diffusers.modular_pipelines import AutoPipelineBlocks 
+
+# These are dummy blocks and we only focus on "inputs" for our purpose
+inputs = [InputParam(name="prompt")]
+# block_fn prints out which workflow is running so we can see the execution order at runtime
+block_fn = lambda x, y: print("running the text-to-image workflow")
+block_t2i_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a text-to-image workflow!")
+
+inputs = [InputParam(name="prompt"), InputParam(name="image")]
+block_fn = lambda x, y: print("running the image-to-image workflow")
+block_i2i_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a image-to-image workflow!")
+
+inputs = [InputParam(name="prompt"), InputParam(name="image"), InputParam(name="mask")]
+block_fn = lambda x, y: print("running the inpaint workflow")
+block_inpaint_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a inpaint workflow!")
+
+class AutoImageBlocks(AutoPipelineBlocks):
+    # List of sub-block classes to choose from
+    block_classes = [block_inpaint_cls, block_i2i_cls, block_t2i_cls]
+    # Names for each block in the same order
+    block_names = ["inpaint", "img2img", "text2img"]
+    # Trigger inputs that determine which block to run
+    # - "mask" triggers inpaint workflow
+    # - "image" triggers img2img workflow (but only if mask is not provided) 
+    # - if none of above, runs the text2img workflow (default)
+    block_trigger_inputs = ["mask", "image", None]
+    # Description is extremely important for AutoPipelineBlocks
+    @property
+    def description(self):
+        return (
+            "Pipeline generates images given different types of conditions!\n"
+            + "This is an auto pipeline block that works for text2img, img2img and inpainting tasks.\n"
+            + " - inpaint workflow is run when `mask` is provided.\n"
+            + " - img2img workflow is run when `image` is provided (but only when `mask` is not provided).\n"
+            + " - text2img workflow is run when neither `image` nor `mask` is provided.\n"
+        )
+
+# Create the blocks
+auto_blocks = AutoImageBlocks()
+# convert to pipeline
+auto_pipeline = auto_blocks.init_pipeline()
+```
+
+Now we have created an `AutoPipelineBlocks` that contains 3 sub-blocks. Notice the warning message at the top - this automatically appears in every `ModularPipelineBlocks` that contains `AutoPipelineBlocks` to remind end users that dynamic block selection happens at runtime. 
+
+```py
+AutoImageBlocks(
+  Class: AutoPipelineBlocks
+
+  ====================================================================================================
+  This pipeline contains blocks that are selected at runtime based on inputs.
+  Trigger Inputs: ['mask', 'image']
+  ====================================================================================================
+
+
+  Description: Pipeline generates images given different types of conditions!
+      This is an auto pipeline block that works for text2img, img2img and inpainting tasks.
+       - inpaint workflow is run when `mask` is provided.
+       - img2img workflow is run when `image` is provided (but only when `mask` is not provided).
+       - text2img workflow is run when neither `image` nor `mask` is provided.
+      
+
+
+  Sub-Blocks:
+    • inpaint [trigger: mask] (TestBlock)
+       Description: I'm a inpaint workflow!
+
+    • img2img [trigger: image] (TestBlock)
+       Description: I'm a image-to-image workflow!
+
+    • text2img [default] (TestBlock)
+       Description: I'm a text-to-image workflow!
+
+)
+```
+
+Check out the documentation with `print(auto_pipeline.doc)`:
+
+```py
+>>> print(auto_pipeline.doc)
+class AutoImageBlocks
+
+  Pipeline generates images given different types of conditions!
+  This is an auto pipeline block that works for text2img, img2img and inpainting tasks.
+   - inpaint workflow is run when `mask` is provided.
+   - img2img workflow is run when `image` is provided (but only when `mask` is not provided).
+   - text2img workflow is run when neither `image` nor `mask` is provided.
+
+  Inputs:
+
+      prompt (`None`, *optional*):
+
+      image (`None`, *optional*):
+
+      mask (`None`, *optional*):
+```
+
+There is a fundamental trade-off with `AutoPipelineBlocks`: it trades clarity for convenience. While it makes packaging multiple workflows easy, it can become confusing without proper documentation. For example, suppose we hand you a pipeline, tell you only that it contains 3 sub-blocks and takes 3 inputs (`prompt`, `image`, and `mask`), and ask you to run an image-to-image workflow. Without prior knowledge of how these pipelines work, you would have little to go on.
+
+The pipeline we just made, though, has a docstring that lists all available inputs and workflows and explains which inputs select each workflow. For example, it's clear that you need to pass `image` to run img2img. This is why the `description` field is critical for `AutoPipelineBlocks`. We highly recommend explaining the conditional logic clearly for every `AutoPipelineBlocks` you create. We also recommend testing individual pipelines before packaging them into an `AutoPipelineBlocks`.
+
+Let's run this auto pipeline with different inputs to see if the conditional logic works as described. Remember that each block's `block_fn` prints its workflow name when the block's `__call__` method runs, so it should be easy to tell which one is running:
+
+```py
+>>> _ = auto_pipeline(image="image", mask="mask")
+running the inpaint workflow
+>>> _ = auto_pipeline(image="image")
+running the image-to-image workflow
+>>> _ = auto_pipeline(prompt="prompt")
+running the text-to-image workflow
+>>> _ = auto_pipeline(prompt="prompt", mask="mask")
+running the inpaint workflow
+```
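+
+The selection rule can be pictured with a small, self-contained sketch. It is a simplified model written for illustration, not the library's implementation: `select_block` is a hypothetical helper, and it assumes triggers are checked in order and that an input counts as provided when its value is not `None`.
+
+```py
+# Simplified, hypothetical model of trigger-based selection; not the library's code.
+def select_block(block_names, block_trigger_inputs, **inputs):
+    """Return the name of the first block whose trigger input was provided."""
+    default = None
+    for name, trigger in zip(block_names, block_trigger_inputs):
+        if trigger is None:
+            # A `None` trigger marks the default block.
+            default = name
+        elif inputs.get(trigger) is not None:
+            return name
+    # Fall back to the default block, or None when there is no default.
+    return default
+
+names = ["inpaint", "img2img", "text2img"]
+triggers = ["mask", "image", None]
+
+print(select_block(names, triggers, image="image", mask="mask"))  # inpaint
+print(select_block(names, triggers, image="image"))               # img2img
+print(select_block(names, triggers, prompt="prompt"))             # text2img
+```
+
+Because `mask` comes first in the trigger list, it wins whenever it is present, which matches the behavior stated in the description.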
+
+However, even with documentation, it can become very confusing when AutoPipelineBlocks are combined with other blocks. The complexity grows quickly when you have nested AutoPipelineBlocks or use them as sub-blocks in larger pipelines.
+
+Let's make another `AutoPipelineBlocks` - this one only contains one block, and it does not include `None` in its `block_trigger_inputs` (which corresponds to the default block to run when none of the trigger inputs are provided). This means this block will be skipped if the trigger input (`ip_adapter_image`) is not provided at runtime.
+
+```py
+from diffusers.modular_pipelines import SequentialPipelineBlocks, InsertableDict
+inputs = [InputParam(name="ip_adapter_image")]
+block_fn = lambda x, y: print("running the ip-adapter workflow")
+block_ipa_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a IP-adapter workflow!")
+
+class AutoIPAdapter(AutoPipelineBlocks):
+    block_classes = [block_ipa_cls]
+    block_names = ["ip-adapter"]
+    block_trigger_inputs = ["ip_adapter_image"]
+    @property
+    def description(self):
+        return "Run IP Adapter step if `ip_adapter_image` is provided."
+```
+
+Now let's combine these 2 auto blocks together into a `SequentialPipelineBlocks`:
+
+```py
+auto_ipa_blocks = AutoIPAdapter()
+blocks_dict = InsertableDict()
+blocks_dict["ip-adapter"] = auto_ipa_blocks
+blocks_dict["image-generation"] = auto_blocks
+all_blocks = SequentialPipelineBlocks.from_blocks_dict(blocks_dict)
+pipeline = all_blocks.init_pipeline()
+```
+
+Let's take a look. Things now get more confusing. In this particular example, you could still explain the conditional logic in the `description` field: there are only 6 possible execution paths (the IP-adapter step either runs or is skipped, followed by one of the 3 image-generation workflows), so it's doable. However, since this is a `SequentialPipelineBlocks` that could contain many more blocks, the complexity can quickly get out of hand as the number of blocks increases.
+
+```py
+>>> all_blocks
+SequentialPipelineBlocks(
+  Class: ModularPipelineBlocks
+
+  ====================================================================================================
+  This pipeline contains blocks that are selected at runtime based on inputs.
+  Trigger Inputs: ['image', 'mask', 'ip_adapter_image']
+  Use `get_execution_blocks()` with input names to see selected blocks (e.g. `get_execution_blocks('image')`).
+  ====================================================================================================
+
+
+  Description: 
+
+
+  Sub-Blocks:
+    [0] ip-adapter (AutoIPAdapter)
+       Description: Run IP Adapter step if `ip_adapter_image` is provided.
+                   
+
+    [1] image-generation (AutoImageBlocks)
+       Description: Pipeline generates images given different types of conditions!
+                   This is an auto pipeline block that works for text2img, img2img and inpainting tasks.
+                    - inpaint workflow is run when `mask` is provided.
+                    - img2img workflow is run when `image` is provided (but only when `mask` is not provided).
+                    - text2img workflow is run when neither `image` nor `mask` is provided.
+                   
+
+)
+
+```
+
+This is when the `get_execution_blocks()` method comes in handy - it basically extracts a `SequentialPipelineBlocks` that only contains the blocks that are actually run based on your inputs.
+
+Let's try some examples:
+
+`mask`: we expect it to skip the first ip-adapter since `ip_adapter_image` is not provided, and then run the inpaint for the second block.
+
+```py
+>>> all_blocks.get_execution_blocks('mask')
+SequentialPipelineBlocks(
+  Class: ModularPipelineBlocks
+
+  Description: 
+
+
+  Sub-Blocks:
+    [0] image-generation (TestBlock)
+       Description: I'm a inpaint workflow!
+
+)
+```
+
+Let's also actually run the pipeline to confirm:
+
+```py
+>>> _ = pipeline(mask="mask")
+skipping auto block: AutoIPAdapter
+running the inpaint workflow
+```
+
+Try a few more:
+
+```py
+print(f"inputs: ip_adapter_image:")
+blocks_select = all_blocks.get_execution_blocks('ip_adapter_image')
+print(f"expected_execution_blocks: {blocks_select}")
+print(f"actual execution blocks:")
+_ = pipeline(ip_adapter_image="ip_adapter_image", prompt="prompt")
+# expect to see ip-adapter + text2img
+
+print(f"inputs: image:")
+blocks_select = all_blocks.get_execution_blocks('image')
+print(f"expected_execution_blocks: {blocks_select}")
+print(f"actual execution blocks:")
+_ = pipeline(image="image", prompt="prompt")
+# expect to see img2img
+
+print(f"inputs: prompt:")
+blocks_select = all_blocks.get_execution_blocks('prompt')
+print(f"expected_execution_blocks: {blocks_select}")
+print(f"actual execution blocks:")
+_ = pipeline(prompt="prompt")
+# expect to see text2img (prompt is not a trigger input so fallback to default)
+
+print(f"inputs: mask + ip_adapter_image:")
+blocks_select = all_blocks.get_execution_blocks('mask','ip_adapter_image')
+print(f"expected_execution_blocks: {blocks_select}")
+print(f"actual execution blocks:")
+_ = pipeline(mask="mask", ip_adapter_image="ip_adapter_image")
+# expect to see ip-adapter + inpaint
+```
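+
+The expectations noted in the comments above can be reproduced with a simplified model of how a sequence of auto blocks resolves. This is a sketch under stated assumptions, not the library's implementation: `select_block`, `auto_blocks_spec`, and `execution_plan` are hypothetical names, and an auto block with no matching trigger and no default is assumed to be skipped.
+
+```py
+# Hypothetical, simplified model of resolving a sequence of auto blocks; not the library's code.
+def select_block(block_names, block_trigger_inputs, provided):
+    default = None
+    for name, trigger in zip(block_names, block_trigger_inputs):
+        if trigger is None:
+            default = name  # a `None` trigger marks the default block
+        elif trigger in provided:
+            return name
+    return default
+
+# (block names, trigger inputs) for each auto block, in execution order
+auto_blocks_spec = {
+    "ip-adapter": (["ip-adapter"], ["ip_adapter_image"]),
+    "image-generation": (["inpaint", "img2img", "text2img"], ["mask", "image", None]),
+}
+
+def execution_plan(*provided):
+    plan = []
+    for names, triggers in auto_blocks_spec.values():
+        selected = select_block(names, triggers, set(provided))
+        if selected is not None:  # no match and no default: the auto block is skipped
+            plan.append(selected)
+    return plan
+
+print(execution_plan("mask"))                      # ['inpaint']
+print(execution_plan("ip_adapter_image"))          # ['ip-adapter', 'text2img']
+print(execution_plan("mask", "ip_adapter_image"))  # ['ip-adapter', 'inpaint']
+```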
+
+In summary, `AutoPipelineBlocks` is a good tool for packaging multiple workflows into a single, convenient interface and it can greatly simplify the user experience. However, always provide clear descriptions explaining the conditional logic, test individual pipelines first before combining them, and use `get_execution_blocks()` to understand runtime behavior in complex compositions.
@@ -18,12 +18,12 @@ specific language governing permissions and limitations under the License.

 </Tip>

-The Components Manager is a central model registry and management system in diffusers. It lets you add models then reuse them across multiple pipelines and workflows. It tracks all models in one place with useful metadata such as model size, device placement and loaded adapters (LoRA, IP-Adapter). It has mechanisms in place to prevent duplicate model instances, enables memory-efficient sharing. Most significantly, it offers offloading that works across pipelines — unlike regular DiffusionPipeline offloading which is limited to one pipeline with predefined sequences, the Components Manager automatically manages your device memory across all your models and workflows. 
+The Components Manager is a central model registry and management system in diffusers. It lets you add models and then reuse them across multiple pipelines and workflows. It tracks all models in one place with useful metadata such as model size, device placement, and loaded adapters (LoRA, IP-Adapter). It has mechanisms in place to prevent duplicate model instances and enable memory-efficient sharing. Most significantly, it offers offloading that works across pipelines: unlike regular `DiffusionPipeline` offloading (i.e. `enable_model_cpu_offload` and `enable_sequential_cpu_offload`), which is limited to one pipeline with predefined sequences, the Components Manager automatically manages your device memory across all your models and workflows.


 ## Basic Operations

-Let's start with the fundamental operations. First, create a Components Manager:
+Let's start with the most basic operations. First, create a Components Manager:

 ```py
 from diffusers import ComponentsManager
@@ -144,9 +144,9 @@ Components:
 ======================================================================================================================================================================================================
 Models:
 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Name_ID                             | Class                     | Device: act(exec)    | Dtype           | Size (GB)  | Load ID                                                         | Collection
+Name_ID                                 | Class                     | Device: act(exec)    | Dtype           | Size (GB)  | Load ID                                                         | Collection
 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-text_encoder_139918506246832        | CLIPTextModel             | cpu                  | torch.float32   | 0.46       | stabilityai/stable-diffusion-xl-base-1.0|text_encoder|null|null | N/A
+text_encoder_139918506246832            | CLIPTextModel             | cpu                  | torch.float32   | 0.46       | stabilityai/stable-diffusion-xl-base-1.0|text_encoder|null|null | N/A
 text_encoder_duplicated_139917580682672 | CLIPTextModel             | cpu                  | torch.float32   | 0.46       | stabilityai/stable-diffusion-xl-base-1.0|text_encoder|null|null | N/A
 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@@ -208,7 +208,7 @@ The `get_one()` method returns a single component and supports pattern matching
 - exclusion patterns like `comp.get_one(name="!unet")` to exclude components named "unet"
 - OR patterns like `comp.get_one(name="unet|vae")` to match either "unet" OR "vae". 

-You can also filter by collection with `comp.get_one(name="unet", collection="sdxl")` or by load_id. If multiple components match, `get_one()` throws an error.
+Optionally, you can add `collection` and `load_id` as filters, e.g. `comp.get_one(name="unet", collection="sdxl")`. If multiple components match, `get_one()` throws an error.

 Another useful method is `get_components_by_names()`, which takes a list of names and returns a dictionary mapping names to components. This is particularly helpful with modular pipelines since they provide lists of required component names, and the returned dictionary can be directly passed to `pipeline.update_components()`.

@@ -260,7 +260,7 @@ Now let's load all default components and then create a second pipeline that reu

 ```py
 # Load all default components 
->>> pipe.load_default_components()`
+>>> pipe.load_default_components()

 # Create a second pipeline using the same Components Manager but with a different collection
 >>> pipe2 = ModularPipeline.from_pretrained("YiYiXu/modular-demo-auto", components_manager=comp, collection="test2")
@@ -282,7 +282,7 @@ As mentioned earlier, `ModularPipeline` has a property `null_component_names` th

 The warnings that follow are expected and indicate that the Components Manager is correctly identifying that these components already exist and will be reused rather than creating duplicates:

-```
+```out
 ComponentsManager: component 'text_encoder' already exists as 'text_encoder_139917586016400'
 ComponentsManager: component 'text_encoder_2' already exists as 'text_encoder_2_139917699973424'
 ComponentsManager: component 'tokenizer' already exists as 'tokenizer_139917580599504'
@@ -293,7 +293,7 @@ ComponentsManager: component 'vae' already exists as 'vae_139917722459040'
 ComponentsManager: component 'scheduler' already exists as 'scheduler_139916266559408'
 ComponentsManager: component 'controlnet' already exists as 'controlnet_139917722454432'
 ```
-```
+

 The pipeline is now fully loaded:

@@ -359,9 +359,9 @@ When enabled, all models start on CPU. The manager moves models to the device ri

 Now that we've covered the basics of the Components Manager, let's walk through a practical example that shows how to build workflows in a modular setting and use the Components Manager to reuse components across multiple pipelines. This example demonstrates the true power of Modular Diffusers by working with multiple pipelines that can share components. 

-In this example, we'll generate latents from a text-to-image pipeline, then refine them with an image-to-image pipeline. We will also use Lora and IP-Adapter.
+In this example, we'll generate latents from a text-to-image pipeline, then refine them with an image-to-image pipeline.

-Let's create a modular text-to-image workflow by separating it into three components: `text_blocks` for encoding prompts, `t2i_blocks` for generating latents, and `decoder_blocks` for creating final images.
+Let's create a modular text-to-image workflow by separating it into three workflows: `text_blocks` for encoding prompts, `t2i_blocks` for generating latents, and `decoder_blocks` for creating final images.

 ```py
 import torch
@@ -374,7 +374,9 @@ text_blocks = t2i_blocks.sub_blocks.pop("text_encoder")
 decoder_blocks = t2i_blocks.sub_blocks.pop("decode")
 ```

-Now we will convert them into runnalbe pipelines and set up the Components Manager with auto offloading and organize components under a "t2i" collection:
+Now we will convert them into runnable pipelines, set up the Components Manager with auto offloading, and organize components under a "t2i" collection.
+
+Since we now have 3 different workflows that share components, we create a separate pipeline that serves as a dedicated loader: it loads all the components and registers them to the Components Manager so they can be reused across the different workflows.

 ```py
 from diffusers import ComponentsManager, ModularPipeline
@@ -383,20 +385,21 @@ from diffusers import ComponentsManager, ModularPipeline
 components = ComponentsManager()
 components.enable_auto_cpu_offload(device="cuda")

-# Create pipelines and load components
+# Create a new pipeline to load the components
 t2i_repo = "YiYiXu/modular-demo-auto"
 t2i_loader_pipe = ModularPipeline.from_pretrained(t2i_repo, components_manager=components, collection="t2i")

+# convert the 3 blocks into pipelines and attach the same components manager to all 3
 text_node = text_blocks.init_pipeline(t2i_repo, components_manager=components)
 decoder_node = decoder_blocks.init_pipeline(t2i_repo, components_manager=components)
 t2i_pipe = t2i_blocks.init_pipeline(t2i_repo, components_manager=components)
 ```

-Load all components into the Components Manager under the "t2i" collection:
+Load all components into the loader pipeline. They should all be automatically registered to the Components Manager under the "t2i" collection:

 ```py
 # Load all components (including IP-Adapter and ControlNet for later use)
-t2i_loader_pipe.load_components(names=t2i_loader_pipe.pretrained_component_names, torch_dtype=torch.float16)
+t2i_loader_pipe.load_default_components(torch_dtype=torch.float16)
 ```

 Now distribute the loaded components to each pipeline:
@@ -432,7 +435,7 @@ image.save("modular_part2_t2i.png")
 Let's add a LoRA:

 ```py
-# Load LoRA weights - only the UNet gets the adapter
+# Load LoRA weights 
 >>> t2i_loader_pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy_face")
 >>> components
 Components:
@@ -464,7 +467,8 @@ refiner_blocks = SequentialPipelineBlocks.from_blocks_dict(ALL_BLOCKS["img2img"]
 refiner_blocks.sub_blocks.pop("image_encoder")
 refiner_blocks.sub_blocks.pop("decode")

-# Create refiner pipeline with different repo and collection
+# Create the refiner pipeline with a different repo and collection,
+# and attach the same Components Manager to it
 refiner_repo = "YiYiXu/modular_refiner"
 refiner_pipe = refiner_blocks.init_pipeline(refiner_repo, components_manager=components, collection="refiner")
 ```
@@ -266,27 +266,27 @@ class SDXLDiffDiffLoopBeforeDenoiser(PipelineBlock):
            "Step within the denoising loop for differential diffusion that prepare the latent input for the denoiser"
        )

-    @property
-    def inputs(self) -> List[Tuple[str, Any]]:
-        return [
-            InputParam("denoising_start"),
-        ]
+   @property
+   def inputs(self) -> List[Tuple[str, Any]]:
+       return [
+           InputParam("denoising_start"),
+       ]

    @property
    def intermediate_inputs(self) -> List[str]:
        return [
            InputParam("latents", required=True, type_hint=torch.Tensor),
-            InputParam("original_latents", type_hint=torch.Tensor),
-            InputParam("diffdiff_masks", type_hint=torch.Tensor),
+           InputParam("original_latents", type_hint=torch.Tensor),
+           InputParam("diffdiff_masks", type_hint=torch.Tensor),
        ]

    def __call__(self, components, block_state, i, t):
-        # Apply differential diffusion logic
-        if i == 0 and block_state.denoising_start is None:
-            block_state.latents = block_state.original_latents[:1]
-        else:
-            block_state.mask = block_state.diffdiff_masks[i].unsqueeze(0).unsqueeze(1)
-            block_state.latents = block_state.original_latents[i] * block_state.mask + block_state.latents * (1 - block_state.mask)
+       # Apply differential diffusion logic
+       if i == 0 and block_state.denoising_start is None:
+           block_state.latents = block_state.original_latents[:1]
+       else:
+           block_state.mask = block_state.diffdiff_masks[i].unsqueeze(0).unsqueeze(1)
+           block_state.latents = block_state.original_latents[i] * block_state.mask + block_state.latents * (1 - block_state.mask)
        
        # ... rest of existing logic ...
 ```
@@ -361,9 +361,9 @@ Run the example now, you should see an apple with its right half transformed int

 ## Adding IP-adapter

-We provide an auto IP-adapter block that you can plug-and-play into your modular workflow. It's an `AutoPipelineBlocks`, so it will only run when the user passes an IP adapter image. In this tutorial, we'll focus on how to package it into your differential diffusion workflow. To learn more about `AutoPipelineBlocks`, see [here](https://huggingface.co/docs/diffusers/modular_diffusers/write_own_pipeline_block#autopipelineblocks)
+We provide an auto IP-adapter block that you can plug-and-play into your modular workflow. It's an `AutoPipelineBlocks`, so it will only run when the user passes an IP adapter image. In this tutorial, we'll focus on how to package it into your differential diffusion workflow. To learn more about `AutoPipelineBlocks`, see [here](./auto_pipeline_blocks.md)

-We talked about how to add IP-adapter into your workflow in the [getting-started guide](https://huggingface.co/docs/diffusers/modular_diffusers/quicktour#ip-adapter). Let's just go ahead to create the IP-adapter block.
+We talked about how to add IP-adapter into your workflow in the [Modular Pipeline Guide](./modular_pipeline.md). Let's go ahead and create the IP-adapter block.

 ```py
 >>> from diffusers.modular_pipelines.stable_diffusion_xl.encoders import StableDiffusionXLAutoIPAdapterStep
@@ -496,7 +496,7 @@ From looking at the code workflow: differential diffusion only modifies the "bef

 Intuitively, these two techniques are orthogonal and should combine naturally: differential diffusion controls how much the inference process can deviate from the original in each region, while ControlNet controls in what direction that change occurs.

-With this understanding, let's assemble the `SDXLDiffDiffControlNetDenoiseStep`:
+With this understanding, let's assemble the diffdiff-controlnet loop by combining the diffdiff before-denoiser step and controlnet denoiser step.

 ```py
 >>> class SDXLDiffDiffControlNetDenoiseStep(StableDiffusionXLDenoiseLoopWrapper):
@@ -617,7 +617,7 @@ to use
 ```
 ## Creating a Modular Repo

-You can easily share your differential diffusion workflow on the hub, by creating a modular repo like this https://huggingface.co/YiYiXu/modular-diffdiff
+You can easily share your differential diffusion workflow on the Hub by creating a modular repo. This is one created using the code we just wrote together: https://huggingface.co/YiYiXu/modular-diffdiff

 To create a Modular Repo and share on hub, you just need to run `save_pretrained()` along with the `push_to_hub=True` flag. Note that if your pipeline contains custom block, you need to manually upload the code to the hub. But we are working on a command line tool to help you upload it very easily.

@@ -641,7 +641,7 @@ With a modular repo, it is very easy for the community to use the workflow you j
 >>> components.enable_auto_cpu_offload()
 ```

-see more usage example on model card
+See more usage examples on the model card.

 ## deploy a mellon node

@@ -0,0 +1,194 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# LoopSequentialPipelineBlocks
+
+<Tip warning={true}>
+
+🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
+
+</Tip>
+
+`LoopSequentialPipelineBlocks` is a subclass of `ModularPipelineBlocks`. It is a multi-block that composes other blocks together in a loop, creating iterative workflows where blocks run multiple times with evolving state. It's particularly useful for denoising loops requiring repeated execution of the same blocks.
+
+<Tip>
+
+Other types of multi-blocks include [SequentialPipelineBlocks](./sequential_pipeline_blocks.md) (for linear workflows) and [AutoPipelineBlocks](./auto_pipeline_blocks.md) (for conditional block selection). For information on creating individual blocks, see the [PipelineBlock guide](./pipeline_block.md).
+
+Additionally, like all `ModularPipelineBlocks`, `LoopSequentialPipelineBlocks` are definitions/specifications, not runnable pipelines. You need to convert them into a `ModularPipeline` to actually execute them. For information on creating and running pipelines, see the [Modular Pipeline guide](modular_pipeline.md).
+
+</Tip>
+
+You could create a loop using `PipelineBlock` like this:
+
+```python
+class DenoiseLoop(PipelineBlock):
+    def __call__(self, components, state):
+        block_state = self.get_block_state(state)
+        for t in range(block_state.num_inference_steps):
+            # ... loop logic here
+            pass
+        self.set_block_state(state, block_state)
+        return components, state
+```
+
+But in this tutorial, we will focus on how to use `LoopSequentialPipelineBlocks` to create a "composable" denoising loop where you can add or remove blocks within the loop or reuse the same loop structure with different block combinations.
+
+It involves two parts: a **loop wrapper** and **loop blocks**.
+
+* The **loop wrapper** (`LoopSequentialPipelineBlocks`) defines the loop structure, e.g. the iteration variables and loop configuration such as the progress bar.
+
+* The **loop blocks** are standard pipeline blocks that you add to the loop wrapper.
+  - they run sequentially for each iteration of the loop
+  - they receive the current iteration index as an additional parameter
+  - they share the same `block_state` throughout the entire loop
+
+Unlike regular `SequentialPipelineBlocks` where each block gets its own state, loop blocks share a single state that persists and evolves across iterations.
+
+We will build a simple loop block to demonstrate these concepts. Creating a loop block involves three steps:
+1. defining the loop wrapper class
+2. creating the loop blocks
+3. adding the loop blocks to the loop wrapper class to create the loop wrapper instance
+
+**Step 1: Define the Loop Wrapper**
+
+To create a `LoopSequentialPipelineBlocks` class, you need to define:
+
+* `loop_inputs`: User input variables (equivalent to `PipelineBlock.inputs`)
+* `loop_intermediate_inputs`: Intermediate variables needed from the mutable pipeline state (equivalent to `PipelineBlock.intermediate_inputs`)
+* `loop_intermediate_outputs`: New intermediate variables this block will add to the mutable pipeline state (equivalent to `PipelineBlock.intermediate_outputs`)
+* `__call__` method: Defines the loop structure and iteration logic
+
+Here is an example of a loop wrapper:
+
+```py
+import torch
+from diffusers.modular_pipelines import LoopSequentialPipelineBlocks, PipelineBlock, InputParam, OutputParam
+
+class LoopWrapper(LoopSequentialPipelineBlocks):
+    model_name = "test"
+    @property
+    def description(self):
+        return "I'm a loop!!"
+    @property
+    def loop_inputs(self):
+        return [InputParam(name="num_steps")]
+    @torch.no_grad()
+    def __call__(self, components, state):
+        block_state = self.get_block_state(state)
+        # Loop structure - can be customized to your needs
+        for i in range(block_state.num_steps):
+            # loop_step executes all registered blocks in sequence
+            components, block_state = self.loop_step(components, block_state, i=i)
+        self.set_block_state(state, block_state)
+        return components, state
+```
+
+**Step 2: Create Loop Blocks**
+
+Loop blocks are standard `PipelineBlock`s, but their `__call__` method works differently:
+* It receives the iteration variable (e.g., `i`) passed by the loop wrapper
+* It works directly with `block_state` instead of pipeline state
+* No need to call `self.get_block_state()` or `self.set_block_state()`
+
+```py
+class LoopBlock(PipelineBlock):
+    # this is used to identify the model family, we won't worry about it in this example
+    model_name = "test"
+    @property
+    def inputs(self):
+        return [InputParam(name="x")]
+    @property
+    def intermediate_outputs(self):
+        # outputs produced by this block
+        return [OutputParam(name="x")]
+    @property
+    def description(self):
+        return "I'm a block used inside the `LoopWrapper` class"
+    def __call__(self, components, block_state, i: int):
+        block_state.x += 1
+        return components, block_state
+```
+
+**Step 3: Combine Everything**
+
+Finally, assemble your loop by adding the block(s) to the wrapper:
+
+```py
+loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock})
+```
+
+Now you've created a loop with one step:
+
+```py
+>>> loop
+LoopWrapper(
+  Class: LoopSequentialPipelineBlocks
+
+  Description: I'm a loop!!
+
+  Sub-Blocks:
+    [0] block1 (LoopBlock)
+       Description: I'm a block used inside the `LoopWrapper` class
+
+)
+```
+
+It has two inputs: `x` (used at each step within the loop) and `num_steps` (used to define the loop).
+
+```py
+>>> print(loop.doc)
+class LoopWrapper
+
+  I'm a loop!!
+
+  Inputs:
+
+      x (`None`, *optional*):
+
+      num_steps (`None`, *optional*):
+
+  Outputs:
+
+      x (`None`):
+```
+
+**Running the Loop:**
+
+```py
+# run the loop
+loop_pipeline = loop.init_pipeline()
+x = loop_pipeline(num_steps=10, x=0, output="x")
+assert x == 10
+```
+
+**Adding Multiple Blocks:**
+
+We can add multiple blocks to run within each iteration. Let's run the loop block twice within each iteration:
+
+```py
+loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock, "block2": LoopBlock})
+loop_pipeline = loop.init_pipeline()
+x = loop_pipeline(num_steps=10, x=0, output="x")
+assert x == 20  # Each iteration runs 2 blocks, so 10 iterations * 2 = 20
+```
+
+**Key Differences from SequentialPipelineBlocks:**
+
+The main difference is that loop blocks share the same `block_state` across all iterations, allowing values to accumulate and evolve throughout the loop. Loop blocks can receive additional arguments (like the current iteration index) depending on the loop wrapper's implementation, since the wrapper defines how loop blocks are called. You can easily add, remove, or reorder blocks within the loop without changing the loop logic itself.
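
To make the shared-state mechanics concrete without loading any models, here is a minimal plain-Python sketch. It does not use the diffusers classes: `run_loop`, `add_one`, and `record_step` are hypothetical names invented for this illustration, and the real loop wrapper may call its blocks differently.

```python
from types import SimpleNamespace

# Hypothetical stand-ins, not the diffusers classes: each "loop block" is a
# callable that receives the shared state and the current iteration index.
def add_one(block_state, i):
    block_state.x += 1
    return block_state

def record_step(block_state, i):
    block_state.history.append((i, block_state.x))
    return block_state

def run_loop(blocks, block_state, num_steps):
    # The wrapper owns the loop structure; the blocks only see the shared state.
    for i in range(num_steps):
        for block in blocks:  # blocks run in order within each iteration
            block_state = block(block_state, i)
    return block_state

state = SimpleNamespace(x=0, history=[])
state = run_loop([add_one, add_one, record_step], state, num_steps=3)
print(state.x)        # 6: two increments per iteration, three iterations
print(state.history)  # [(0, 2), (1, 4), (2, 6)]
```

Changing the list passed to `run_loop` adds, removes, or reorders blocks while the loop structure itself stays untouched.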
+
+The officially supported denoising loops in Modular Diffusers are implemented using `LoopSequentialPipelineBlocks`. You can explore the actual implementation to see how these concepts work in practice:
+
+```py
+from diffusers.modular_pipelines.stable_diffusion_xl.denoise import StableDiffusionXLDenoiseStep
+StableDiffusionXLDenoiseStep()
+```
@@ -0,0 +1,59 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# PipelineState and BlockState
+
+<Tip warning={true}>
+
+🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
+
+</Tip>
+
+In Modular Diffusers, `PipelineState` and `BlockState` are the core data structures that enable blocks to communicate and share data. Understanding them is fundamental to understanding how blocks interact with each other and with the pipeline system.
+
+In the modular diffusers system, `PipelineState` acts as the global state container that all pipeline blocks operate on. It maintains the complete runtime state of the pipeline and provides a structured way for blocks to read from and write to shared data.
+
+A `PipelineState` consists of two distinct states:
+
+- **The immutable state** (i.e. the `inputs` dict) contains a copy of values provided by users. Once a value is added to the immutable state, it cannot be changed. Blocks can read from the immutable state but cannot write to it.
+
+- **The mutable state** (i.e. the `intermediates` dict) contains variables that are passed between blocks and can be modified by them.
+
+Here's an example of what a `PipelineState` looks like:
+
+```py
+PipelineState(
+  inputs={
+    'prompt': 'a cat'
+    'guidance_scale': 7.0
+    'num_inference_steps': 25
+  },
+  intermediates={
+    'prompt_embeds': Tensor(dtype=torch.float32, shape=torch.Size([1, 1, 1, 1]))
+    'negative_prompt_embeds': None
+  },
+)
+```
+
+Each pipeline block defines which parts of that state it can read from and write to through its `inputs`, `intermediate_inputs`, and `intermediate_outputs` properties. At run time, it gets a local view (`BlockState`) of the variables it needs from `PipelineState`, performs its operations, and then updates `PipelineState` with any changes.
+
+For example, if a block defines an input `image`, inside the block's `__call__` method, the `BlockState` would contain:
+
+```py
+BlockState(
+    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494640>
+)
+```
+
+You can access the variables directly as attributes: `block_state.image`.
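
The read, compute, write-back cycle described above can be sketched in plain Python. This is an illustration under assumptions rather than the diffusers implementation: `make_block_state` and `write_back` are hypothetical helpers, and a read-only mapping stands in for the immutable `inputs` dict.

```python
from types import MappingProxyType, SimpleNamespace

# Hypothetical stand-in for a pipeline state: read-only user inputs plus a
# mutable dict of intermediates shared between blocks.
inputs = MappingProxyType({"prompt": "a cat", "num_inference_steps": 25})
intermediates = {}

def make_block_state(names):
    # Build a local view containing only the variables a block declared.
    values = {}
    for name in names:
        values[name] = inputs[name] if name in inputs else intermediates.get(name)
    return SimpleNamespace(**values)

def write_back(block_state, output_names):
    # Copy the declared outputs from the local view into the shared state.
    for name in output_names:
        intermediates[name] = getattr(block_state, name)

block_state = make_block_state(["prompt"])
block_state.prompt_length = len(block_state.prompt)  # the block's "computation"
write_back(block_state, ["prompt_length"])
print(intermediates)
```

The block only sees `prompt` because that is the only name it declared, and only the declared output is copied back into the shared intermediates.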
+
+To learn more about how blocks interact with the pipeline state through their `inputs`, `intermediate_inputs`, and `intermediate_outputs` properties, see the [PipelineBlock guide](./pipeline_block.md).
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Getting Started with Modular Diffusers: A Comprehensive Overview
+# ModularPipeline

 <Tip warning={true}>

@@ -18,32 +18,33 @@ specific language governing permissions and limitations under the License.

 </Tip>

-With Modular Diffusers, we introduce a unified pipeline system that simplifies how you work with diffusion models. Instead of creating separate pipelines for each task, Modular Diffusers lets you:
+`ModularPipeline` is the main interface for end users to run pipelines in Modular Diffusers. It takes pipeline blocks and converts them into a runnable pipeline that can load models and execute the computation steps.

-**Write Only What's New**: You won't need to write an entire pipeline from scratch every time you have a new use case. You can create pipeline blocks just for your new workflow's unique aspects and reuse existing blocks for existing functionalities. 
+In this guide, we will focus on how to build pipelines using the blocks officially supported in 🧨 Diffusers. We'll cover how to use predefined blocks and convert them into a `ModularPipeline` for execution.

-**Assemble Like LEGO®**: You can mix and match between blocks in flexible ways. This allows you to write dedicated blocks unique to specific workflows, and then assemble different blocks into a pipeline that can be used more conveniently for multiple workflows. 
+<Tip>

-In this guide, we will focus on how to build end-to-end pipelines using blocks we officially support at diffusers 🧨! We will show you how to write your own pipeline blocks and go into more details on how they work under the hood in this [guide](./write_own_pipeline_block.md). For advanced users who want to build complete workflows from scratch, we provide an end-to-end example in the [Developer Guide](./end_to_end.md) that covers everything from writing custom pipeline blocks to deploying your workflow as a UI node.
+This guide shows you how to use predefined blocks. If you want to learn how to create your own pipeline blocks, see the [PipelineBlock guide](pipeline_block.md) for creating individual blocks, and the multi-block guides for connecting them together:
+- [SequentialPipelineBlocks](sequential_pipeline_blocks.md) (for linear workflows)
+- [LoopSequentialPipelineBlocks](loop_sequential_pipeline_blocks.md) (for iterative workflows)  
+- [AutoPipelineBlocks](auto_pipeline_blocks.md) (for conditional workflows)

-Let's get started! The Modular Diffusers Framework consists of three main components:
- ModularPipelineBlocks: Building blocks for your workflow, each block defines inputs/outputs and computation steps. These are just definitions and not runnable.
- PipelineState & BlockState: Store and manage data as it flows through the pipeline.
- ModularPipeline: Loads models and runs the computation steps. You convert blocks to pipelines to make them executable.
+For information on how data flows through pipelines, see the [PipelineState and BlockState guide](modular_diffusers_states.md).

-## ModularPipelineBlocks
+</Tip>

-Pipeline blocks are the fundamental building blocks of the Modular Diffusers system. All pipeline blocks inherit from the base class `ModularPipelineBlocks`, including:

- [`PipelineBlock`]: The most granular block - you define the computation logic.
+## Create ModularPipelineBlocks
+
+In the Modular Diffusers system, you build pipelines using pipeline blocks. Pipeline blocks are the fundamental building blocks: they define what components, inputs/outputs, and computation logic are needed. They are designed to be assembled into workflows for tasks such as image generation, video creation, and inpainting. But they are just definitions and don't actually run anything. To execute blocks, you need to put them into a `ModularPipeline`. We'll first learn how to create predefined blocks here before talking about how to run them using `ModularPipeline`.
+
+All pipeline blocks inherit from the base class `ModularPipelineBlocks`, including:
+
+- [`PipelineBlock`]: The most granular block - you define the input/output/components requirements and computation logic.
 - [`SequentialPipelineBlocks`]: A multi-block composed of multiple blocks that run sequentially, passing outputs as inputs to the next block.
 - [`LoopSequentialPipelineBlocks`]: A special type of `SequentialPipelineBlocks` that runs the same sequence of blocks multiple times (loops), typically used for iterative processes like denoising steps in diffusion models.
 - [`AutoPipelineBlocks`]: A multi-block composed of multiple blocks that are selected at runtime based on the inputs.

-All blocks have a consistent interface defining their requirements (components, configs, inputs, outputs) and computation logic. They can be defined standalone or combined into larger blocks - They are designed to be assembled into workflows for tasks such as image generation, video creation, and inpainting. However, blocks aren't runnable on thier own and they need to be converted into a a ModularPipeline to actually run. 
-
-**Blocks vs Pipelines**: Blocks are just definitions - they define what components, inputs/outputs, and computation logics are needed, but they don't actually run anything. To execute blocks, you need to put them into a `ModularPipeline`. See the [ModularPipeline from ModularPipelineBlocks](#modularpipeline-from-modularpipelineblocks) section for how to create and run pipelines.
-
 It is very easy to use a `ModularPipelineBlocks` officially supported in 🧨 Diffusers

 ```py
@@ -74,9 +75,7 @@ StableDiffusionXLTextEncoderStep(
 )
 ```

-More commonly, you need multiple blocks to build your workflow. You can create a `SequentialPipelineBlocks` using block class presets from 🧨 Diffusers.
-
-`TEXT2IMAGE_BLOCKS` is a predefined dictionary containing all the blocks needed for a complete text-to-image pipeline (text encoding, denoising, decoding, etc.). We will see more details soon.
+More commonly, you need multiple blocks to build your workflow. You can create a `SequentialPipelineBlocks` using block class presets from 🧨 Diffusers. `TEXT2IMAGE_BLOCKS` is a dict containing all the blocks needed for text-to-image generation.

 ```py
 from diffusers.modular_pipelines import SequentialPipelineBlocks
@@ -84,7 +83,7 @@ from diffusers.modular_pipelines.stable_diffusion_xl import TEXT2IMAGE_BLOCKS
 t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
 ```

-This creates a `SequentialPipelineBlocks`, which is a multi-block composed of other blocks. Unlike single blocks (like the `text_encoder_block` we saw earlier), this multi-block has a `sub_blocks` attribute that contains the sub-blocks (text_encoder, input, set_timesteps, prepare_latents, prepare_added_con, denoise, decode). Its requirements for components, inputs, and intermediate inputs are combined from these blocks that compose it. At runtime, it executes its sub-blocks sequentially and passes the pipeline state from one block to another. 
+This creates a `SequentialPipelineBlocks`. Unlike the `text_encoder_block` we saw earlier, this is a multi-block and its `sub_blocks` attribute contains a list of other blocks (text_encoder, input, set_timesteps, prepare_latents, prepare_added_con, denoise, decode). Its requirements for components, inputs, and intermediate inputs are combined from these blocks that compose it. At runtime, it executes its sub-blocks sequentially and passes the pipeline state from one block to another. 

 ```py
 >>> t2i_blocks
@@ -145,7 +144,7 @@ SequentialPipelineBlocks(
 )
 ```

-The block classes preset (`TEXT2IMAGE_BLOCKS`) we used is just a dictionary that maps names to ModularPipelineBlocks classes
+The block classes preset we used (`TEXT2IMAGE_BLOCKS`) is just a dictionary that maps names to `ModularPipelineBlocks` classes:

 ```py
 >>> TEXT2IMAGE_BLOCKS
@@ -179,9 +178,9 @@ Note that both the block classes preset and the `sub_blocks` attribute are `Inse

 **Add a block:**
 ```py
-# BLOCKS is a block class preset, you need to add class to it
+# BLOCKS is a dict of block classes, so you add a class to it
 BLOCKS.insert("block_name", BlockClass, index)
-# Add a block instance to the `sub_blocks` attribute
+# The sub_blocks attribute contains instances, so you add a block instance to it
 t2i_blocks.sub_blocks.insert("block_name", block_instance, index)
 ```

@@ -197,7 +196,7 @@ text_encoder_block = t2i_blocks.sub_blocks.pop("text_encoder")
 ```py
 # Replace block class in preset
 BLOCKS["prepare_latents"] = CustomPrepareLatents
-# Replace in sub_blocks attribute
+# Replace in sub_blocks attribute using a block instance
 t2i_blocks.sub_blocks["prepare_latents"] = CustomPrepareLatents()
 ```

@@ -209,7 +208,9 @@ Let's make a new block classes preset by insert IP-Adapter at index 0 (before th
 ```py
 from diffusers.modular_pipelines.stable_diffusion_xl import StableDiffusionXLAutoIPAdapterStep
 CUSTOM_BLOCKS = TEXT2IMAGE_BLOCKS.copy()
+# CUSTOM_BLOCKS is now a preset including ip_adapter
 CUSTOM_BLOCKS.insert("ip_adapter", StableDiffusionXLAutoIPAdapterStep, 0)
+# create a blocks instance from the preset
 custom_blocks = SequentialPipelineBlocks.from_blocks_dict(CUSTOM_BLOCKS)
 ```

@@ -299,27 +300,16 @@ ALL_BLOCKS = {

 </Tip>

-We will not go over how to write your own ModularPipelineBlocks but you can learn more about it [here](./write_own_pipeline_block.md).
+This covers the essentials of pipeline blocks! As mentioned earlier, **pipeline blocks are not runnable by themselves**. They are essentially **"definitions"**: they define the specifications and computational steps for a pipeline, but they do not contain any model states. To actually run them, you need to convert them into a `ModularPipeline` object.

-This covers the essentials of pipeline blocks! You may have noticed that we haven't discussed how to load or run pipeline blocks - that's because **pipeline blocks are not runnable by themselves**. They are essentially **"definitions"** - they define the specifications and computational steps for a pipeline, but they do not contain any model states. To actually run them, you need to convert them into a `ModularPipeline` object.

-## PipelineState & BlockState
+## Modular Repo

-`PipelineState` and `BlockState` manage dataflow between pipeline blocks. `PipelineState` acts as the global state container that `ModularPipelineBlocks` operate on - each block gets a local view (`BlockState`) of the relevant variables it needs from `PipelineState`, performs its operations, and then updates `PipelineState` as needed.
+To convert blocks into a runnable pipeline, you may need a repository if your blocks contain **pretrained components** (models with checkpoints that need to be loaded from the Hub). Pipeline blocks define what components they need (like a UNet, text encoder, etc.), as well as how to create them. Components are created either with **from_pretrained** (loaded from checkpoints) or with **from_config** (initialized from scratch with a default configuration; these are usually stateless, like a guider or scheduler).

-<Tip>
+If your pipeline contains **pretrained components**, you typically need to use a repository to provide the loading specifications and metadata.

-You typically don't need to manually create or manage these state objects. The `ModularPipeline` automatically creates and manages them for you. However, understanding their roles is important for developing custom pipeline blocks.
-
-</Tip>
-
-## ModularPipeline
-
-`ModularPipeline` is the main interface to create and execute pipelines in the Modular Diffusers system.
-
-### Modular Repo
-
-`ModularPipeline` only works with modular repositories. You can find an example modular repo [here](https://huggingface.co/YiYiXu/modular-diffdiff).
+`ModularPipeline` works specifically with modular repositories, which offer more flexibility in component loading compared to traditional repositories. You can find an example modular repo [here](https://huggingface.co/YiYiXu/modular-diffdiff).

 A `DiffusionPipeline` defines `model_index.json` to configure its components. However, repositories for Modular Diffusers work with `modular_model_index.json`. Let's walk through the differences here.

@@ -338,13 +328,13 @@ In `modular_model_index.json`, each component entry contains 3 elements: `(libra

 ```py
 "text_encoder": [
-  null,  # library (same as model_index.json)
-  null,  # class (same as model_index.json)
+  null,  # library of the actual loaded component (same as in model_index.json)
+  null,  # class of the actual loaded component (same as in model_index.json)
  {      # loading specs map (unique to modular_model_index.json)
    "repo": "stabilityai/stable-diffusion-xl-base-1.0",  # can be a different repo
    "revision": null,
    "subfolder": "text_encoder",
-    "type_hint": [  # (library, class) for the expected component class
+    "type_hint": [  # (library, class) for the expected component
      "transformers",  
      "CLIPTextModel"
    ],
@@ -356,60 +346,61 @@ In `modular_model_index.json`, each component entry contains 3 elements: `(libra
 Unlike standard repositories where components must be in subfolders within the same repo, modular repositories can fetch components from different repositories based on the `loading_specs_dict`. e.g. the `text_encoder` component will be fetched from the "text_encoder" folder in `stabilityai/stable-diffusion-xl-base-1.0` while other components come from different repositories.


-### Creating a `ModularPipeline` from `ModularPipelineBlocks`
+## Creating a `ModularPipeline` from `ModularPipelineBlocks`

 Each `ModularPipelineBlocks` has an `init_pipeline` method that can initialize a `ModularPipeline` object based on its component and configuration specifications.

-Let's convert our `t2i_blocks` (which we created earlier) into a runnable `ModularPipeline`:
+Let's convert our `t2i_blocks` (which we created earlier) into a runnable `ModularPipeline`. We'll use a `ComponentsManager` to handle device placement, memory management, and component reuse automatically:

 ```py
 # We already have this from earlier
 t2i_blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)

 # Now convert it to a ModularPipeline
+from diffusers import ComponentsManager
 modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
-t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id)
+components = ComponentsManager()
+t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id, components_manager=components)
 ```

-The `init_pipeline()` method creates a ModularPipeline and loads component specifications from the repository's `modular_model_index.json` file, but doesn't load the actual models yet.
-
 <Tip>

-💡 We recommend using `ModularPipeline` with Component Manager by passing a `components_manager`:
-
-```py
->>> components = ComponentsManager()
->>> pipeline = blocks.init_pipeline(modular_repo_id, components_manager=components)
-```
-
-This helps you to:
-1. Detect and manage duplicated models (warns when trying to register an existing model)
-2. Easily reuse components across different pipelines
-3. Apply offloading strategies across multiple pipelines
-
-You can read more about [Components Manager](./components_manager.md)
+💡 **ComponentsManager** is the model registry and management system in Diffusers. It tracks all the models in one place and lets you add, remove, and reuse them across different workflows efficiently. Without it, you'd need to manually manage GPU memory, device placement, and component sharing between workflows. See the [Components Manager guide](components_manager.md) for detailed information.

 </Tip>

+The `init_pipeline()` method creates a ModularPipeline and loads component specifications from the repository's `modular_model_index.json` file, but doesn't load the actual models yet.

-### Creating a `ModularPipeline` with `from_pretrained`
+
+## Creating a `ModularPipeline` with `from_pretrained`

 You can create a `ModularPipeline` from a HuggingFace Hub repository with `from_pretrained` method, as long as it's a modular repo:

 ```py
-from diffusers import ModularPipeline
-pipeline = ModularPipeline.from_pretrained( "YiYiXu/modular-loader-t2i-0704")
+from diffusers import ModularPipeline, ComponentsManager
+components = ComponentsManager()
+pipeline = ModularPipeline.from_pretrained("YiYiXu/modular-loader-t2i-0704", components_manager=components)
 ```

 Loading custom code is also supported:

 ```py
-from diffusers import ModularPipeline
+from diffusers import ModularPipeline, ComponentsManager
+components = ComponentsManager()
 modular_repo_id = "YiYiXu/modular-diffdiff-0704"
-diffdiff_pipeline = ModularPipeline.from_pretrained(modular_repo_id, trust_remote_code=True)
+diffdiff_pipeline = ModularPipeline.from_pretrained(modular_repo_id, trust_remote_code=True, components_manager=components)
 ```

-This modular repository contains custom code. The [`config.json`](https://huggingface.co/YiYiXu/modular-diffdiff-0704/blob/main/config.json) file defines a custom `DiffDiffBlocks` class and points to its implementation:
+This modular repository contains custom code. The folder contains these files:
+
+```
+modular-diffdiff-0704/
+├── block.py                    # Custom pipeline blocks implementation
+├── config.json                 # Pipeline configuration and auto_map
+└── modular_model_index.json    # Component loading specifications
+```
+
+The [`config.json`](https://huggingface.co/YiYiXu/modular-diffdiff-0704/blob/main/config.json) file defines a custom `DiffDiffBlocks` class and points to its implementation:

 ```json
 {
@@ -424,7 +415,7 @@ The `auto_map` tells the pipeline where to find the custom blocks definition - i

 When `diffdiff_pipeline.blocks` is created, it's based on the `DiffDiffBlocks` definition from the custom code in the repository, allowing you to use specialized blocks that aren't part of the standard diffusers library.

-### Loading components into a `ModularPipeline`
+## Loading components into a `ModularPipeline`

 Unlike `DiffusionPipeline`, when you create a `ModularPipeline` instance (whether using `from_pretrained` or converting from pipeline blocks), its components aren't loaded automatically. You need to explicitly load model components using `load_default_components` or `load_components(names=..,)`:

@@ -551,7 +542,7 @@ StableDiffusionXLModularPipeline {
 }
 ```

-You can see all the components that will be loaded using `from_pretrained` method are listed as entries. Each entry contains 3 elements: `(library, class, loading_specs_dict)`:
+You can see all the **pretrained components** that will be loaded using `from_pretrained` method are listed as entries. Each entry contains 3 elements: `(library, class, loading_specs_dict)`:

 - **`library` and `class`**: Show the actual loaded component info. If `null`, the component is not loaded yet.
 - **`loading_specs_dict`**: Contains all the information needed to load the component (repo, subfolder, variant, etc.)
@@ -584,9 +575,11 @@ There are also a few properties that can provide a quick summary of component lo
 ['guider', 'image_processor']
 ```

-### Modifying Loading Specs
+Config components (those created with `from_config`, like `guider` and `image_processor`) are not included in the pipeline output above because they don't need loading specs: they're already initialized during pipeline creation. This is also why they're not listed in `null_component_names`.
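
As a rough illustration of the `(library, class, loading_specs_dict)` entry convention, the sketch below parses a made-up excerpt with the standard `json` module. The helper names `not_loaded` and `load_source`, the `example-org/example-unet` repo, and the loaded/unloaded split are assumptions for this example only; this is not how the pipeline computes its own summary properties.

```python
import json

# Hypothetical excerpt in the entry shape described above; the second repo
# name and the loaded/unloaded split are made up for illustration.
index = json.loads("""
{
  "text_encoder": [null, null, {"repo": "stabilityai/stable-diffusion-xl-base-1.0", "subfolder": "text_encoder"}],
  "unet": ["diffusers", "UNet2DConditionModel", {"repo": "example-org/example-unet", "subfolder": "unet"}]
}
""")

def not_loaded(entries):
    # An entry whose library and class are both null has not been loaded yet.
    return [name for name, (library, cls, _specs) in entries.items()
            if library is None and cls is None]

def load_source(entries, name):
    # Where the component would be fetched from, according to its loading specs.
    specs = entries[name][2]
    return f'{specs["repo"]}/{specs["subfolder"]}'

print(not_loaded(index))
print(load_source(index, "unet"))
```

Because each entry carries its own `repo`, two components in the same index can point at different repositories.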

-When you call `pipeline.load_components(names=)` or `pipeline.load_default_components()`, it uses the loading specs from the modular repository's `modular_model_index.json`. You can change where components are loaded from by default by modifying the `modular_model_index.json` in the repository. You can change any field in the loading specs: `repo`, `subfolder`, `variant`, `revision`, etc.
+## Modifying Loading Specs
+
+When you call `pipeline.load_components(names=)` or `pipeline.load_default_components()`, it uses the loading specs from the modular repository's `modular_model_index.json`. You can change where components are loaded from by modifying the `modular_model_index.json` in the repository. Just find the file on the Hub and click edit - you can change any field in the loading specs: `repo`, `subfolder`, `variant`, `revision`, etc.

 ```py
 # Original spec in modular_model_index.json
@@ -610,18 +603,31 @@ When you call `pipeline.load_components(names=)` or `pipeline.load_default_compo
 ]
 ```

-When you call `pipeline.load_components(...)`/`pipeline.load_default_components()`, it will now load from the new repository by default.
+Now, if you create a pipeline using the same blocks and the updated repository, it will load from the new repository by default.
+
+```py
+pipeline = ModularPipeline.from_pretrained("YiYiXu/modular-loader-t2i-0704", components_manager=components)
+pipeline.load_components(names="unet")
+```


-### Updating components in a `ModularPipeline`
+## Updating components in a `ModularPipeline`

 Similar to `DiffusionPipeline`, you can load components separately to replace the default ones in the pipeline. In Modular Diffusers, the approach depends on the component type:

- **Pretrained components** (`default_creation_method='from_pretrained'`): Must use `ComponentSpec` to load them, as they get tagged with a unique ID that encodes their loading parameters
- **Config components** (`default_creation_method='from_config'`): These are components that don't need loading specs - they're created during pipeline initialization with default config. To update them, you can either pass the object directly or pass a ComponentSpec directly (which will call `create()` under the hood).
+- **Pretrained components** (`default_creation_method='from_pretrained'`): You must load them with `ComponentSpec` in order to update the existing ones.
+- **Config components** (`default_creation_method='from_config'`): These are components that don't need loading specs - they're created during pipeline initialization with default config. To update them, you can either pass the object directly or pass a `ComponentSpec` directly.
+
+<Tip>
+
+💡 **Component Type Changes**: The component type (pretrained vs config-based) can change when you update components. These types are initially defined in pipeline blocks' `expected_components` field using `ComponentSpec` with `default_creation_method`. See the [Customizing Guidance Techniques](#customizing-guidance-techniques) section for examples of how this works in practice.
+
+</Tip>

 `ComponentSpec` defines how to create or load components and can actually create them using its `create()` method (for ConfigMixin objects) or `load()` method (wrapper around `from_pretrained()`). When a component is loaded with a ComponentSpec, it gets tagged with a unique ID that encodes its creation parameters, allowing you to always extract the original specification using `ComponentSpec.from_component()`.

+Now let's look at how to update pretrained components in practice:
+
 So instead of 

 ```py
@@ -629,7 +635,7 @@ from diffusers import UNet2DConditionModel
 import torch
 unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", variant="fp16", torch_dtype=torch.float16)
 ```
-You should do
+You should load your model like this:

 ```py
 from diffusers import ComponentSpec, UNet2DConditionModel
@@ -637,13 +643,15 @@ unet_spec = ComponentSpec(name="unet",type_hint=UNet2DConditionModel, repo="stab
 unet2 = unet_spec.load(torch_dtype=torch.float16)
 ```

-The key difference is that the second unet (the one we load with `ComponentSpec`) retains its loading specs, so you can extract and recreate it:
+The key difference is that the second unet retains its loading specs, so you can extract the spec and recreate the unet:

 ```py
-# to extract spec, you can do spec.load() to recreate it
+# component -> spec
 >>> spec = ComponentSpec.from_component("unet", unet2)
 >>> spec
 ComponentSpec(name='unet', type_hint=<class 'diffusers.models.unets.unet_2d_condition.UNet2DConditionModel'>, description=None, config=None, repo='stabilityai/stable-diffusion-xl-base-1.0', subfolder='unet', variant='fp16', revision=None, default_creation_method='from_pretrained')
+# spec -> component
+>>> unet2_recreated = spec.load(torch_dtype=torch.float16)
 ```

 To replace the unet in the pipeline
@@ -652,7 +660,7 @@ To replace the unet in the pipeline
 t2i_pipeline.update_components(unet=unet2)
 ```

-Not only is the `unet` component swapped, but its loading specs are also updated from "RunDiffusion/Juggernaut-XL-v9" to "stabilityai/stable-diffusion-xl-base-1.0". This means that if you save the pipeline now and load it back with `from_pretrained`, the new pipeline will by default load the SDXL original unet.
+Not only is the `unet` component swapped, but its loading specs are also updated from "RunDiffusion/Juggernaut-XL-v9" to "stabilityai/stable-diffusion-xl-base-1.0" in the pipeline config. This means that if you save the pipeline now and load it back with `from_pretrained`, the new pipeline will by default load the original SDXL unet.

 ```
 >>> t2i_pipeline
@@ -700,7 +708,7 @@ ComponentSpec(

 </Tip>

-### Customizing Guidance Techniques
+## Customizing Guidance Techniques

 Guiders are implementations of different [classifier-free guidance](https://huggingface.co/papers/2207.12598) techniques that can be applied during the denoising process to improve generation quality, control, and adherence to prompts. They work by steering the model predictions towards desired directions and away from undesired directions. In diffusers, guiders are implemented as subclasses of `BaseGuidance`. They can easily be integrated into modular pipelines and provide a flexible way to enhance generation quality without modifying the underlying diffusion models.

@@ -737,6 +745,9 @@ ClassifierFreeGuidance {
 To change parameters of the same guider type (e.g., adjusting the `guidance_scale` for CFG), you have two options:

 **Option 1: Use ComponentSpec.create() method**
+
+You just need to pass the parameter with the new value to override the default one.
+
 ```python
 >>> guider_spec = t2i_pipeline.get_component_spec("guider")
 >>> guider = guider_spec.create(guidance_scale=10)
@@ -744,6 +755,9 @@ To change parameters of the same guider type (e.g., adjusting the `guidance_scal
 ```

 **Option 2: Pass ComponentSpec directly**
+
+Update the spec directly and pass it to `update_components()`.
+
 ```python
 >>> guider_spec = t2i_pipeline.get_component_spec("guider")
 >>> guider_spec.config["guidance_scale"] = 10
@@ -785,7 +799,6 @@ ModularPipeline.update_components: adding guider with new type: PerturbedAttenti

 <Tip>

-💡 **Component Loading Methods**: 
 - For `from_config` components (like guiders, schedulers): You can pass an object of required type OR pass a ComponentSpec directly (which calls `create()` under the hood)
 - For `from_pretrained` components (like models): You must use ComponentSpec to ensure proper tagging and loading

@@ -826,24 +839,68 @@ The component spec has also been updated to reflect the new guider type:

 ```py
 >>> t2i_pipeline.get_component_spec("guider")
-ComponentSpec(name='guider', type_hint=<class 'diffusers.guiders.perturbed_attention_guidance.PerturbedAttentionGuidance'>, description=None, config=FrozenDict([('guidance_scale', 5.0), ('perturbed_guidance_scale', 2.5), ('perturbed_guidance_start', 0.01), ('perturbed_guidance_stop', 0.2), ('perturbed_guidance_layers', None), ('perturbed_guidance_config', LayerSkipConfig(indices=[2, 9], fqn='mid_block.attentions.0.transformer_blocks', skip_attention=False, skip_attention_scores=True, skip_ff=False, dropout=1.0)), ('guidance_rescale', 0.0), ('use_original_formulation', False), ('start', 0.0), ('stop', 1.0), ('_use_default_values', ['use_original_formulation', 'perturbed_guidance_stop', 'stop', 'guidance_rescale', 'start', 'perturbed_guidance_layers', 'perturbed_guidance_start']), ('_class_name', 'PerturbedAttentionGuidance'), ('_diffusers_version', '0.35.0.dev0')]), repo=None, subfolder=None, variant=None, revision=None, default_creation_method='from_config')
+ComponentSpec(name='guider', type_hint=<class 'diffusers.guiders.perturbed_attention_guidance.PerturbedAttentionGuidance'>, description=None, config=FrozenDict([('guidance_scale', 5.0), ('perturbed_guidance_scale', 2.5), ('perturbed_guidance_start', 0.01), ('perturbed_guidance_stop', 0.2), ('perturbed_guidance_layers', None), ('perturbed_guidance_config', LayerSkipConfig(indices=[2, 9], fqn='mid_block.attentions.0.transformer_blocks', skip_attention=False, skip_attention_scores=True, skip_ff=False, dropout=1.0)), ('guidance_rescale', 0.0), ('use_original_formulation', False), ('start', 0.0), ('stop', 1.0), ('_use_default_values', ['perturbed_guidance_start', 'use_original_formulation', 'perturbed_guidance_layers', 'stop', 'start', 'guidance_rescale', 'perturbed_guidance_stop']), ('_class_name', 'PerturbedAttentionGuidance'), ('_diffusers_version', '0.35.0.dev0')]), repo=None, subfolder=None, variant=None, revision=None, default_creation_method='from_config')
 ```

-However, the "guider" is still not included in the pipeline config and will not be saved into the `modular_model_index.json` since it remains a `from_config` component: 
+The "guider" is still a `from_config` component: it is not included in the pipeline config and will not be saved into the `modular_model_index.json`.

 ```py
 >>> assert "guider" not in  t2i_pipeline.config
 ```

+However, you can change it to a `from_pretrained` component, which allows you to upload your customized guider to the Hub and load it into your pipeline.
+
+#### Loading Custom Guiders from Hub
+
+If you already have a guider saved on the Hub and a `modular_model_index.json` with the loading spec for that guider, it will automatically be changed to a `from_pretrained` component during pipeline initialization.
+
+For example, this `modular_model_index.json` includes loading specs for the guider:
+
+```json
+{
+  "guider": [
+    null,
+    null,
+    {
+      "repo": "YiYiXu/modular-loader-t2i-guider",
+      "revision": null,
+      "subfolder": "pag_guider",
+      "type_hint": [
+        "diffusers",
+        "PerturbedAttentionGuidance"
+      ],
+      "variant": null
+    }
+  ]
+}
+```
+
+When you use this repository to create a pipeline with the same blocks (that originally configured guider as a `from_config` component), the guider becomes a `from_pretrained` component. This means it doesn't get created during initialization, and after you call `load_default_components()`, it loads based on the spec - resulting in the PAG guider instead of the default CFG.
+
+```py
+t2i_pipeline = t2i_blocks.init_pipeline("YiYiXu/modular-doc-guider")
+assert t2i_pipeline.guider is None  # Not created during init
+t2i_pipeline.load_default_components()
+t2i_pipeline.guider  # Now loaded as PAG guider
+```
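
The `(library, class, loading_specs_dict)` layout of each entry makes it easy to tell which components still need to be loaded. The following is a minimal sketch using only the standard library. It assumes entries shaped like the JSON above; the `unet` entry and the `split_by_loaded` helper are hypothetical, and this is not how diffusers itself reads the file.

```python
import json

# Hypothetical excerpt shaped like the `modular_model_index.json` entries above.
# The `unet` entry is made up for illustration; the `guider` entry mirrors the example.
INDEX_JSON = """
{
  "guider": [null, null, {"repo": "YiYiXu/modular-loader-t2i-guider",
                          "subfolder": "pag_guider", "variant": null, "revision": null}],
  "unet": ["diffusers", "UNet2DConditionModel",
           {"repo": "stabilityai/stable-diffusion-xl-base-1.0",
            "subfolder": "unet", "variant": "fp16", "revision": null}]
}
"""


def split_by_loaded(index):
    """Split component names into (loaded, not_loaded) lists.

    An entry counts as "not loaded yet" when both `library` and `class` are null.
    Keys whose value is not a 3-element list are skipped.
    """
    loaded, not_loaded = [], []
    for name, entry in index.items():
        if not (isinstance(entry, list) and len(entry) == 3):
            continue  # not a (library, class, loading_specs_dict) entry
        library, class_name, _specs = entry
        if library is None and class_name is None:
            not_loaded.append(name)
        else:
            loaded.append(name)
    return sorted(loaded), sorted(not_loaded)


index = json.loads(INDEX_JSON)
loaded, not_loaded = split_by_loaded(index)
print("loaded:", loaded)
print("not loaded yet:", not_loaded)
```

In this toy index the guider has a loading spec but no `library`/`class` yet, which corresponds to the state before `load_default_components()` is called.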
+
 #### Upload Custom Guider to Hub for Easy Loading & Sharing

-You can upload your customized guider to the Hub so that it can be loaded more easily:
+Now let's see how we can share the guider on the Hub and change it to a `from_pretrained` component.

 ```py
 guider.push_to_hub("YiYiXu/modular-loader-t2i-guider", subfolder="pag_guider")
 ```

-Voilà! Now you have a subfolder called `pag_guider` on that repository. Let's change our guider_spec to use `from_pretrained` as the default creation method and update the loading spec to use this subfolder we just created:
+Voilà! Now you have a subfolder called `pag_guider` on that repository. 
+
+You have a few options to make this guider available in your pipeline:
+
+1. **Directly modify the `modular_model_index.json`** to add a loading spec for the guider by pointing to a folder containing the desired guider config.
+
+2. **Use the `update_components` method** to change it to a `from_pretrained` component for your pipeline. This is easier if you just want to try it out with different repositories.
+
+Let's use the second approach and change our guider_spec to use `from_pretrained` as the default creation method and update the loading spec to use this subfolder we just created:

 ```python
 guider_spec = t2i_pipeline.get_component_spec("guider")
@@ -860,44 +917,14 @@ You will get a warning about changing the creation method:
 ModularPipeline.update_components: changing the default_creation_method of guider from from_config to from_pretrained.
 ```

-Now not only the `guider` component and its component_spec are updated, but so is the pipeline config. Let's push it to a new repository:
+Now not only are the `guider` component and its component spec updated, but so is the pipeline config.
+
+If you want to change the default behavior for future pipelines, you can push the updated pipeline to the Hub. This way, when others use your repository, they'll get the PAG guider by default. However, this is optional - you don't have to do this if you just want to experiment locally.

 ```py
 t2i_pipeline.push_to_hub("YiYiXu/modular-doc-guider")
 ```

-If you check the `modular_model_index.json`, you'll see the guider is now included:
-
-```json
-{
-  "guider": [
-    "diffusers",
-    "PerturbedAttentionGuidance",
-    {
-      "repo": "YiYiXu/modular-loader-t2i-guider",
-      "revision": null,
-      "subfolder": "pag_guider",
-      "type_hint": [
-        "diffusers",
-        "PerturbedAttentionGuidance"
-      ],
-      "variant": null
-    }
-  ]
-}
-```
-
-Now when you create the pipeline from that repo directly, the `guider` is not automatically loaded anymore (since it's now a `from_pretrained` component), but when you run `load_default_components()`, the PAG guider will be loaded by default:
-
-```py
-t2i_pipeline = t2i_blocks.init_pipeline("YiYiXu/modular-doc-guider")
-assert t2i_pipeline.guider is None
-t2i_pipeline.load_default_components()
-t2i_pipeline.guider
-```
-
-Of course, you can also directly modify the `modular_model_index.json` to add a loading spec for the guider by pointing to a folder containing the desired guider config.
-

 <Tip>

@@ -907,7 +934,7 @@ Additionally, you can write your own guider implementations, for example, CFG Ze

 </Tip>

-### Running a `ModularPipeline`
+## Running a `ModularPipeline`

 The API to run the `ModularPipeline` is very similar to how you would run a regular `DiffusionPipeline`:

@@ -926,14 +953,14 @@ Under the hood, `ModularPipeline`'s `__call__` method is a wrapper around the pi

 You can inspect the docstring of a `ModularPipeline` to check what arguments the pipeline accepts and how to specify the `output` you want. It will list all available outputs (basically everything in the intermediate pipeline state) so you can choose from the list.

-**Important**: It is important to always check the docstring because arguments can be different from standard pipelines that you're familar with. For example, in Modular Diffusers we standardized controlnet image input as `control_image`, but regular pipelines have inconsistencies over the names, e.g. controlnet text-to-image uses `image` while SDXL controlnet img2img uses `control_image`.
-
-**Note**: The `output` list might be longer than you expected - it includes everything in the intermediate state that you can choose to return. Most of the time, you'll just want `output="images"` or `output="latents"`.
-
 ```py
 t2i_pipeline.doc
 ```

+**Important**: Always check the docstring, because arguments can differ from the standard pipelines that you're familiar with. For example, in Modular Diffusers we standardized the controlnet image input as `control_image`, but regular pipelines name it inconsistently, e.g. controlnet text-to-image uses `image` while SDXL controlnet img2img uses `control_image`.
+
+**Note**: The `output` list might be longer than you expected - it includes everything in the intermediate state that you can choose to return. Most of the time, you'll just want `output="images"` or `output="latents"`.
+
 </Tip>

 #### Text-to-Image, Image-to-Image, and Inpainting
@@ -1072,7 +1099,7 @@ StableDiffusionXLAutoControlnetStep(

 <Tip>

-💡 **Auto Blocks**: This is first time we meet a Auto Blocks! `AutoPipelineBlocks` automatically adapt to your inputs by combining multiple workflows with conditional logic. This is why one convenient block can work for all tasks and controlnet types. See the [Auto Blocks Guide](https://huggingface.co/docs/diffusers/modular_diffusers/write_own_pipeline_block#autopipelineblocks) for more details.
+💡 **Auto Blocks**: This is the first time we meet Auto Blocks! `AutoPipelineBlocks` automatically adapt to your inputs by combining multiple workflows with conditional logic. This is why one convenient block can work for all tasks and controlnet types. See the [Auto Blocks Guide](./auto_pipeline_blocks.md) for more details.

 </Tip>

@@ -0,0 +1,42 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Getting Started with Modular Diffusers
+
+<Tip warning={true}>
+
+🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
+
+</Tip>
+
+With Modular Diffusers, we introduce a unified pipeline system that simplifies how you work with diffusion models. Instead of creating separate pipelines for each task, Modular Diffusers lets you:
+
+**Write Only What's New**: You won't need to write an entire pipeline from scratch every time you have a new use case. You can create pipeline blocks just for your new workflow's unique aspects and reuse existing blocks for existing functionalities. 
+
+**Assemble Like LEGO®**: You can mix and match between blocks in flexible ways. This allows you to write dedicated blocks unique to specific workflows, and then assemble different blocks into a pipeline that can be used more conveniently for multiple workflows. 
+
+
+Here's how our guides are organized to help you navigate the Modular Diffusers documentation:
+
+### 🚀 Running Pipelines
+- **[Modular Pipeline Guide](./modular_pipeline.md)** - How to use predefined blocks to build a pipeline and run it
+- **[Components Manager Guide](./components_manager.md)** - How to manage and reuse components across multiple pipelines
+
+### 📚 Creating PipelineBlocks
+- **[Pipeline and Block States](./modular_diffusers_states.md)** - Understanding PipelineState and BlockState
+- **[Pipeline Block](./pipeline_block.md)** - How to write custom PipelineBlocks
+- **[SequentialPipelineBlocks](./sequential_pipeline_blocks.md)** - Connecting blocks in sequence
+- **[LoopSequentialPipelineBlocks](./loop_sequential_pipeline_blocks.md)** - Creating iterative workflows
+- **[AutoPipelineBlocks](./auto_pipeline_blocks.md)** - Conditional block selection
+
+### 🎯 Practical Examples
+- **[End-to-End Example](./end_to_end_guide.md)** - Complete end-to-end examples including sharing your workflow in huggingface hub and deplying UI nodes
@@ -0,0 +1,292 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# PipelineBlock
+
+<Tip warning={true}>
+
+🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
+
+</Tip>
+
+In Modular Diffusers, you build your workflow using `ModularPipelineBlocks`. We support 4 different types of blocks: `PipelineBlock`, `SequentialPipelineBlocks`, `LoopSequentialPipelineBlocks`, and `AutoPipelineBlocks`. Among them, `PipelineBlock` is the most fundamental building block of the whole system - it's like a brick in a Lego system. These blocks are designed to easily connect with each other, allowing for modular construction of creative and potentially very complex workflows.
+
+<Tip>
+
+**Important**: `PipelineBlock`s are definitions/specifications, not runnable pipelines. They define what a block should do and what data it needs, but you need to convert them into a `ModularPipeline` to actually execute them. For information on creating and running pipelines, see the [Modular Pipeline guide](./modular_pipeline.md).
+
+</Tip>
+
+In this tutorial, we will focus on how to write a basic `PipelineBlock` and how it interacts with the pipeline state.
+
+## PipelineState
+
+Before we dive into creating `PipelineBlock`s, make sure you have a basic understanding of `PipelineState`. It acts as the global state container that all blocks operate on - each block gets a local view (`BlockState`) of the relevant variables it needs from `PipelineState`, performs its operations, and then updates `PipelineState` with any changes. See the [PipelineState and BlockState guide](./modular_diffusers_states.md) for more details.
+
+## Define a `PipelineBlock`
+
+To write a `PipelineBlock` class, you need to define a few properties that determine how your block interacts with the pipeline state. Understanding these properties is crucial - they define what data your block can access and what it can produce.
+
+The three main properties you need to define are:
+- `inputs`: Immutable values from the user that cannot be modified
+- `intermediate_inputs`: Mutable values from previous blocks that can be read and modified  
+- `intermediate_outputs`: New values your block creates for subsequent blocks and user access
+
+Let's explore each one and understand how they work with the pipeline state.
+
+**Inputs: Immutable User Values**
+
+Inputs are variables your block needs from the immutable pipeline state - these are user-provided values that cannot be modified by any block. You define them using `InputParam`:
+
+```py
+user_inputs = [
+    InputParam(name="image", type_hint="PIL.Image", description="raw input image to process")
+]
+```
+
+When you list something as an input, you're saying "I need this value directly from the end user, and I will talk to them directly, telling them what I need in the 'description' field. They will provide it and it will come to me unchanged."
+
+This is especially useful for raw values that serve as the "source of truth" in your workflow. For example, with a raw image, many workflows require preprocessing steps like resizing that a previous block might have performed. But in many cases, you also want the raw PIL image. In some inpainting workflows, you need the original image to overlay with the generated result for better control and consistency.
+
+**Intermediate Inputs: Mutable Values from Previous Blocks, or Users**
+
+Intermediate inputs are variables your block needs from the mutable pipeline state - these are values that can be read and modified. They're typically created by previous blocks, but can also be provided directly by the user when no previous block creates them:
+
+```py
+user_intermediate_inputs = [
+    InputParam(name="processed_image", type_hint="torch.Tensor", description="image that has been preprocessed and normalized"),
+]
+```
+
+When you list something as an intermediate input, you're saying "I need this value, but I want to work with a different block that has already created it. I already know for sure that I can get it from this other block, but it's okay if other developers want to use something different."
+
+**Intermediate Outputs: New Values for Subsequent Blocks and User Access**
+
+Intermediate outputs are new variables your block creates and adds to the mutable pipeline state. They serve two purposes:
+
+1. **For subsequent blocks**: They can be used as intermediate inputs by other blocks in the pipeline
+2. **For users**: They become available as final outputs that users can access when running the pipeline
+
+```py
+user_intermediate_outputs = [
+    OutputParam(name="image_latents", description="latents representing the image")
+]
+```
+
+Intermediate inputs and intermediate outputs work together like Lego studs and anti-studs - they're the connection points that make blocks modular. When one block produces an intermediate output, it becomes available as an intermediate input for subsequent blocks. This is where the "modular" nature of the system really shines - blocks can be connected and reconnected in different ways as long as their inputs and outputs match.
+
+Additionally, all intermediate outputs are accessible to users when they run the pipeline. Typically you only need the final images, but you can also access intermediate results like latents, embeddings, or the outputs of other processing steps.
+
+**The `__call__` Method Structure**
+
+Your `PipelineBlock`'s `__call__` method should follow this structure:
+
+```py
+def __call__(self, components, state):
+    # Get a local view of the state variables this block needs
+    block_state = self.get_block_state(state)
+    
+    # Your computation logic here
+    # block_state contains all your inputs and intermediate_inputs
+    # You can access them like: block_state.image, block_state.processed_image
+    
+    # Update the pipeline state with your updated block_state
+    self.set_block_state(state, block_state)
+    return components, state
+```
+
+The `block_state` object contains all the variables you defined in `inputs` and `intermediate_inputs`, making them easily accessible for your computation.
+
+**Components and Configs**
+
+You can define the components and pipeline-level configs your block needs using `ComponentSpec` and `ConfigSpec`:
+
+```py
+from diffusers import ComponentSpec, ConfigSpec, EulerDiscreteScheduler, UNet2DConditionModel
+
+# Define components your block needs
+expected_components = [
+    ComponentSpec(name="unet", type_hint=UNet2DConditionModel),
+    ComponentSpec(name="scheduler", type_hint=EulerDiscreteScheduler)
+]
+
+# Define pipeline-level configs
+expected_config = [
+    ConfigSpec("force_zeros_for_empty_prompt", True)
+]
+```
+
+**Components**: In the `ComponentSpec`, you must provide a `name` and ideally a `type_hint`. You can also specify a `default_creation_method` to indicate whether the component should be loaded from a pretrained model or created with default configurations. The actual loading details (`repo`, `subfolder`, `variant` and `revision` fields) are typically specified when creating the pipeline, as we covered in the [Modular Pipeline Guide](./modular_pipeline.md).
+
+**Configs**: Pipeline-level settings that control behavior across all blocks.
+
+When you convert your blocks into a pipeline using `blocks.init_pipeline()`, the pipeline collects all component requirements from the blocks and fetches the loading specs from the modular repository. The components are then made available to your block as the first argument of the `__call__` method. You can access any component you need using dot notation:
+
+```py
+def __call__(self, components, state):
+    # Access components using dot notation
+    unet = components.unet
+    vae = components.vae
+    scheduler = components.scheduler
+```
+
+That's all you need to define in order to create a `PipelineBlock`. There is no hidden complexity. In fact, we are going to create a helper function that takes exactly these variables as input and returns a pipeline block. We will use this helper function throughout the tutorial to create test blocks.
+
+Note that for the `__call__` method, the only part you should implement differently is the part between `self.get_block_state()` and `self.set_block_state()`, which can be abstracted into a simple function that takes `block_state` and returns the updated state. Our helper function accepts a `block_fn` that does exactly that; it also receives the pipeline `state` as a second argument so we can print it for inspection.
+
+**Helper Function**
+
+```py
+from diffusers.modular_pipelines import PipelineBlock, InputParam, OutputParam
+import torch
+
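+def make_block(inputs=None, intermediate_inputs=None, intermediate_outputs=None, block_fn=None, description=None):
+    # Use None defaults instead of mutable list defaults; fall back to empty lists
+    inputs = inputs if inputs is not None else []
+    intermediate_inputs = intermediate_inputs if intermediate_inputs is not None else []
+    intermediate_outputs = intermediate_outputs if intermediate_outputs is not None else []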
+    class TestBlock(PipelineBlock):
+        model_name = "test"
+        
+        @property
+        def inputs(self):
+            return inputs
+            
+        @property
+        def intermediate_inputs(self):
+            return intermediate_inputs
+            
+        @property
+        def intermediate_outputs(self):
+            return intermediate_outputs
+            
+        @property
+        def description(self):
+            return description if description is not None else ""
+            
+        def __call__(self, components, state):
+            block_state = self.get_block_state(state)
+            if block_fn is not None:
+                block_state = block_fn(block_state, state)
+            self.set_block_state(state, block_state)
+            return components, state
+    
+    return TestBlock
+```
+
+## Example: Creating a Simple Pipeline Block
+
+Let's create a simple block to see how these definitions interact with the pipeline state. To better understand what's happening, we'll print out the states before and after updates to inspect them:
+
+```py
+inputs = [
+    InputParam(name="image", type_hint="PIL.Image", description="raw input image to process")
+]
+
+intermediate_inputs = [InputParam(name="batch_size", type_hint=int)]
+
+intermediate_outputs = [
+    OutputParam(name="image_latents", description="latents representing the image")
+]
+
+def image_encoder_block_fn(block_state, pipeline_state):
+    print(f"pipeline_state (before update): {pipeline_state}")
+    print(f"block_state (before update): {block_state}")
+    
+    # Simulate processing the image
+    block_state.image = torch.randn(1, 3, 512, 512)
+    block_state.batch_size = block_state.batch_size * 2
+    block_state.processed_image = [torch.randn(1, 3, 512, 512)] * block_state.batch_size
+    block_state.image_latents = torch.randn(1, 4, 64, 64)
+    
+    print(f"block_state (after update): {block_state}")
+    return block_state
+
+# Create a block with our definitions
+image_encoder_block_cls = make_block(
+    inputs=inputs, 
+    intermediate_inputs=intermediate_inputs,
+    intermediate_outputs=intermediate_outputs, 
+    block_fn=image_encoder_block_fn,
+    description="Encode raw image into its latent representation"
+)
+image_encoder_block = image_encoder_block_cls()
+pipe = image_encoder_block.init_pipeline()
+```
+
+Let's check the pipeline's docstring to see what inputs it expects:
+```py
+>>> print(pipe.doc)
+class TestBlock
+
+  Encode raw image into its latent representation
+
+  Inputs:
+
+      image (`PIL.Image`, *optional*):
+          raw input image to process
+
+      batch_size (`int`, *optional*):
+
+  Outputs:
+
+      image_latents (`None`):
+          latents representing the image
+```
+
+Notice that `batch_size` appears as an input even though we defined it as an intermediate input. This happens because no previous block provided it, so the pipeline makes it available as a user input. However, unlike regular inputs, this value goes directly into the mutable intermediate state.
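
The rule described here (an intermediate input that no earlier block produces is surfaced to the user) can be sketched without diffusers at all. The snippet below is a toy model under stated assumptions: `BlockSpec`, `user_facing_inputs`, and the `denoise` block with its `num_steps` and `latents` names are hypothetical and exist only for illustration. It is not the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BlockSpec:
    """Toy stand-in for the names a block declares (not a diffusers class)."""
    name: str
    inputs: list = field(default_factory=list)
    intermediate_inputs: list = field(default_factory=list)
    intermediate_outputs: list = field(default_factory=list)


def user_facing_inputs(blocks):
    """Return the names a user can pass when the blocks run in the given order."""
    available = set()  # intermediates produced by earlier blocks
    exposed = []       # names surfaced to the user, in first-seen order
    for block in blocks:
        # Regular inputs always come from the user.
        for name in block.inputs:
            if name not in exposed:
                exposed.append(name)
        # Intermediate inputs are surfaced only if no earlier block produced them.
        for name in block.intermediate_inputs:
            if name not in available and name not in exposed:
                exposed.append(name)
        available.update(block.intermediate_outputs)
    return exposed


encoder = BlockSpec(
    "encoder",
    inputs=["image"],
    intermediate_inputs=["batch_size"],
    intermediate_outputs=["image_latents"],
)
# Hypothetical second block that consumes the encoder's output.
denoise = BlockSpec(
    "denoise",
    intermediate_inputs=["image_latents", "num_steps"],
    intermediate_outputs=["latents"],
)

print(user_facing_inputs([encoder]))
print(user_facing_inputs([encoder, denoise]))
```

With only the encoder block, `batch_size` is surfaced next to `image`, matching the docstring above. Once a second block consumes `image_latents`, that name stays internal because the encoder already produces it.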
+
+Now let's run the pipeline:
+
+```py
+from diffusers.utils import load_image
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/image_of_squirrel_painting.png")
+state = pipe(image=image, batch_size=2)
+print(f"pipeline_state (after update): {state}")
+```
+```out
+pipeline_state (before update): PipelineState(
+  inputs={
+    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494550>
+  },
+  intermediates={
+    batch_size: 2
+  },
+)
+block_state (before update): BlockState(
+    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494640>
+    batch_size: 2
+)
+
+block_state (after update): BlockState(
+    image: Tensor(dtype=torch.float32, shape=torch.Size([1, 3, 512, 512]))
+    batch_size: 4
+    processed_image: List[4] of Tensors with shapes [torch.Size([1, 3, 512, 512]), torch.Size([1, 3, 512, 512]), torch.Size([1, 3, 512, 512]), torch.Size([1, 3, 512, 512])]
+    image_latents: Tensor(dtype=torch.float32, shape=torch.Size([1, 4, 64, 64]))
+)
+pipeline_state (after update): PipelineState(
+  inputs={
+    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494550>
+  },
+  intermediates={
+    batch_size: 4
+    image_latents: Tensor(dtype=torch.float32, shape=torch.Size([1, 4, 64, 64]))
+  },
+)
+```
+
+**Key Observations:**
+
+1. **Before the update**: `image` (the input) goes to the immutable inputs dict, while `batch_size` (the intermediate_input) goes to the mutable intermediates dict, and both are available in `block_state`.
+
+2. **After the update**:
+   - **`image` (inputs)** changed in `block_state` but not in `pipeline_state`. This change is local to the block.
+   - **`batch_size` (intermediate_inputs)** was updated in both `block_state` and `pipeline_state`. This change affects subsequent blocks. We didn't need to declare it as an intermediate output because it was already in the intermediates dict.
+   - **`image_latents` (intermediate_outputs)** was added to `pipeline_state` because it was declared as an intermediate output.
+   - **`processed_image`** was not added to `pipeline_state` because it wasn't declared as an intermediate output.
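+
+The write-back rules above can be sketched in plain Python. This is a toy model for illustration only, not the diffusers implementation: `run_block` and `encode` are hypothetical names, the state is a plain dict, and strings stand in for images and tensors.
+
+```py
+from types import SimpleNamespace
+
+
+def run_block(state, inputs, intermediate_inputs, intermediate_outputs, block_fn):
+    """Toy stand-in for the get_block_state / set_block_state round trip (hypothetical)."""
+    # Build the block's local view from the inputs dict and the intermediates dict.
+    view = {name: state["inputs"].get(name) for name in inputs}
+    view.update({name: state["intermediates"].get(name) for name in intermediate_inputs})
+    block_state = block_fn(SimpleNamespace(**view))
+
+    # Write back only intermediate inputs and declared intermediate outputs.
+    # The inputs dict is never written to.
+    for name in list(intermediate_inputs) + list(intermediate_outputs):
+        if hasattr(block_state, name):
+            state["intermediates"][name] = getattr(block_state, name)
+    return state
+
+
+def encode(block_state):
+    block_state.image = "tensor"            # input: the change stays local to the block
+    block_state.batch_size *= 2             # intermediate input: written back
+    block_state.processed_image = ["tensor"] * block_state.batch_size  # undeclared: dropped
+    block_state.image_latents = "latents"   # declared intermediate output: written back
+    return block_state
+
+
+state = {"inputs": {"image": "raw PIL image"}, "intermediates": {"batch_size": 2}}
+state = run_block(state, ["image"], ["batch_size"], ["image_latents"], encode)
+print(state)
+```
+
+After the call, `state["inputs"]` still holds the raw image, while `state["intermediates"]` holds the doubled `batch_size` and the new `image_latents`. `processed_image` does not appear anywhere, which mirrors the output shown above.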
@@ -0,0 +1,189 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# SequentialPipelineBlocks
+
+<Tip warning={true}>
+
+🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
+
+</Tip>
+
+`SequentialPipelineBlocks` is a subclass of `ModularPipelineBlocks`. Unlike `PipelineBlock`, it is a multi-block that composes other blocks together in sequence, creating modular workflows where data flows from one block to the next. It's one of the most common ways to build complex pipelines by combining simpler building blocks.
+
+<Tip>
+
+Other types of multi-blocks include [AutoPipelineBlocks](auto_pipeline_blocks.md) (for conditional block selection) and [LoopSequentialPipelineBlocks](loop_sequential_pipeline_blocks.md) (for iterative workflows). For information on creating individual blocks, see the [PipelineBlock guide](pipeline_block.md).
+
+Additionally, like all `ModularPipelineBlocks`, `SequentialPipelineBlocks` are definitions/specifications, not runnable pipelines. You need to convert them into a `ModularPipeline` to actually execute them. For information on creating and running pipelines, see the [Modular Pipeline guide](modular_pipeline.md).
+
+</Tip>
+
+In this tutorial, we will focus on how to create `SequentialPipelineBlocks` and how blocks connect and work together.
+
+The key insight is that blocks connect through their intermediate inputs and outputs - the "studs and anti-studs" we discussed in the [PipelineBlock guide](pipeline_block.md). When one block produces an intermediate output, it becomes available as an intermediate input for subsequent blocks.
+
+Let's explore this through an example. We will use the same helper function from the PipelineBlock guide to create blocks.
+
+```py
+from diffusers.modular_pipelines import PipelineBlock, InputParam, OutputParam
+import torch
+
+def make_block(inputs=[], intermediate_inputs=[], intermediate_outputs=[], block_fn=None, description=None):
+    class TestBlock(PipelineBlock):
+        model_name = "test"
+        
+        @property
+        def inputs(self):
+            return inputs
+            
+        @property
+        def intermediate_inputs(self):
+            return intermediate_inputs
+            
+        @property
+        def intermediate_outputs(self):
+            return intermediate_outputs
+            
+        @property
+        def description(self):
+            return description if description is not None else ""
+            
+        def __call__(self, components, state):
+            block_state = self.get_block_state(state)
+            if block_fn is not None:
+                block_state = block_fn(block_state, state)
+            self.set_block_state(state, block_state)
+            return components, state
+    
+    return TestBlock
+```
+
+Let's create a block that produces `batch_size`, which we'll call "input_block":
+
+```py
+def input_block_fn(block_state, pipeline_state):
+    
+    batch_size = len(block_state.prompt)
+    block_state.batch_size = batch_size * block_state.num_images_per_prompt
+    
+    return block_state
+
+input_block_cls = make_block(
+    inputs=[
+        InputParam(name="prompt", type_hint=list, description="list of text prompts"),
+        InputParam(name="num_images_per_prompt", type_hint=int, description="number of images per prompt")
+    ],
+    intermediate_outputs=[
+        OutputParam(name="batch_size", description="calculated batch size")
+    ],
+    block_fn=input_block_fn,
+    description="A block that determines batch_size based on the number of prompts and num_images_per_prompt argument."
+)
+input_block = input_block_cls()
+```
+
+Now let's create a second block that uses the `batch_size` from the first block:
+
+```py
+def image_encoder_block_fn(block_state, pipeline_state):
+    # Simulate processing the image
+    block_state.image = torch.randn(1, 3, 512, 512)
+    block_state.batch_size = block_state.batch_size * 2
+    block_state.image_latents = torch.randn(1, 4, 64, 64)
+    return block_state
+
+image_encoder_block_cls = make_block(
+    inputs=[
+        InputParam(name="image", type_hint="PIL.Image", description="raw input image to process")
+    ],
+    intermediate_inputs=[
+        InputParam(name="batch_size", type_hint=int)
+    ],
+    intermediate_outputs=[
+        OutputParam(name="image_latents", description="latents representing the image")
+    ],
+    block_fn=image_encoder_block_fn,
+    description="Encode raw image into its latent representation"
+)
+image_encoder_block = image_encoder_block_cls()
+```
+
+Now let's connect these blocks to create a `SequentialPipelineBlocks`:
+
+```py
+from diffusers.modular_pipelines import SequentialPipelineBlocks, InsertableDict
+
+# Define a dict mapping block names to block instances
+blocks_dict = InsertableDict()
+blocks_dict["input"] = input_block
+blocks_dict["image_encoder"] = image_encoder_block
+
+# Create the SequentialPipelineBlocks
+blocks = SequentialPipelineBlocks.from_blocks_dict(blocks_dict)
+```
+
+Now you have a `SequentialPipelineBlocks` with 2 blocks:
+
+```py
+>>> blocks
+SequentialPipelineBlocks(
+  Class: ModularPipelineBlocks
+
+  Description: 
+
+
+  Sub-Blocks:
+    [0] input (TestBlock)
+       Description: A block that determines batch_size based on the number of prompts and num_images_per_prompt argument.
+
+    [1] image_encoder (TestBlock)
+       Description: Encode raw image into its latent representation
+
+)
+```
+
+When you inspect `blocks.doc`, you can see that `batch_size` is not listed as an input. `SequentialPipelineBlocks` automatically detects that `input_block` produces `batch_size` for `image_encoder_block`, so it doesn't ask the user to provide it.
+
+```py
+>>> print(blocks.doc)
+class SequentialPipelineBlocks
+
+  Inputs:
+
+      prompt (`None`, *optional*):
+
+      num_images_per_prompt (`None`, *optional*):
+
+      image (`PIL.Image`, *optional*):
+          raw input image to process
+
+  Outputs:
+
+      batch_size (`None`):
+
+      image_latents (`None`):
+          latents representing the image
+```
+
+At runtime, the data flows like this:
+
+![Data Flow Diagram](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/modular_quicktour/Editor%20_%20Mermaid%20Chart-2025-06-30-092631.png)
+
+**How SequentialPipelineBlocks Works:**
+
+1. Blocks are executed in the order they're registered in the `blocks_dict`
+2. Outputs from one block become available as intermediate inputs to all subsequent blocks
+3. The pipeline automatically figures out which values need to be provided by the user and which will be generated by previous blocks
+4. Each block maintains its own behavior and operates through its defined interface, while collectively these interfaces determine what the entire pipeline accepts and produces
+
+What happens within each block follows the same pattern we described earlier: each block gets its own `block_state` with the relevant inputs and intermediate inputs, performs its computation, and updates the pipeline state with its intermediate outputs.
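+
+To make point 3 concrete, here is a minimal sketch of how the combined interface could be derived from the blocks' declarations. It is a toy model under stated assumptions, not the library's actual resolution logic: `resolve_interface` is a hypothetical helper, and each block is reduced to a plain dict of name lists.
+
+```py
+def resolve_interface(blocks):
+    """Toy model (hypothetical): split names into user-provided inputs and produced outputs."""
+    user_inputs, produced = [], []
+    for block in blocks:
+        # Regular inputs always come from the user.
+        for name in block["inputs"]:
+            if name not in user_inputs:
+                user_inputs.append(name)
+        # Intermediate inputs are requested from the user only if no earlier block produces them.
+        for name in block["intermediate_inputs"]:
+            if name not in produced and name not in user_inputs:
+                user_inputs.append(name)
+        # Intermediate outputs become available to all subsequent blocks.
+        for name in block["intermediate_outputs"]:
+            if name not in produced:
+                produced.append(name)
+    return user_inputs, produced
+
+
+input_block_spec = {
+    "inputs": ["prompt", "num_images_per_prompt"],
+    "intermediate_inputs": [],
+    "intermediate_outputs": ["batch_size"],
+}
+image_encoder_block_spec = {
+    "inputs": ["image"],
+    "intermediate_inputs": ["batch_size"],
+    "intermediate_outputs": ["image_latents"],
+}
+
+user_inputs, outputs = resolve_interface([input_block_spec, image_encoder_block_spec])
+print(user_inputs)  # ['prompt', 'num_images_per_prompt', 'image']
+print(outputs)      # ['batch_size', 'image_latents']
+
+# On its own, the image encoder has no upstream producer, so batch_size is requested from the user.
+solo_inputs, solo_outputs = resolve_interface([image_encoder_block_spec])
+print(solo_inputs)  # ['image', 'batch_size']
+```
+
+The two results line up with the `Inputs` and `Outputs` sections of `blocks.doc` above and with the single-block docstring from the PipelineBlock guide, where `batch_size` is listed as an input.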
@@ -1,817 +0,0 @@
-<!--Copyright 2025 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-->
-
-# Writing Your Own Pipeline Blocks
-
-<Tip warning={true}>
-
-🧪 **Experimental Feature**: Modular Diffusers is an experimental feature we are actively developing. The API may be subject to breaking changes.
-
-</Tip>
-
-In Modular Diffusers, you build your workflow using `ModularPipelineBlocks`. We support 4 different types of blocks: `PipelineBlock`, `SequentialPipelineBlocks`, `LoopSequentialPipelineBlocks`, and `AutoPipelineBlocks`. Among them, `PipelineBlock` is the most fundamental building block of the whole system - it's like a brick in a Lego system. These blocks are designed to easily connect with each other, allowing for modular construction of creative and potentially very complex workflows.
-
-In this tutorial, we will focus on how to write a basic `PipelineBlock` and how it interacts with other components in the system. We will also cover how to connect them together using the multi-blocks: `SequentialPipelineBlocks`, `LoopSequentialPipelineBlocks`, and `AutoPipelineBlocks`.
-
-
-## Understanding the Foundation: `PipelineState`
-
-Before we dive into creating `PipelineBlock`s, we need to have a basic understanding of `PipelineState` - the core data structure that all blocks operate on. This concept is fundamental to understanding how blocks interact with each other and the pipeline system.
-
-In the modular diffusers system, `PipelineState` acts as the global state container that `PipelineBlock`s operate on - each block gets a local view (`BlockState`) of the relevant variables it needs from `PipelineState`, performs its operations, and then updates `PipelineState` with any changes.
-
-While `PipelineState` maintains the complete runtime state of the pipeline, `PipelineBlock`s define what parts of that state they can read from and write to through their `input`s, `intermediates_inputs`, and `intermediates_outputs` properties.
-
-A `PipelineState` consists of two distinct states:
- The **immutable state** (i.e. the `inputs` dict) contains a copy of values provided by users. Once a value is added to the immutable state, it cannot be changed. Blocks can read from the immutable state but cannot write to it.
- The **mutable state** (i.e. the `intermediates` dict) contains variables that are passed between blocks and can be modified by them.
-
-Here's an example of what a `PipelineState` looks like:
-
-```
-PipelineState(
-  inputs={
-    prompt: 'a cat'
-    guidance_scale: 7.0
-    num_inference_steps: 25
-  },
-  intermediates={
-    prompt_embeds: Tensor(dtype=torch.float32, shape=torch.Size([1, 1, 1, 1]))
-    negative_prompt_embeds: None
-  },
-```
-
-## Creating a `PipelineBlock`
-
-To write a `PipelineBlock` class, you need to define a few properties that determine how your block interacts with the pipeline state. Understanding these properties is crucial - they define what data your block can access and what it can produce.
-
-The three main properties you need to define are:
- `inputs`: Immutable values from the user that cannot be modified
- `intermediate_inputs`: Mutable values from previous blocks that can be read and modified  
- `intermediate_outputs`: New values your block creates for subsequent blocks
-
-Let's explore each one and understand how they work with the pipeline state.
-
-**Inputs: Immutable User Values**
-
-Inputs are variables your block needs from the immutable pipeline state - these are user-provided values that cannot be modified by any block. You define them using `InputParam`:
-
-```py
-user_inputs = [
-    InputParam(name="image", type_hint="PIL.Image", description="raw input image to process")
-]
-```
-
-When you list something as an input, you're saying "I need this value directly from the end user, and I will talk to them directly, telling them what I need in the 'description' field. They will provide it and it will come to me unchanged."
-
-This is especially useful for raw values that serve as the "source of truth" in your workflow. For example, with a raw image, many workflows require preprocessing steps like resizing that a previous block might have performed. But in many cases, you also want the raw PIL image. In some inpainting workflows, you need the original image to overlay with the generated result for better control and consistency.
-
-**Intermediate Inputs: Mutable Values from Previous Blocks**
-
-Intermediate inputs are variables your block needs from the mutable pipeline state - these are values that can be read and modified. They're typically created by previous blocks, but could also be directly provided by the user if not the case:
-
-```py
-user_intermediate_inputs = [
-    InputParam(name="processed_image", type_hint="torch.Tensor", description="image that has been preprocessed and normalized"),
-]
-```
-
-When you list something as an intermediate input, you're saying "I need this value, but I want to work with a different block that has already created it. I already know for sure that I can get it from this other block, but it's okay if other developers want use something different."
-
-**Intermediate Outputs: New Values for Subsequent Blocks**
-
-Intermediate outputs are new variables your block creates and adds to the mutable pipeline state so they can be used by subsequent blocks:
-
-```py
-user_intermediate_outputs = [
-    OutputParam(name="image_latents", description="latents representing the image")
-]
-```
-
-Intermediate inputs and intermediate outputs work together like Lego studs and anti-studs - they're the connection points that make blocks modular. When one block produces an intermediate output, it becomes available as an intermediate input for subsequent blocks. This is where the "modular" nature of the system really shines - blocks can be connected and reconnected in different ways as long as their inputs and outputs match. We will see more how they connect when we talk about multi-blocks.
-
-**The `__call__` Method Structure**
-
-Your `PipelineBlock`'s `__call__` method should follow this structure:
-
-```py
-def __call__(self, components, state):
-    # Get a local view of the state variables this block needs
-    block_state = self.get_block_state(state)
-    
-    # Your computation logic here
-    # block_state contains all your inputs and intermediate_inputs
-    # You can access them like: block_state.image, block_state.processed_image
-    
-    # Update the pipeline state with your updated block_states
-    self.set_block_state(state, block_state)
-    return components, state
-```
-
-The `block_state` object contains all the variables you defined in `inputs` and `intermediate_inputs`, making them easily accessible for your computation.
-
-**Components and Configs**
-
-You can define the components and pipeline-level configs your block needs using `ComponentSpec` and `ConfigSpec`:
-
-```py
-from diffusers import ComponentSpec, ConfigSpec
-
-# Define components your block needs
-expected_components = [
-    ComponentSpec(name="unet", type_hint=UNet2DConditionModel),
-    ComponentSpec(name="scheduler", type_hint=EulerDiscreteScheduler)
-]
-
-# Define pipeline-level configs
-expected_config = [
-    ConfigSpec("force_zeros_for_empty_prompt", True)
-]
-```
-
-**Components**: In the `ComponentSpec`, You must provide a `name` and ideally a `type_hint`. The actual loading details (`repo`, `subfolder`, `variant` and `revision` fields) are typically specified when creating the pipeline, as we covered in the [Getting Started Guide](https://huggingface.co/docs/diffusers/en/modular_diffusers/getting_started#loading-components-into-a-modularpipeline).
-
-**Configs**: Simple pipeline-level settings that control behavior across all blocks.
-
-When you convert your blocks into a pipeline using `blocks.init_pipeline()`, the pipeline collects all component requirements from the blocks and fetches the loading specs from the modular repository. The components are then made available to your block in the `components` argument of the `__call__` method.
-
-That's all you need to define in order to create a `PipelineBlock`. There is no hidden complexity. In fact we are going to create a helper function that take exactly these variables as input and return a pipeline block. We will use this helper function through out the tutorial to create test blocks
-
-Note that for `__call__` method, the only part you should implement differently is the part between `self.get_block_state()` and `self.set_block_state()`, which can be abstracted into a simple function that takes `block_state` and returns the updated state. Our helper function accepts a `block_fn` that does exactly that.
-
-**Helper Function**
-
-```py
-from diffusers.modular_pipelines import PipelineBlock, InputParam, OutputParam
-import torch
-
-def make_block(inputs=[], intermediate_inputs=[], intermediate_outputs=[], block_fn=None, description=None):
-    class TestBlock(PipelineBlock):
-        model_name = "test"
-        
-        @property
-        def inputs(self):
-            return inputs
-            
-        @property
-        def intermediate_inputs(self):
-            return intermediate_inputs
-            
-        @property
-        def intermediate_outputs(self):
-            return intermediate_outputs
-            
-        @property
-        def description(self):
-            return description if description is not None else ""
-            
-        def __call__(self, components, state):
-            block_state = self.get_block_state(state)
-            if block_fn is not None:
-                block_state = block_fn(block_state, state)
-            self.set_block_state(state, block_state)
-            return components, state
-    
-    return TestBlock
-```
-
-
-Let's create a simple block to see how these definitions interact with the pipeline state. To better understand what's happening, we'll print out the states before and after updates to inspect them:
-
-```py
-inputs = [
-    InputParam(name="image", type_hint="PIL.Image", description="raw input image to process")
-]
-
-intermediate_inputs = [InputParam(name="batch_size", type_hint=int)]
-
-intermediate_outputs = [
-    OutputParam(name="image_latents", description="latents representing the image")
-]
-
-def image_encoder_block_fn(block_state, pipeline_state):
-    print(f"pipeline_state (before update): {pipeline_state}")
-    print(f"block_state (before update): {block_state}")
-    
-    # Simulate processing the image
-    block_state.image = torch.randn(1, 3, 512, 512)
-    block_state.batch_size = block_state.batch_size * 2
-    block_state.processed_image = [torch.randn(1, 3, 512, 512)] * block_state.batch_size
-    block_state.image_latents = torch.randn(1, 4, 64, 64)
-    
-    print(f"block_state (after update): {block_state}")
-    return block_state
-
-# Create a block with our definitions
-image_encoder_block_cls = make_block(
-    inputs=inputs, 
-    intermediate_inputs=intermediate_inputs,
-    intermediate_outputs=intermediate_outputs, 
-    block_fn=image_encoder_block_fn,
-    description=" Encode raw image into its latent presentation"
-)
-image_encoder_block = image_encoder_block_cls()
-pipe = image_encoder_block.init_pipeline()
-```
-
-Let's check the pipeline's docstring to see what inputs it expects:
-```py
->>> print(pipe.doc)
-class TestBlock
-
-  Encode raw image into its latent presentation
-
-  Inputs:
-
-      image (`PIL.Image`, *optional*):
-          raw input image to process
-
-      batch_size (`int`, *optional*):
-
-  Outputs:
-
-      image_latents (`None`):
-          latents representing the image
-```
-
-Notice that `batch_size` appears as an input even though we defined it as an intermediate input. This happens because no previous block provided it, so the pipeline makes it available as a user input. However, unlike regular inputs, this value goes directly into the mutable intermediate state.
-
-Now let's run the pipeline:
-
-```py
-from diffusers.utils import load_image
-
-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/image_of_squirrel_painting.png")
-state = pipe(image=image, batch_size=2)
-print(f"pipeline_state (after update): {state}")
-```
-```out
-pipeline_state (before update): PipelineState(
-  inputs={
-    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494550>
-  },
-  intermediates={
-    batch_size: 2
-  },
-)
-block_state (before update): BlockState(
-    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494640>
-    batch_size: 2
-)
-
-block_state (after update): BlockState(
-    image: Tensor(dtype=torch.float32, shape=torch.Size([1, 3, 512, 512]))
-    batch_size: 4
-    processed_image: List[4] of Tensors with shapes [torch.Size([1, 3, 512, 512]), torch.Size([1, 3, 512, 512]), torch.Size([1, 3, 512, 512]), torch.Size([1, 3, 512, 512])]
-    image_latents: Tensor(dtype=torch.float32, shape=torch.Size([1, 4, 64, 64]))
-)
-pipeline_state (after update): PipelineState(
-  inputs={
-    image: <PIL.Image.Image image mode=RGB size=512x512 at 0x7F3ECC494550>
-  },
-  intermediates={
-    batch_size: 4
-    image_latents: Tensor(dtype=torch.float32, shape=torch.Size([1, 4, 64, 64]))
-  },
-)
-```
-**Key Observations:**
-
-1. **Before the update**: `image` (the input) goes to the immutable inputs dict, while `batch_size` (the intermediate_input) goes to the mutable intermediates dict, and both are available in `block_state`.
-
-2. **After the update**:
-   - **`image` (inputs)** changed in `block_state` but not in `pipeline_state` - this change is local to the block only. 
-   - **`batch_size (intermediate_inputs)`** was updated in both `block_state` and `pipeline_state` - this change affects subsequent blocks (we didn't need to declare it as an intermediate output since it was already in the intermediates dict)
-   - **`image_latents (intermediate_outputs)`** was added to `pipeline_state` because it was declared as an intermediate output
-   - **`processed_image`** was not added to `pipeline_state` because it wasn't declared as an intermediate output
-
-I hope by now you have a basic idea about how `PipelineBlock` manages state through inputs, intermediate inputs, and intermediate outputs. The real power comes when we connect multiple blocks together - their intermediate outputs become intermediate inputs for subsequent blocks, creating modular workflows. Let's explore how to build these connections using multi-blocks like `SequentialPipelineBlocks`.
-
-## Create a `SequentialPipelineBlocks`
-
-I assume that you're already familiar with `SequentialPipelineBlocks` and how to create them with the `from_blocks_dict` API. It's one of the most common ways to use Modular Diffusers, and we've covered it pretty well in the [Getting Started Guide](https://huggingface.co/docs/diffusers/pr_9672/en/modular_diffusers/getting_started#modularpipelineblocks).
-
-But how do blocks actually connect and work together? Understanding this is crucial for building effective modular workflows. Let's explore this through an example.
-
-**How Blocks Connect in SequentialPipelineBlocks:**
-
-The key insight is that blocks connect through their intermediate inputs and outputs - the "studs and anti-studs" we discussed earlier. Let's expand on our example to create a new block that produces `batch_size`, which we'll call "input_block":
-
-```py
-def input_block_fn(block_state, pipeline_state):
-    
-    batch_size = len(block_state.prompt)
-    block_state.batch_size = batch_size * block_state.num_images_per_prompt
-    
-    return block_state
-
-input_block_cls = make_block(
-    inputs=[
-        InputParam(name="prompt", type_hint=list, description="list of text prompts"),
-        InputParam(name="num_images_per_prompt", type_hint=int, description="number of images per prompt")
-    ],
-    intermediate_outputs=[
-        OutputParam(name="batch_size", description="calculated batch size")
-    ],
-    block_fn=input_block_fn,
-    description="A block that determines batch_size based on the number of prompts and num_images_per_prompt argument."
-)
-input_block = input_block_cls()
-```
-
-Now let's connect these blocks to create a pipeline:
-
-```py
-from diffusers.modular_pipelines import SequentialPipelineBlocks, InsertableDict
-# define a dict map block names to block class
-blocks_dict = InsertableDict()
-blocks_dict["input"] = input_block
-blocks_dict["image_encoder"] = image_encoder_block
-# create the multi-block
-blocks = SequentialPipelineBlocks.from_blocks_dict(blocks_dict)
-# convert it to a runnable pipeline
-pipeline = blocks.init_pipeline()
-```
-
-Now you have a pipeline with 2 blocks. 
-
-```py
->>> pipeline.blocks
-SequentialPipelineBlocks(
-  Class: ModularPipelineBlocks
-
-  Description: 
-
-
-  Sub-Blocks:
-    [0] input (TestBlock)
-       Description: A block that determines batch_size based on the number of prompts and num_images_per_prompt argument.
-
-    [1] image_encoder (TestBlock)
-       Description:  Encode raw image into its latent presentation
-
-)
-```
-
-When you inspect `pipeline.doc`, you can see that `batch_size` is not listed as an input. The pipeline automatically detects that the `input_block` can produce `batch_size` for the `image_encoder_block`, so it doesn't ask the user to provide it.
-
-```py
->>> print(pipeline.doc)
-class SequentialPipelineBlocks
-
-  Inputs:
-
-      prompt (`None`, *optional*):
-
-      num_images_per_prompt (`None`, *optional*):
-
-      image (`PIL.Image`, *optional*):
-          raw input image to process
-
-  Outputs:
-
-      batch_size (`None`):
-
-      image_latents (`None`):
-          latents representing the image
-```
-
-At runtime, you have data flow like this:
-
-![Data Flow Diagram](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/modular_quicktour/Editor%20_%20Mermaid%20Chart-2025-06-30-092631.png)
-
-**How SequentialPipelineBlocks Works:**
-
-1. Blocks are executed in the order they're registered in the `blocks_dict`
-2. Outputs from one block become available as intermediate inputs to all subsequent blocks
-3. The pipeline automatically figures out which values need to be provided by the user and which will be generated by previous blocks
-4. Each block maintains its own behavior and operates through its defined interface, while collectively these interfaces determine what the entire pipeline accepts and produces
-
-What happens within each block follows the same pattern we described earlier: each block gets its own `block_state` with the relevant inputs and intermediate inputs, performs its computation, and updates the pipeline state with its intermediate outputs.
-
-## `LoopSequentialPipelineBlocks`
-
-To create a loop in Modular Diffusers, you could use a single `PipelineBlock` like this:
-
-```python
-class DenoiseLoop(PipelineBlock):
-    def __call__(self, components, state):
-        block_state = self.get_block_state(state)
-        for t in range(block_state.num_inference_steps):
-            # ... loop logic here
-            pass
-        self.set_block_state(state, block_state)
-        return components, state
-```
-
-Or you could create a `LoopSequentialPipelineBlocks`. The key difference is that with `LoopSequentialPipelineBlocks`, the loop itself is modular: you can add or remove blocks within the loop or reuse the same loop structure with different block combinations.
-
-It involves two parts: a **loop wrapper** and **loop blocks**
-
-* The **loop wrapper** (`LoopSequentialPipelineBlocks`) defines the loop structure, e.g. it defines the iteration variables, and loop configurations such as progress bar.
-
-* The **loop blocks** are basically standard pipeline blocks you add to the loop wrapper.
-  - they run sequentially for each iteration of the loop
-  - they receive the current iteration index as an additional parameter
-  - they share the same block_state throughout the entire loop
-
-Unlike regular `SequentialPipelineBlocks` where each block gets its own state, loop blocks share a single state that persists and evolves across iterations.
-
-We will build a simple loop block to demonstrate these concepts. Creating a loop block involves three steps:
-1. defining the loop wrapper class
-2. creating the loop blocks
-3. adding the loop blocks to the loop wrapper class to create the loop wrapper instance
-
-**Step 1: Define the Loop Wrapper**
-
-To create a `LoopSequentialPipelineBlocks` class, you need to define:
-
-* `loop_inputs`: User input variables (equivalent to `PipelineBlock.inputs`)
-* `loop_intermediate_inputs`: Intermediate variables needed from the mutable pipeline state (equivalent to `PipelineBlock.intermediates_inputs`)
-* `loop_intermediate_outputs`: New intermediate variables this block will add to the mutable pipeline state (equivalent to `PipelineBlock.intermediates_outputs`)
-* `__call__` method: Defines the loop structure and iteration logic
-
-Here is an example of a loop wrapper:
-
-```py
-import torch
-from diffusers.modular_pipelines import LoopSequentialPipelineBlocks, PipelineBlock, InputParam, OutputParam
-
-class LoopWrapper(LoopSequentialPipelineBlocks):
-    model_name = "test"
-    @property
-    def description(self):
-        return "I'm a loop!!"
-    @property
-    def loop_inputs(self):
-        return [InputParam(name="num_steps")]
-    @torch.no_grad()
-    def __call__(self, components, state):
-        block_state = self.get_block_state(state)
-        # Loop structure - can be customized to your needs
-        for i in range(block_state.num_steps):
-            # loop_step executes all registered blocks in sequence
-            components, block_state = self.loop_step(components, block_state, i=i)
-        self.set_block_state(state, block_state)
-        return components, state
-```
-
-**Step 2: Create Loop Blocks**
-
-Loop blocks are standard `PipelineBlock`s, but their `__call__` method works differently:
-* It receives the iteration variable (e.g., `i`) passed by the loop wrapper
-* It works directly with `block_state` instead of pipeline state
-* No need to call `self.get_block_state()` or `self.set_block_state()`
-
-```py
-class LoopBlock(PipelineBlock):
-    # this is used to identify the model family, we won't worry about it in this example
-    model_name = "test"
-    @property
-    def inputs(self):
-        return [InputParam(name="x")]
-    @property
-    def intermediate_outputs(self):
-        # outputs produced by this block
-        return [OutputParam(name="x")]
-    @property
-    def description(self):
-        return "I'm a block used inside the `LoopWrapper` class"
-    def __call__(self, components, block_state, i: int):
-        block_state.x += 1
-        return components, block_state
-```
-
-**Step 3: Combine Everything**
-
-Finally, assemble your loop by adding the block(s) to the wrapper:
-
-```py
-loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock})
-```
-
-Now you've created a loop with one step:
-
-```py
->>> loop
-LoopWrapper(
-  Class: LoopSequentialPipelineBlocks
-
-  Description: I'm a loop!!
-
-  Sub-Blocks:
-    [0] block1 (LoopBlock)
-       Description: I'm a block used inside the `LoopWrapper` class
-
-)
-```
-
-It has two inputs: `x` (used at each step within the loop) and `num_steps` used to define the loop.
-
-```py
->>> print(loop.doc)
-class LoopWrapper
-
-  I'm a loop!!
-
-  Inputs:
-
-      x (`None`, *optional*):
-
-      num_steps (`None`, *optional*):
-
-  Outputs:
-
-      x (`None`):
-```
-
-**Running the Loop:**
-
-```py
-# run the loop
-loop_pipeline = loop.init_pipeline()
-x = loop_pipeline(num_steps=10, x=0, output="x")
-assert x == 10
-```
-
-**Adding Multiple Blocks:**
-
-We can add multiple blocks to run within each iteration. Let's run the loop block twice within each iteration:
-
-```py
-loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock(), "block2": LoopBlock})
-loop_pipeline = loop.init_pipeline()
-x = loop_pipeline(num_steps=10, x=0, output="x")
-assert x == 20  # Each iteration runs 2 blocks, so 10 iterations * 2 = 20
-```
-
-**Key Differences from SequentialPipelineBlocks:**
-
-The main difference is that loop blocks share the same `block_state` across all iterations, allowing values to accumulate and evolve throughout the loop. Loop blocks could receive additional arguments (like the current iteration index) depending on the loop wrapper's implementation, since the wrapper defines how loop blocks are called. You can easily add, remove, or reorder blocks within the loop without changing the loop logic itself.
-
-The officially supported denoising loops in Modular Diffusers are implemented using `LoopSequentialPipelineBlocks`. You can explore the actual implementation to see how these concepts work in practice:
-
-```py
-from diffusers.modular_pipelines.stable_diffusion_xl.denoise import StableDiffusionXLDenoiseStep
-StableDiffusionXLDenoiseStep()
-```
-
-## `AutoPipelineBlocks`
-
-`AutoPipelineBlocks` allows you to pack different pipelines into one and automatically select which one to run at runtime based on the inputs. The main purpose is convenience and portability - for developers, you can package everything into one workflow, making it easier to share and use.
-
-For example, you might want to support text-to-image and image-to-image tasks. Instead of creating two separate pipelines, you can create an `AutoPipelineBlocks` that automatically chooses the workflow based on whether an `image` input is provided.
-
-Let's see an example. Here we'll create a dummy `AutoPipelineBlocks` that includes dummy text-to-image, image-to-image, and inpaint pipelines.
-
-
-```py
-from diffusers.modular_pipelines import AutoPipelineBlocks 
-
-# These are dummy blocks and we only focus on "inputs" for our purpose
-inputs = [InputParam(name="prompt")]
-# block_fn prints out which workflow is running so we can see the execution order at runtime
-block_fn = lambda x, y: print("running the text-to-image workflow")
-block_t2i_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a text-to-image workflow!")
-
-inputs = [InputParam(name="prompt"), InputParam(name="image")]
-block_fn = lambda x, y: print("running the image-to-image workflow")
-block_i2i_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a image-to-image workflow!")
-
-inputs = [InputParam(name="prompt"), InputParam(name="image"), InputParam(name="mask")]
-block_fn = lambda x, y: print("running the inpaint workflow")
-block_inpaint_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a inpaint workflow!")
-
-class AutoImageBlocks(AutoPipelineBlocks):
-    # List of sub-block classes to choose from
-    block_classes = [block_inpaint_cls, block_i2i_cls, block_t2i_cls]
-    # Names for each block in the same order
-    block_names = ["inpaint", "img2img", "text2img"]
-    # Trigger inputs that determine which block to run
-    # - "mask" triggers inpaint workflow
-    # - "image" triggers img2img workflow (but only if mask is not provided) 
-    # - if none of above, runs the text2img workflow (default)
-    block_trigger_inputs = ["mask", "image", None]
-    # Description is extremely important for AutoPipelineBlocks
-    @property
-    def description(self):
-        return (
-            "Pipeline generates images given different types of conditions!\n"
-            + "This is an auto pipeline block that works for text2img, img2img and inpainting tasks.\n"
-            + " - inpaint workflow is run when `mask` is provided.\n"
-            + " - img2img workflow is run when `image` is provided (but only when `mask` is not provided).\n"
-            + " - text2img workflow is run when neither `image` nor `mask` is provided.\n"
-        )
-
-# Create the blocks
-auto_blocks = AutoImageBlocks()
-# convert to pipeline
-auto_pipeline = auto_blocks.init_pipeline()
-```
-
-Now we have created an `AutoPipelineBlocks` that contains 3 sub-blocks. Notice the warning message at the top - this automatically appears in every `ModularPipelineBlocks` that contains `AutoPipelineBlocks` to remind end users that dynamic block selection happens at runtime. 
-
-```py
-AutoImageBlocks(
-  Class: AutoPipelineBlocks
-
-  ====================================================================================================
-  This pipeline contains blocks that are selected at runtime based on inputs.
-  Trigger Inputs: ['mask', 'image']
-  ====================================================================================================
-
-
-  Description: Pipeline generates images given different types of conditions!
-      This is an auto pipeline block that works for text2img, img2img and inpainting tasks.
-       - inpaint workflow is run when `mask` is provided.
-       - img2img workflow is run when `image` is provided (but only when `mask` is not provided).
-       - text2img workflow is run when neither `image` nor `mask` is provided.
-      
-
-
-  Sub-Blocks:
-    • inpaint [trigger: mask] (TestBlock)
-       Description: I'm a inpaint workflow!
-
-    • img2img [trigger: image] (TestBlock)
-       Description: I'm a image-to-image workflow!
-
-    • text2img [default] (TestBlock)
-       Description: I'm a text-to-image workflow!
-
-)
-```
-
-Check out the documentation with `print(auto_pipeline.doc)`:
-
-```py
->>> print(auto_pipeline.doc)
-class AutoImageBlocks
-
-  Pipeline generates images given different types of conditions!
-  This is an auto pipeline block that works for text2img, img2img and inpainting tasks.
-   - inpaint workflow is run when `mask` is provided.
-   - img2img workflow is run when `image` is provided (but only when `mask` is not provided).
-   - text2img workflow is run when neither `image` nor `mask` is provided.
-
-  Inputs:
-
-      prompt (`None`, *optional*):
-
-      image (`None`, *optional*):
-
-      mask (`None`, *optional*):
-```
-
-There is a fundamental trade-off of AutoPipelineBlocks: it trades clarity for convenience. While it is really easy for packaging multiple workflows, it can become confusing without proper documentation. e.g. if we just throw a pipeline at you and tell you that it contains 3 sub-blocks and takes 3 inputs `prompt`, `image` and `mask`, and ask you to run an image-to-image workflow: if you don't have any prior knowledge on how these pipelines work, you would be pretty clueless, right?
-
-This pipeline we just made though, has a docstring that shows all available inputs and workflows and explains how to use each with different inputs. So it's really helpful for users. For example, it's clear that you need to pass `image` to run img2img. This is why the description field is absolutely critical for AutoPipelineBlocks. We highly recommend you to explain the conditional logic very well for each `AutoPipelineBlocks` you would make. We also recommend to always test individual pipelines first before packaging them into AutoPipelineBlocks. 
-
-Let's run this auto pipeline with different inputs to see if the conditional logic works as described. Remember that we have added `print` in each `PipelineBlock`'s `__call__` method to print out its workflow name, so it should be easy to tell which one is running:
-
-```py
->>> _ = auto_pipeline(image="image", mask="mask")
-running the inpaint workflow
->>> _ = auto_pipeline(image="image")
-running the image-to-image workflow
->>> _ = auto_pipeline(prompt="prompt")
-running the text-to-image workflow
->>> _ = auto_pipeline(image="prompt", mask="mask")
-running the inpaint workflow
-```
-
-However, even with documentation, it can become very confusing when AutoPipelineBlocks are combined with other blocks. The complexity grows quickly when you have nested AutoPipelineBlocks or use them as sub-blocks in larger pipelines.
-
-Let's make another `AutoPipelineBlocks` - this one only contains one block, and it does not include `None` in its `block_trigger_inputs` (which corresponds to the default block to run when none of the trigger inputs are provided). This means this block will be skipped if the trigger input (`ip_adapter_image`) is not provided at runtime.
-
-```py
-from diffusers.modular_pipelines import SequentialPipelineBlocks, InsertableDict
-inputs = [InputParam(name="ip_adapter_image")]
-block_fn = lambda x, y: print("running the ip-adapter workflow")
-block_ipa_cls = make_block(inputs=inputs, block_fn=block_fn, description="I'm a IP-adapter workflow!")
-
-class AutoIPAdapter(AutoPipelineBlocks):
-    block_classes = [block_ipa_cls]
-    block_names = ["ip-adapter"]
-    block_trigger_inputs = ["ip_adapter_image"]
-    @property
-    def description(self):
-        return "Run IP Adapter step if `ip_adapter_image` is provided."
-```
-
-Now let's combine these 2 auto blocks together into a `SequentialPipelineBlocks`:
-
-```py
-auto_ipa_blocks = AutoIPAdapter()
-blocks_dict = InsertableDict()
-blocks_dict["ip-adapter"] = auto_ipa_blocks
-blocks_dict["image-generation"] = auto_blocks
-all_blocks = SequentialPipelineBlocks.from_blocks_dict(blocks_dict)
-pipeline = all_blocks.init_pipeline()
-```
-
-Let's take a look: now things get more confusing. In this particular example, you could still try to explain the conditional logic in the `description` field here - there are only 4 possible execution paths so it's doable. However, since this is a `SequentialPipelineBlocks` that could contain many more blocks, the complexity can quickly get out of hand as the number of blocks increases.
-
-```py
->>> all_blocks
-SequentialPipelineBlocks(
-  Class: ModularPipelineBlocks
-
-  ====================================================================================================
-  This pipeline contains blocks that are selected at runtime based on inputs.
-  Trigger Inputs: ['image', 'mask', 'ip_adapter_image']
-  Use `get_execution_blocks()` with input names to see selected blocks (e.g. `get_execution_blocks('image')`).
-  ====================================================================================================
-
-
-  Description: 
-
-
-  Sub-Blocks:
-    [0] ip-adapter (AutoIPAdapter)
-       Description: Run IP Adapter step if `ip_adapter_image` is provided.
-                   
-
-    [1] image-generation (AutoImageBlocks)
-       Description: Pipeline generates images given different types of conditions!
-                   This is an auto pipeline block that works for text2img, img2img and inpainting tasks.
-                    - inpaint workflow is run when `mask` is provided.
-                    - img2img workflow is run when `image` is provided (but only when `mask` is not provided).
-                    - text2img workflow is run when neither `image` nor `mask` is provided.
-                   
-
-)
-
-```
-
-This is when the `get_execution_blocks()` method comes in handy - it basically extracts a `SequentialPipelineBlocks` that only contains the blocks that are actually run based on your inputs.
-
-Let's try some examples:
-
-`mask`: we expect it to skip the first ip-adapter since `ip_adapter_image` is not provided, and then run the inpaint for the second block.
-
-```py
->>> all_blocks.get_execution_blocks('mask')
-SequentialPipelineBlocks(
-  Class: ModularPipelineBlocks
-
-  Description: 
-
-
-  Sub-Blocks:
-    [0] image-generation (TestBlock)
-       Description: I'm a inpaint workflow!
-
-)
-```
-
-Let's also actually run the pipeline to confirm:
-
-```py
->>> _ = pipeline(mask="mask")
-skipping auto block: AutoIPAdapter
-running the inpaint workflow
-```
-
-Try a few more:
-
-```py
-print(f"inputs: ip_adapter_image:")
-blocks_select = all_blocks.get_execution_blocks('ip_adapter_image')
-print(f"expected_execution_blocks: {blocks_select}")
-print(f"actual execution blocks:")
-_ = pipeline(ip_adapter_image="ip_adapter_image", prompt="prompt")
-# expect to see ip-adapter + text2img
-
-print(f"inputs: image:")
-blocks_select = all_blocks.get_execution_blocks('image')
-print(f"expected_execution_blocks: {blocks_select}")
-print(f"actual execution blocks:")
-_ = pipeline(image="image", prompt="prompt")
-# expect to see img2img
-
-print(f"inputs: prompt:")
-blocks_select = all_blocks.get_execution_blocks('prompt')
-print(f"expected_execution_blocks: {blocks_select}")
-print(f"actual execution blocks:")
-_ = pipeline(prompt="prompt")
-# expect to see text2img (prompt is not a trigger input so fallback to default)
-
-print(f"inputs: mask + ip_adapter_image:")
-blocks_select = all_blocks.get_execution_blocks('mask','ip_adapter_image')
-print(f"expected_execution_blocks: {blocks_select}")
-print(f"actual execution blocks:")
-_ = pipeline(mask="mask", ip_adapter_image="ip_adapter_image")
-# expect to see ip-adapter + inpaint
-```
-
-In summary, `AutoPipelineBlocks` is a good tool for packaging multiple workflows into a single, convenient interface and it can greatly simplify the user experience. However, always provide clear descriptions explaining the conditional logic, test individual pipelines first before combining them, and use `get_execution_blocks()` to understand runtime behavior in complex compositions.
@@ -174,39 +174,36 @@ Feel free to open an issue if dynamic compilation doesn't work as expected for a

 ### Regional compilation

+[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by only compiling the *small and frequently-repeated block(s)* of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence.
+For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x.

-[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by compiling **only the small, frequently-repeated block(s)** of a model, typically a Transformer layer, enabling reuse of compiled artifacts for every subsequent occurrence.
-For many diffusion architectures this delivers the *same* runtime speed-ups as full-graph compilation yet cuts compile time by **8–10 ×**.
-
-To make this effortless, [`ModelMixin`] exposes [`ModelMixin.compile_repeated_blocks`] API, a helper that wraps `torch.compile` around any sub-modules you designate as repeatable:
+Use the [`~ModelMixin.compile_repeated_blocks`] method, a helper that wraps `torch.compile`, on any component such as the transformer model as shown below.

 ```py
 # pip install -U diffusers
 import torch
 from diffusers import StableDiffusionXLPipeline

-pipe = StableDiffusionXLPipeline.from_pretrained(
+pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
 ).to("cuda")

-# Compile only the repeated Transformer layers inside the UNet
-pipe.unet.compile_repeated_blocks(fullgraph=True)
+# compile only the repeated transformer layers inside the UNet
+pipeline.unet.compile_repeated_blocks(fullgraph=True)
 ```

-To enable a new model with regional compilation, add a `_repeated_blocks` attribute to your model class containing the class names (as strings) of the blocks you want compiled:
-
+To enable regional compilation for a new model, add a `_repeated_blocks` attribute to a model class containing the class names (as strings) of the blocks you want to compile.

 ```py
 class MyUNet(ModelMixin):
    _repeated_blocks = ("Transformer2DModel",)  # ← compiled by default
 ```

-For more examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705).
-
-**Relation to Accelerate compile_regions** There is also a separate API in [accelerate](https://huggingface.co/docs/accelerate/index) - [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78). It takes a fully automatic approach: it walks the module, picks candidate blocks, then compiles the remaining graph separately. That hands-off experience is handy for quick experiments, but it also leaves fewer knobs when you want to fine-tune which blocks are compiled or adjust compilation flags.
-
+> [!TIP]
+> For more regional compilation examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705).

+There is also a [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method in [Accelerate](https://huggingface.co/docs/accelerate/index) that automatically selects candidate blocks in a model to compile. The remaining graph is compiled separately. This hands-off approach is useful for quick experiments, but it gives you fewer options for choosing which blocks to compile or adjusting compilation flags.

 ```py
 # pip install -U accelerate
@@ -219,8 +216,8 @@ pipeline = StableDiffusionXLPipeline.from_pretrained(
 ).to("cuda")
 pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True)
 ```
-`compile_repeated_blocks`, by contrast, is intentionally explicit. You list the repeated blocks once (via `_repeated_blocks`) and the helper compiles exactly those, nothing more. In practice this small dose of control hits a sweet spot for diffusion models: predictable behavior, easy reasoning about cache reuse, and still a one-liner for users.

+[`~ModelMixin.compile_repeated_blocks`] is intentionally explicit. List the repeated blocks once in `_repeated_blocks` and the helper compiles only those blocks. This offers predictable behavior, easy reasoning about cache reuse, and a single line of code for users.

 ### Graph breaks

@@ -296,3 +293,9 @@ An input is projected into three subspaces, represented by the projection matric
 ```py
 pipeline.fuse_qkv_projections()
 ```
+
+## Resources
+
+- Read the [Presenting Flux Fast: Making Flux go brrr on H100s](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/) blog post to learn more about how you can combine all of these optimizations with [TorchInductor](https://docs.pytorch.org/docs/stable/torch.compiler.html) and [AOTInductor](https://docs.pytorch.org/docs/stable/torch.compiler_aot_inductor.html) for a ~2.5x speedup using recipes from [flux-fast](https://github.com/huggingface/flux-fast).
+
+    These recipes support AMD hardware and [Flux.1 Kontext Dev](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev).
@@ -14,6 +14,9 @@ specific language governing permissions and limitations under the License.

 Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory-usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers. A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading).

+> [!TIP]
+> Check the [torch.compile](./fp16#torchcompile) guide to learn more about compilation techniques and how they can be applied here. For example, regional compilation can significantly reduce compilation time without giving up any speedups.
+
 For image generation, combining quantization and [model offloading](./memory#model-offloading) can often give the best trade-off between quality, speed, and memory. Group offloading is not as effective for image generation because it is usually not possible to *fully* overlap data transfer if the compute kernel finishes faster. This results in some communication overhead between the CPU and GPU.

 For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound. 
@@ -25,7 +28,7 @@ The table below provides a comparison of optimization strategy combinations and
 | quantization  | 32.602 | 14.9453 |
 | quantization, torch.compile  | 25.847 | 14.9448 |
 | quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |
-<small>These results are benchmarked on Flux with a RTX 4090. The transformer and text_encoder components are quantized. Refer to the <a href="https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d" benchmarking script</a> if you're interested in evaluating your own model.</small>
+<small>These results are benchmarked on Flux with an RTX 4090. The transformer and text_encoder components are quantized. Refer to the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) if you're interested in evaluating your own model.</small>

 This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.

@@ -87,6 +87,7 @@ PIXART-α Controlnet pipeline | Implementation of the controlnet model for pixar
 | CogVideoX DDIM Inversion Pipeline | Implementation of DDIM inversion and guided attention-based editing denoising process on CogVideoX. | [CogVideoX DDIM Inversion Pipeline](#cogvideox-ddim-inversion-pipeline) | - | [LittleNyima](https://github.com/LittleNyima) |
 | FaithDiff Stable Diffusion XL Pipeline | Implementation of [(CVPR 2025) FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolutionUnleashing Diffusion Priors for Faithful Image Super-resolution](https://huggingface.co/papers/2411.18824) - FaithDiff is a faithful image super-resolution method that leverages latent diffusion models by actively adapting the diffusion prior and jointly fine-tuning its components (encoder and diffusion model) with an alignment module to ensure high fidelity and structural consistency. | [FaithDiff Stable Diffusion XL Pipeline](#faithdiff-stable-diffusion-xl-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/jychen9811/FaithDiff) | [Junyang Chen, Jinshan Pan, Jiangxin Dong, IMAG Lab, (Adapted by Eliseu Silva)](https://github.com/JyChen9811/FaithDiff) |
 | Stable Diffusion 3 InstructPix2Pix Pipeline | Implementation of Stable Diffusion 3 InstructPix2Pix Pipeline | [Stable Diffusion 3 InstructPix2Pix Pipeline](#stable-diffusion-3-instructpix2pix-pipeline) | [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/BleachNick/SD3_UltraEdit_freeform) [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/CaptainZZZ/sd3-instructpix2pix) | [Jiayu Zhang](https://github.com/xduzhangjiayu) and [Haozhe Zhao](https://github.com/HaozheZhao)|
+| Flux Kontext multiple images | A modified version of the `FluxKontextPipeline` that supports calling Flux Kontext with multiple reference images.| [Flux Kontext multiple input Pipeline](#flux-kontext-multiple-images) | - |  [Net-Mist](https://github.com/Net-Mist) |
 To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly.

 ```py
@@ -5479,4 +5480,48 @@ edited_image.save("edited_image.png")
 ### Note
 This model is trained on 512x512, so input size is better on 512x512.
 For better editing performance, please refer to this powerful model https://huggingface.co/BleachNick/SD3_UltraEdit_freeform and Paper "UltraEdit: Instruction-based Fine-Grained Image
-Editing at Scale", many thanks to their contribution!
+Editing at Scale", many thanks to their contribution!
+
+# Flux Kontext multiple images
+
+This implementation of Flux Kontext allows users to pass multiple reference images. Each image is encoded separately, and the resulting latent vectors are concatenated.
+
+As explained in Section 3 of [the paper](https://arxiv.org/pdf/2506.15742), the model's sequence concatenation mechanism can extend its capabilities to handle multiple reference images. However, note that the current version of Flux Kontext was not trained for this use case. In practice, stacking along the first axis does not yield correct results, while stacking along the other two axes appears to work.
+
+## Example Usage
+
+The example below loads two reference images and generates a new image based on them.
+
+```python
+import torch
+
+from diffusers import FluxKontextPipeline
+from diffusers.utils import load_image
+
+
+pipe = FluxKontextPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-Kontext-dev",
+    torch_dtype=torch.bfloat16,
+    custom_pipeline="pipeline_flux_kontext_multiple_images",
+)
+pipe.to("cuda")
+
+pikachu_image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png"
+).convert("RGB")
+cat_image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
+).convert("RGB")
+
+
+prompts = [
+    "Pikachu and the cat are sitting together at a pizzeria table, enjoying a delicious pizza.",
+]
+images = pipe(
+    multiple_images=[(pikachu_image, cat_image)],
+    prompt=prompts,
+    guidance_scale=2.5,
+    generator=torch.Generator().manual_seed(42),
+).images
+images[0].save("pizzeria.png")
+```
@@ -1330,7 +1330,7 @@ def main(args):
                # controlnet(s) inference
                controlnet_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
                controlnet_image = vae.encode(controlnet_image).latent_dist.sample()
-                controlnet_image = controlnet_image * vae.config.scaling_factor
+                controlnet_image = (controlnet_image - vae.config.shift_factor) * vae.config.scaling_factor

                control_block_res_samples = controlnet(
                    hidden_states=noisy_model_input,
@@ -1,4 +1,4 @@
-torch~=2.4.0
+torch~=2.7.0
 transformers==4.46.1
 sentencepiece
 aiohttp
@@ -1,10 +1,10 @@
 # This file was autogenerated by uv via the following command:
 #    uv pip compile requirements.in -o requirements.txt
-aiohappyeyeballs==2.4.3
+aiohappyeyeballs==2.6.1
    # via aiohttp
-aiohttp==3.10.10
+aiohttp==3.12.14
    # via -r requirements.in
-aiosignal==1.3.1
+aiosignal==1.4.0
    # via aiohttp
 annotated-types==0.7.0
    # via pydantic
@@ -29,7 +29,6 @@ filelock==3.16.1
    #   huggingface-hub
    #   torch
    #   transformers
-    #   triton
 frozenlist==1.5.0
    # via
    #   aiohttp
@@ -63,36 +62,42 @@ networkx==3.2.1
    # via torch
 numpy==2.0.2
    # via transformers
-nvidia-cublas-cu12==12.1.3.1
+nvidia-cublas-cu12==12.6.4.1
    # via
    #   nvidia-cudnn-cu12
    #   nvidia-cusolver-cu12
    #   torch
-nvidia-cuda-cupti-cu12==12.1.105
+nvidia-cuda-cupti-cu12==12.6.80
    # via torch
-nvidia-cuda-nvrtc-cu12==12.1.105
+nvidia-cuda-nvrtc-cu12==12.6.77
    # via torch
-nvidia-cuda-runtime-cu12==12.1.105
+nvidia-cuda-runtime-cu12==12.6.77
    # via torch
-nvidia-cudnn-cu12==9.1.0.70
+nvidia-cudnn-cu12==9.5.1.17
    # via torch
-nvidia-cufft-cu12==11.0.2.54
+nvidia-cufft-cu12==11.3.0.4
    # via torch
-nvidia-curand-cu12==10.3.2.106
+nvidia-cufile-cu12==1.11.1.6
    # via torch
-nvidia-cusolver-cu12==11.4.5.107
+nvidia-curand-cu12==10.3.7.77
    # via torch
-nvidia-cusparse-cu12==12.1.0.106
+nvidia-cusolver-cu12==11.7.1.2
+    # via torch
+nvidia-cusparse-cu12==12.5.4.2
    # via
    #   nvidia-cusolver-cu12
    #   torch
-nvidia-nccl-cu12==2.20.5
+nvidia-cusparselt-cu12==0.6.3
    # via torch
-nvidia-nvjitlink-cu12==12.9.86
+nvidia-nccl-cu12==2.26.2
+    # via torch
+nvidia-nvjitlink-cu12==12.6.85
    # via
+    #   nvidia-cufft-cu12
    #   nvidia-cusolver-cu12
    #   nvidia-cusparse-cu12
-nvidia-nvtx-cu12==12.1.105
+    #   torch
+nvidia-nvtx-cu12==12.6.77
    # via torch
 packaging==24.1
    # via
@@ -105,7 +110,9 @@ prometheus-client==0.21.0
 prometheus-fastapi-instrumentator==7.0.0
    # via -r requirements.in
 propcache==0.2.0
-    # via yarl
+    # via
+    #   aiohttp
+    #   yarl
 py-consul==1.5.3
    # via -r requirements.in
 pydantic==2.9.2
@@ -137,7 +144,7 @@ sympy==1.13.3
    # via torch
 tokenizers==0.20.1
    # via transformers
-torch==2.4.1
+torch==2.7.0
    # via -r requirements.in
 tqdm==4.66.5
    # via
@@ -145,10 +152,11 @@ tqdm==4.66.5
    #   transformers
 transformers==4.46.1
    # via -r requirements.in
-triton==3.0.0
+triton==3.3.0
    # via torch
 typing-extensions==4.12.2
    # via
+    #   aiosignal
    #   anyio
    #   exceptiongroup
    #   fastapi
@@ -163,5 +171,5 @@ urllib3==2.5.0
    # via requests
 uvicorn==0.32.0
    # via -r requirements.in
-yarl==1.16.0
+yarl==1.18.3
    # via aiohttp
@@ -110,7 +110,7 @@ _deps = [
    "jax>=0.4.1",
    "jaxlib>=0.4.1",
    "Jinja2",
-    "k-diffusion>=0.0.12",
+    "k-diffusion==0.0.12",
    "torchsde",
    "note_seq",
    "librosa",
@@ -40,6 +40,7 @@ _import_structure = {
    "models": [],
    "modular_pipelines": [],
    "pipelines": [],
+    "quantizers.pipe_quant_config": ["PipelineQuantizationConfig"],
    "quantizers.quantization_config": [],
    "schedulers": [],
    "utils": [
@@ -207,3 +207,38 @@ class IPAdapterScaleCutoffCallback(PipelineCallback):
        if step_index == cutoff_step:
            pipeline.set_ip_adapter_scale(0.0)
        return callback_kwargs
+
+
+class SD3CFGCutoffCallback(PipelineCallback):
+    """
+    Callback function for Stable Diffusion 3 Pipelines. After a certain number of steps (set by `cutoff_step_ratio` or
+    `cutoff_step_index`), this callback will disable the CFG.
+
+    Note: This callback mutates the pipeline by changing the `_guidance_scale` attribute to 0.0 after the cutoff step.
+    """
+
+    tensor_inputs = ["prompt_embeds", "pooled_prompt_embeds"]
+
+    def callback_fn(self, pipeline, step_index, timestep, callback_kwargs) -> Dict[str, Any]:
+        cutoff_step_ratio = self.config.cutoff_step_ratio
+        cutoff_step_index = self.config.cutoff_step_index
+
+        # Use cutoff_step_index if it's not None, otherwise use cutoff_step_ratio
+        cutoff_step = (
+            cutoff_step_index if cutoff_step_index is not None else int(pipeline.num_timesteps * cutoff_step_ratio)
+        )
+
+        if step_index == cutoff_step:
+            prompt_embeds = callback_kwargs[self.tensor_inputs[0]]
+            prompt_embeds = prompt_embeds[-1:]  # "-1" denotes the embeddings for conditional text tokens.
+
+            pooled_prompt_embeds = callback_kwargs[self.tensor_inputs[1]]
+            pooled_prompt_embeds = pooled_prompt_embeds[
+                -1:
+            ]  # "-1" denotes the embeddings for conditional pooled text tokens.
+
+            pipeline._guidance_scale = 0.0
+
+            callback_kwargs[self.tensor_inputs[0]] = prompt_embeds
+            callback_kwargs[self.tensor_inputs[1]] = pooled_prompt_embeds
+        return callback_kwargs
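A minimal sketch of how the cutoff step in the callback above is resolved, written in plain Python with no diffusers dependency. `resolve_cutoff_step` is a hypothetical helper name used only for illustration; it mirrors the expression in `callback_fn`, where an explicit index (including `0`) takes precedence over the ratio.

```python
from typing import Optional


def resolve_cutoff_step(num_timesteps: int, cutoff_step_ratio: float, cutoff_step_index: Optional[int]) -> int:
    # Hypothetical helper: same precedence as in SD3CFGCutoffCallback.callback_fn.
    # An explicit index wins whenever it is not None, so index 0 is respected.
    if cutoff_step_index is not None:
        return cutoff_step_index
    # Otherwise the ratio is scaled by the number of timesteps and truncated.
    return int(num_timesteps * cutoff_step_ratio)


ratio_based = resolve_cutoff_step(28, 0.5, None)
index_based = resolve_cutoff_step(28, 0.5, 3)
zero_index = resolve_cutoff_step(28, 0.5, 0)
print(ratio_based, index_based, zero_index)
```

The `is not None` check matters here: a truthiness check would silently ignore an index of `0` and fall back to the ratio.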
@@ -17,7 +17,7 @@ deps = {
    "jax": "jax>=0.4.1",
    "jaxlib": "jaxlib>=0.4.1",
    "Jinja2": "Jinja2",
-    "k-diffusion": "k-diffusion>=0.0.12",
+    "k-diffusion": "k-diffusion==0.0.12",
    "torchsde": "torchsde",
    "note_seq": "note_seq",
    "librosa": "librosa",
@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import hashlib
 import os
 from contextlib import contextmanager, nullcontext
 from dataclasses import dataclass
@@ -37,7 +38,7 @@ logger = get_logger(__name__)  # pylint: disable=invalid-name
 _GROUP_OFFLOADING = "group_offloading"
 _LAYER_EXECUTION_TRACKER = "layer_execution_tracker"
 _LAZY_PREFETCH_GROUP_OFFLOADING = "lazy_prefetch_group_offloading"
-
+_GROUP_ID_LAZY_LEAF = "lazy_leafs"
 _SUPPORTED_PYTORCH_LAYERS = (
    torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d,
    torch.nn.ConvTranspose1d, torch.nn.ConvTranspose2d, torch.nn.ConvTranspose3d,
@@ -82,6 +83,7 @@ class ModuleGroup:
        low_cpu_mem_usage: bool = False,
        onload_self: bool = True,
        offload_to_disk_path: Optional[str] = None,
+        group_id: Optional[str] = None,
    ) -> None:
        self.modules = modules
        self.offload_device = offload_device
@@ -100,7 +102,10 @@ class ModuleGroup:
        self._is_offloaded_to_disk = False

        if self.offload_to_disk_path:
-            self.safetensors_file_path = os.path.join(self.offload_to_disk_path, f"group_{id(self)}.safetensors")
+            # Instead of `group_id or str(id(self))` we do this because `group_id` can be "" as well.
+            self.group_id = group_id if group_id is not None else str(id(self))
+            short_hash = _compute_group_hash(self.group_id)
+            self.safetensors_file_path = os.path.join(self.offload_to_disk_path, f"group_{short_hash}.safetensors")

            all_tensors = []
            for module in self.modules:
@@ -609,6 +614,7 @@ def _apply_group_offloading_block_level(module: torch.nn.Module, config: GroupOf

        for i in range(0, len(submodule), config.num_blocks_per_group):
            current_modules = submodule[i : i + config.num_blocks_per_group]
+            group_id = f"{name}_{i}_{i + len(current_modules) - 1}"
            group = ModuleGroup(
                modules=current_modules,
                offload_device=config.offload_device,
@@ -621,6 +627,7 @@ def _apply_group_offloading_block_level(module: torch.nn.Module, config: GroupOf
                record_stream=config.record_stream,
                low_cpu_mem_usage=config.low_cpu_mem_usage,
                onload_self=True,
+                group_id=group_id,
            )
            matched_module_groups.append(group)
            for j in range(i, i + len(current_modules)):
@@ -655,6 +662,7 @@ def _apply_group_offloading_block_level(module: torch.nn.Module, config: GroupOf
        stream=None,
        record_stream=False,
        onload_self=True,
+        group_id=f"{module.__class__.__name__}_unmatched_group",
    )
    if config.stream is None:
        _apply_group_offloading_hook(module, unmatched_group, None, config=config)
@@ -686,6 +694,7 @@ def _apply_group_offloading_leaf_level(module: torch.nn.Module, config: GroupOff
            record_stream=config.record_stream,
            low_cpu_mem_usage=config.low_cpu_mem_usage,
            onload_self=True,
+            group_id=name,
        )
        _apply_group_offloading_hook(submodule, group, None, config=config)
        modules_with_group_offloading.add(name)
@@ -732,6 +741,7 @@ def _apply_group_offloading_leaf_level(module: torch.nn.Module, config: GroupOff
            record_stream=config.record_stream,
            low_cpu_mem_usage=config.low_cpu_mem_usage,
            onload_self=True,
+            group_id=name,
        )
        _apply_group_offloading_hook(parent_module, group, None, config=config)

@@ -753,6 +763,7 @@ def _apply_group_offloading_leaf_level(module: torch.nn.Module, config: GroupOff
            record_stream=False,
            low_cpu_mem_usage=config.low_cpu_mem_usage,
            onload_self=True,
+            group_id=_GROUP_ID_LAZY_LEAF,
        )
        _apply_lazy_group_offloading_hook(module, unmatched_group, None, config=config)

@@ -873,6 +884,12 @@ def _get_group_onload_device(module: torch.nn.Module) -> torch.device:
    raise ValueError("Group offloading is not enabled for the provided module.")


+def _compute_group_hash(group_id):
+    hashed_id = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
+    # first 16 characters for a reasonably short but unique name
+    return hashed_id[:16]
+
+
 def _maybe_remove_and_reapply_group_offloading(module: torch.nn.Module) -> None:
    r"""
    Removes the group offloading hook from the module and re-applies it. This is useful when the module has been
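A minimal sketch of the file-naming scheme introduced above, assuming only the standard library. `compute_group_hash` reproduces the body of `_compute_group_hash`; the directory name `offload_dir` and the example group ids are hypothetical. The point it illustrates is that a hash of a stable string id yields the same file name on every run, whereas `id(self)` is a memory address and generally differs between processes.

```python
import hashlib
import os


def compute_group_hash(group_id: str) -> str:
    # Same approach as _compute_group_hash in the diff: SHA-256 of the id,
    # truncated to the first 16 hex characters.
    return hashlib.sha256(group_id.encode("utf-8")).hexdigest()[:16]


def group_file_path(offload_dir: str, group_id: str) -> str:
    # Hypothetical helper that builds the safetensors path for one group.
    return os.path.join(offload_dir, f"group_{compute_group_hash(group_id)}.safetensors")


# Block-level ids in the diff follow the pattern "{name}_{first}_{last}".
path_first = group_file_path("offload_dir", "transformer_blocks_0_1")
path_again = group_file_path("offload_dir", "transformer_blocks_0_1")
path_other = group_file_path("offload_dir", "transformer_blocks_2_3")
print(path_first == path_again, path_first == path_other)
```

An empty string is also a valid id under this scheme, which is why the diff checks `group_id is not None` rather than relying on truthiness.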
@@ -470,7 +470,7 @@ def _func_optionally_disable_offloading(_pipeline):
            for _, component in _pipeline.components.items():
                if not isinstance(component, nn.Module) or not hasattr(component, "_hf_hook"):
                    continue
-            remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
+                remove_hook_from_module(component, recurse=is_sequential_cpu_offload)

    return (is_model_cpu_offload, is_sequential_cpu_offload, is_group_offload)

@@ -24,6 +24,7 @@ from typing_extensions import Self
 from .. import __version__
 from ..quantizers import DiffusersAutoQuantizer
 from ..utils import deprecate, is_accelerate_available, logging
+from ..utils.torch_utils import device_synchronize, empty_device_cache
 from .single_file_utils import (
    SingleFileComponentError,
    convert_animatediff_checkpoint_to_diffusers,
@@ -430,6 +431,10 @@ class FromOriginalModelMixin:
                keep_in_fp32_modules=keep_in_fp32_modules,
                unexpected_keys=unexpected_keys,
            )
+            # Ensure tensors are correctly placed on device by synchronizing before returning control to user. This is
+            # required because we move tensors with non_blocking=True, which is slightly faster for model loading.
+            empty_device_cache()
+            device_synchronize()
        else:
            _, unexpected_keys = model.load_state_dict(diffusers_format_checkpoint, strict=False)

@@ -46,6 +46,7 @@ from ..utils import (
 )
 from ..utils.constants import DIFFUSERS_REQUEST_TIMEOUT
 from ..utils.hub_utils import _get_model_file
+from ..utils.torch_utils import device_synchronize, empty_device_cache


 if is_transformers_available():
@@ -1689,6 +1690,10 @@ def create_diffusers_clip_model_from_ldm(

    if is_accelerate_available():
        load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
+        # Ensure tensors are correctly placed on device by synchronizing before returning control to user. This is
+        # required because we move tensors with non_blocking=True, which is slightly faster for model loading.
+        empty_device_cache()
+        device_synchronize()
    else:
        model.load_state_dict(diffusers_format_checkpoint, strict=False)

@@ -2148,6 +2153,10 @@ def create_diffusers_t5_model_from_checkpoint(

    if is_accelerate_available():
        load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
+        # Ensure tensors are correctly placed on device by synchronizing before returning control to user. This is
+        # required because we move tensors with non_blocking=True, which is slightly faster for model loading.
+        empty_device_cache()
+        device_synchronize()
    else:
        model.load_state_dict(diffusers_format_checkpoint)

@@ -18,11 +18,8 @@ from ..models.embeddings import (
    MultiIPAdapterImageProjection,
 )
 from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
-from ..utils import (
-    is_accelerate_available,
-    is_torch_version,
-    logging,
-)
+from ..utils import is_accelerate_available, is_torch_version, logging
+from ..utils.torch_utils import device_synchronize, empty_device_cache


 if is_accelerate_available():
@@ -84,6 +81,8 @@ class FluxTransformer2DLoadersMixin:
        else:
            device_map = {"": self.device}
            load_model_dict_into_meta(image_projection, updated_state_dict, device_map=device_map, dtype=self.dtype)
+            empty_device_cache()
+            device_synchronize()

        return image_projection

@@ -158,6 +157,9 @@ class FluxTransformer2DLoadersMixin:

                key_id += 1

+        empty_device_cache()
+        device_synchronize()
+
        return attn_procs

    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
@@ -18,6 +18,7 @@ from ..models.attention_processor import SD3IPAdapterJointAttnProcessor2_0
 from ..models.embeddings import IPAdapterTimeImageProjection
 from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta
 from ..utils import is_accelerate_available, is_torch_version, logging
+from ..utils.torch_utils import device_synchronize, empty_device_cache


 logger = logging.get_logger(__name__)
@@ -80,6 +81,9 @@ class SD3Transformer2DLoadersMixin:
                    attn_procs[name], layer_state_dict[idx], device_map=device_map, dtype=self.dtype
                )

+        empty_device_cache()
+        device_synchronize()
+
        return attn_procs

    def _convert_ip_adapter_image_proj_to_diffusers(
@@ -147,6 +151,8 @@ class SD3Transformer2DLoadersMixin:
        else:
            device_map = {"": self.device}
            load_model_dict_into_meta(image_proj, updated_state_dict, device_map=device_map, dtype=self.dtype)
+            empty_device_cache()
+            device_synchronize()

        return image_proj

@@ -43,6 +43,7 @@ from ..utils import (
    is_torch_version,
    logging,
 )
+from ..utils.torch_utils import device_synchronize, empty_device_cache
 from .lora_base import _func_optionally_disable_offloading
 from .lora_pipeline import LORA_WEIGHT_NAME, LORA_WEIGHT_NAME_SAFE, TEXT_ENCODER_NAME, UNET_NAME
 from .utils import AttnProcsLayers
@@ -753,6 +754,8 @@ class UNet2DConditionLoadersMixin:
        else:
            device_map = {"": self.device}
            load_model_dict_into_meta(image_projection, updated_state_dict, device_map=device_map, dtype=self.dtype)
+            empty_device_cache()
+            device_synchronize()

        return image_projection

@@ -850,6 +853,9 @@ class UNet2DConditionLoadersMixin:

                key_id += 2

+        empty_device_cache()
+        device_synchronize()
+
        return attn_procs

    def _load_ip_adapter_weights(self, state_dicts, low_cpu_mem_usage=_LOW_CPU_MEM_USAGE_DEFAULT):
@@ -752,7 +752,7 @@ class ControlNetUnionModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
            condition = self.controlnet_cond_embedding(cond)
            feat_seq = torch.mean(condition, dim=(2, 3))
            feat_seq = feat_seq + self.task_embedding[control_idx]
-            if from_multi:
+            if from_multi or len(control_type_idx) == 1:
                inputs.append(feat_seq.unsqueeze(1))
                condition_list.append(condition)
            else:
@@ -772,7 +772,7 @@ class ControlNetUnionModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
        for (idx, condition), scale in zip(enumerate(condition_list[:-1]), conditioning_scale):
            alpha = self.spatial_ch_projs(x[:, idx])
            alpha = alpha.unsqueeze(-1).unsqueeze(-1)
-            if from_multi:
+            if from_multi or len(control_type_idx) == 1:
                controlnet_cond_fuser += condition + alpha
            else:
                controlnet_cond_fuser += condition + alpha * scale
@@ -819,11 +819,11 @@ class ControlNetUnionModel(ModelMixin, ConfigMixin, FromOriginalModelMixin):
        # 6. scaling
        if guess_mode and not self.config.global_pool_conditions:
            scales = torch.logspace(-1, 0, len(down_block_res_samples) + 1, device=sample.device)  # 0.1 to 1.0
-            if from_multi:
+            if from_multi or len(control_type_idx) == 1:
                scales = scales * conditioning_scale[0]
            down_block_res_samples = [sample * scale for sample, scale in zip(down_block_res_samples, scales)]
            mid_block_res_sample = mid_block_res_sample * scales[-1]  # last one
-        elif from_multi:
+        elif from_multi or len(control_type_idx) == 1:
            down_block_res_samples = [sample * conditioning_scale[0] for sample in down_block_res_samples]
            mid_block_res_sample = mid_block_res_sample * conditioning_scale[0]

@@ -16,9 +16,10 @@

 import importlib
 import inspect
+import math
 import os
 from array import array
-from collections import OrderedDict
+from collections import OrderedDict, defaultdict
 from pathlib import Path
 from typing import Dict, List, Optional, Union
 from zipfile import is_zipfile
@@ -38,6 +39,7 @@ from ..utils import (
    _get_model_file,
    deprecate,
    is_accelerate_available,
+    is_accelerate_version,
    is_gguf_available,
    is_torch_available,
    is_torch_version,
@@ -252,6 +254,10 @@ def load_model_dict_into_meta(
                param = param.to(dtype)
                set_module_kwargs["dtype"] = dtype

+        if is_accelerate_version(">", "1.8.1"):
+            set_module_kwargs["non_blocking"] = True
+            set_module_kwargs["clear_cache"] = False
+
        # For compatibility with PyTorch load_state_dict which converts state dict dtype to existing dtype in model, and which
        # uses `param.copy_(input_param)` that preserves the contiguity of the parameter in the model.
        # Reference: https://github.com/pytorch/pytorch/blob/db79ceb110f6646523019a59bbd7b838f43d4a86/torch/nn/modules/module.py#L2040C29-L2040C29
@@ -520,3 +526,60 @@ def load_gguf_checkpoint(gguf_checkpoint_path, return_tensors=False):
        parsed_parameters[name] = GGUFParameter(weights, quant_type=quant_type) if is_gguf_quant else weights

    return parsed_parameters
+
+
+def _find_mismatched_keys(state_dict, model_state_dict, loaded_keys, ignore_mismatched_sizes):
+    mismatched_keys = []
+    if not ignore_mismatched_sizes:
+        return mismatched_keys
+    for checkpoint_key in loaded_keys:
+        model_key = checkpoint_key
+        # If the checkpoint is sharded, we may not have the key here.
+        if checkpoint_key not in state_dict:
+            continue
+
+        if model_key in model_state_dict and state_dict[checkpoint_key].shape != model_state_dict[model_key].shape:
+            mismatched_keys.append(
+                (checkpoint_key, state_dict[checkpoint_key].shape, model_state_dict[model_key].shape)
+            )
+            del state_dict[checkpoint_key]
+    return mismatched_keys
+
+
+def _expand_device_map(device_map, param_names):
+    """
+    Expand a device map to return the correspondence between parameter names and devices.
+    """
+    new_device_map = {}
+    for module, device in device_map.items():
+        new_device_map.update(
+            {p: device for p in param_names if p == module or p.startswith(f"{module}.") or module == ""}
+        )
+    return new_device_map
+
+
+# Adapted from: https://github.com/huggingface/transformers/blob/0687d481e2c71544501ef9cb3eef795a6e79b1de/src/transformers/modeling_utils.py#L5859
+def _caching_allocator_warmup(model, expanded_device_map: Dict[str, torch.device], dtype: torch.dtype) -> None:
+    """
+    This function warms up the caching allocator based on the size of the model tensors that will reside on each
+    device. This allows a single large malloc call instead of many repeated calls later while loading the model,
+    which is the actual loading-speed bottleneck. Calling this function can cut the model loading time by a very
+    large margin.
+    """
+    # Remove disk and cpu devices, and cast to proper torch.device
+    accelerator_device_map = {
+        param: torch.device(device)
+        for param, device in expanded_device_map.items()
+        if str(device) not in ["cpu", "disk"]
+    }
+    parameter_count = defaultdict(lambda: 0)
+    for param_name, device in accelerator_device_map.items():
+        try:
+            param = model.get_parameter(param_name)
+        except AttributeError:
+            param = model.get_buffer(param_name)
+        parameter_count[device] += math.prod(param.shape)
+
+    # This will kick off the caching allocator to avoid having to Malloc afterwards
+    for device, param_count in parameter_count.items():
+        _ = torch.empty(param_count, dtype=dtype, device=device, requires_grad=False)
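A torch-free sketch of the bookkeeping behind the two helpers above, assuming devices are plain strings and parameter shapes come from a hypothetical `shapes` dictionary. `expand_device_map` repeats the matching rule from `_expand_device_map`; the counting loop mirrors how `_caching_allocator_warmup` sums element counts per accelerator device before making one large allocation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def expand_device_map(device_map: Dict[str, str], param_names: List[str]) -> Dict[str, str]:
    # Same matching rule as _expand_device_map: exact name, "module." prefix,
    # or the empty-string module, which matches every parameter.
    expanded = {}
    for module, device in device_map.items():
        expanded.update(
            {p: device for p in param_names if p == module or p.startswith(f"{module}.") or module == ""}
        )
    return expanded


# Hypothetical parameter shapes standing in for a real model.
shapes: Dict[str, Tuple[int, ...]] = {
    "encoder.weight": (4, 8),
    "encoder.bias": (8,),
    "decoder.weight": (8, 2),
    "encoder_extra.weight": (3, 3),
}
expanded = expand_device_map({"encoder": "cuda:0", "decoder": "cpu"}, list(shapes))

# Count elements per device, skipping "cpu" and "disk" as the warmup does.
element_count = defaultdict(int)
for name, device in expanded.items():
    if device not in ("cpu", "disk"):
        element_count[device] += math.prod(shapes[name])

print(expanded)
print(dict(element_count))
```

Note that `encoder_extra.weight` is not assigned to `cuda:0`: the prefix check includes the trailing dot, so `encoder` does not match `encoder_extra`.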
@@ -62,10 +62,14 @@ from ..utils.hub_utils import (
    load_or_create_model_card,
    populate_model_card,
 )
+from ..utils.torch_utils import device_synchronize, empty_device_cache
 from .model_loading_utils import (
+    _caching_allocator_warmup,
    _determine_device_map,
+    _expand_device_map,
    _fetch_index_file,
    _fetch_index_file_legacy,
+    _find_mismatched_keys,
    _load_state_dict_into_model,
    load_model_dict_into_meta,
    load_state_dict,
@@ -1469,11 +1473,6 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
            for pat in cls._keys_to_ignore_on_load_unexpected:
                unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]

-        mismatched_keys = []
-
-        assign_to_params_buffers = None
-        error_msgs = []
-
        # Deal with offload
        if device_map is not None and "disk" in device_map.values():
            if offload_folder is None:
@@ -1482,18 +1481,27 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
                    " for them. Alternatively, make sure you have `safetensors` installed if the model you are using"
                    " offers the weights in this format."
                )
-            if offload_folder is not None:
+            else:
                os.makedirs(offload_folder, exist_ok=True)
            if offload_state_dict is None:
                offload_state_dict = True

+        # If a device map has been used, we can speed up the load time by warming up the device caching allocator.
+        # If we don't warm up, each tensor allocation on device calls the allocator for memory (effectively, a
+        # lot of individual calls to device malloc). We can, however, preallocate the memory required by the
+        # tensors using their expected shape and not performing any initialization of the memory (empty data).
+        # When the actual device allocations happen, the allocator already has a pool of unused device memory
+        # that it can re-use for faster loading of the model.
+        # TODO: add support for warmup with hf_quantizer
+        if device_map is not None and hf_quantizer is None:
+            expanded_device_map = _expand_device_map(device_map, expected_keys)
+            _caching_allocator_warmup(model, expanded_device_map, dtype)
+
        offload_index = {} if device_map is not None and "disk" in device_map.values() else None
+        state_dict_folder, state_dict_index = None, None
        if offload_state_dict:
            state_dict_folder = tempfile.mkdtemp()
            state_dict_index = {}
-        else:
-            state_dict_folder = None
-            state_dict_index = None

        if state_dict is not None:
            # load_state_dict will manage the case where we pass a dict instead of a file
@@ -1503,38 +1511,14 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
        if len(resolved_model_file) > 1:
            resolved_model_file = logging.tqdm(resolved_model_file, desc="Loading checkpoint shards")

+        mismatched_keys = []
+        assign_to_params_buffers = None
+        error_msgs = []
+
        for shard_file in resolved_model_file:
            state_dict = load_state_dict(shard_file, dduf_entries=dduf_entries)
-
-            def _find_mismatched_keys(
-                state_dict,
-                model_state_dict,
-                loaded_keys,
-                ignore_mismatched_sizes,
-            ):
-                mismatched_keys = []
-                if ignore_mismatched_sizes:
-                    for checkpoint_key in loaded_keys:
-                        model_key = checkpoint_key
-                        # If the checkpoint is sharded, we may not have the key here.
-                        if checkpoint_key not in state_dict:
-                            continue
-
-                        if (
-                            model_key in model_state_dict
-                            and state_dict[checkpoint_key].shape != model_state_dict[model_key].shape
-                        ):
-                            mismatched_keys.append(
-                                (checkpoint_key, state_dict[checkpoint_key].shape, model_state_dict[model_key].shape)
-                            )
-                            del state_dict[checkpoint_key]
-                return mismatched_keys
-
            mismatched_keys += _find_mismatched_keys(
-                state_dict,
-                model_state_dict,
-                loaded_keys,
-                ignore_mismatched_sizes,
+                state_dict, model_state_dict, loaded_keys, ignore_mismatched_sizes
            )

            if low_cpu_mem_usage:
@@ -1554,9 +1538,13 @@ class ModelMixin(torch.nn.Module, PushToHubMixin):
            else:
                if assign_to_params_buffers is None:
                    assign_to_params_buffers = check_support_param_buffer_assignment(model, state_dict)
-
                error_msgs += _load_state_dict_into_model(model, state_dict, assign_to_params_buffers)

+        # Ensure tensors are correctly placed on device by synchronizing before returning control to user. This is
+        # required because we move tensors with non_blocking=True, which is slightly faster for model loading.
+        empty_device_cache()
+        device_synchronize()
+
        if offload_index is not None and len(offload_index) > 0:
            save_offload_index(offload_index, offload_folder)
            offload_index = None
@@ -187,9 +187,15 @@ class CosmosAttnProcessor2_0:
            key = apply_rotary_emb(key, image_rotary_emb, use_real=True, use_real_unbind_dim=-2)

        # 4. Prepare for GQA
-        query_idx = torch.tensor(query.size(3), device=query.device)
-        key_idx = torch.tensor(key.size(3), device=key.device)
-        value_idx = torch.tensor(value.size(3), device=value.device)
+        if torch.onnx.is_in_onnx_export():
+            query_idx = torch.tensor(query.size(3), device=query.device)
+            key_idx = torch.tensor(key.size(3), device=key.device)
+            value_idx = torch.tensor(value.size(3), device=value.device)
+
+        else:
+            query_idx = query.size(3)
+            key_idx = key.size(3)
+            value_idx = value.size(3)
        key = key.repeat_interleave(query_idx // key_idx, dim=3)
        value = value.repeat_interleave(query_idx // value_idx, dim=3)

@@ -490,6 +490,7 @@ class FluxTransformer2DModel(
                    encoder_hidden_states,
                    temb,
                    image_rotary_emb,
+                    joint_attention_kwargs,
                )

            else:
@@ -521,6 +522,7 @@ class FluxTransformer2DModel(
                    encoder_hidden_states,
                    temb,
                    image_rotary_emb,
+                    joint_attention_kwargs,
                )

            else:
@@ -323,6 +323,7 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
    """

    config_name = "config.json"
+    model_name = None

    @classmethod
    def _get_signature_keys(cls, obj):
@@ -333,6 +334,14 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):

        return expected_modules, optional_parameters

+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return []
+
+    @property
+    def expected_configs(self) -> List[ConfigSpec]:
+        return []
+
    @classmethod
    def from_pretrained(
        cls,
@@ -358,7 +367,9 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
            trust_remote_code, pretrained_model_name_or_path, has_remote_code
        )
        if not (has_remote_code and trust_remote_code):
-            raise ValueError("TODO")
+            raise ValueError(
+                "Selected model repository does not appear to have any custom code or does not have a valid `config.json` file."
+            )

        class_ref = config["auto_map"][cls.__name__]
        module_file, class_name = class_ref.split(".")
@@ -367,7 +378,6 @@ class ModularPipelineBlocks(ConfigMixin, PushToHubMixin):
            pretrained_model_name_or_path,
            module_file=module_file,
            class_name=class_name,
-            is_modular=True,
            **hub_kwargs,
            **kwargs,
        )
@@ -93,7 +93,7 @@ class ComponentSpec:
    config: Optional[FrozenDict] = None
    # YiYi Notes: should we change it to pretrained_model_name_or_path for consistency? a bit long for a field name
    repo: Optional[Union[str, List[str]]] = field(default=None, metadata={"loading": True})
-    subfolder: Optional[str] = field(default=None, metadata={"loading": True})
+    subfolder: Optional[str] = field(default="", metadata={"loading": True})
    variant: Optional[str] = field(default=None, metadata={"loading": True})
    revision: Optional[str] = field(default=None, metadata={"loading": True})
    default_creation_method: Literal["from_config", "from_pretrained"] = "from_pretrained"
@@ -19,8 +19,9 @@ import PIL
 import torch

 from ...configuration_utils import FrozenDict
+from ...guiders import ClassifierFreeGuidance
 from ...image_processor import VaeImageProcessor
-from ...models import AutoencoderKL, ControlNetModel, ControlNetUnionModel
+from ...models import AutoencoderKL, ControlNetModel, ControlNetUnionModel, UNet2DConditionModel
 from ...pipelines.controlnet.multicontrolnet import MultiControlNetModel
 from ...schedulers import EulerDiscreteScheduler
 from ...utils import logging
@@ -266,37 +267,37 @@ class StableDiffusionXLInputStep(PipelineBlock):
            OutputParam(
                "prompt_embeds",
                type_hint=torch.Tensor,
-                kwargs_type="guider_input_fields",
+                kwargs_type="guider_input_fields",  # already in the intermediates state, but declared here again for guider_input_fields
                description="text embeddings used to guide the image generation",
            ),
            OutputParam(
                "negative_prompt_embeds",
                type_hint=torch.Tensor,
-                kwargs_type="guider_input_fields",
+                kwargs_type="guider_input_fields",  # already in the intermediates state, but declared here again for guider_input_fields
                description="negative text embeddings used to guide the image generation",
            ),
            OutputParam(
                "pooled_prompt_embeds",
                type_hint=torch.Tensor,
-                kwargs_type="guider_input_fields",
+                kwargs_type="guider_input_fields",  # already in the intermediates state, but declared here again for guider_input_fields
                description="pooled text embeddings used to guide the image generation",
            ),
            OutputParam(
                "negative_pooled_prompt_embeds",
                type_hint=torch.Tensor,
-                kwargs_type="guider_input_fields",
+                kwargs_type="guider_input_fields",  # already in the intermediates state, but declared here again for guider_input_fields
                description="negative pooled text embeddings used to guide the image generation",
            ),
            OutputParam(
                "ip_adapter_embeds",
                type_hint=List[torch.Tensor],
-                kwargs_type="guider_input_fields",
+                kwargs_type="guider_input_fields",  # already in the intermediates state, but declared here again for guider_input_fields
                description="image embeddings for IP-Adapter",
            ),
            OutputParam(
                "negative_ip_adapter_embeds",
                type_hint=List[torch.Tensor],
-                kwargs_type="guider_input_fields",
+                kwargs_type="guider_input_fields",  # already in the intermediates state, but declared here again for guider_input_fields
                description="negative image embeddings for IP-Adapter",
            ),
        ]
@@ -683,12 +684,6 @@ class StableDiffusionXLInpaintPrepareLatentsStep(PipelineBlock):
            OutputParam(
                "latents", type_hint=torch.Tensor, description="The initial latents to use for the denoising process"
            ),
-            OutputParam("mask", type_hint=torch.Tensor, description="The mask to use for inpainting generation"),
-            OutputParam(
-                "masked_image_latents",
-                type_hint=torch.Tensor,
-                description="The masked image latents to use for the inpainting generation (only for inpainting-specific unet)",
-            ),
            OutputParam(
                "noise",
                type_hint=torch.Tensor,
@@ -993,6 +988,7 @@ class StableDiffusionXLPrepareLatentsStep(PipelineBlock):
    def expected_components(self) -> List[ComponentSpec]:
        return [
            ComponentSpec("scheduler", EulerDiscreteScheduler),
+            ComponentSpec("vae", AutoencoderKL),
        ]

    @property
@@ -1105,6 +1101,18 @@ class StableDiffusionXLImg2ImgPrepareAdditionalConditioningStep(PipelineBlock):
            ConfigSpec("requires_aesthetics_score", False),
        ]

+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("unet", UNet2DConditionModel),
+            ComponentSpec(
+                "guider",
+                ClassifierFreeGuidance,
+                config=FrozenDict({"guidance_scale": 7.5}),
+                default_creation_method="from_config",
+            ),
+        ]
+
    @property
    def description(self) -> str:
        return "Step that prepares the additional conditioning for the image-to-image/inpainting generation process"
@@ -1315,6 +1323,18 @@ class StableDiffusionXLPrepareAdditionalConditioningStep(PipelineBlock):
    def description(self) -> str:
        return "Step that prepares the additional conditioning for the text-to-image generation process"

+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec("unet", UNet2DConditionModel),
+            ComponentSpec(
+                "guider",
+                ClassifierFreeGuidance,
+                config=FrozenDict({"guidance_scale": 7.5}),
+                default_creation_method="from_config",
+            ),
+        ]
+
    @property
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
@@ -167,6 +167,17 @@ class StableDiffusionXLInpaintOverlayMaskStep(PipelineBlock):
            + "only needed when you are using the `padding_mask_crop` option when pre-processing the image and mask"
        )

+    @property
+    def expected_components(self) -> List[ComponentSpec]:
+        return [
+            ComponentSpec(
+                "image_processor",
+                VaeImageProcessor,
+                config=FrozenDict({"vae_scale_factor": 8}),
+                default_creation_method="from_config",
+            ),
+        ]
+
    @property
    def inputs(self) -> List[Tuple[str, Any]]:
        return [
@@ -190,16 +201,6 @@ class StableDiffusionXLInpaintOverlayMaskStep(PipelineBlock):
            ),
        ]

-    @property
-    def intermediate_outputs(self) -> List[str]:
-        return [
-            OutputParam(
-                "images",
-                type_hint=Union[List[PIL.Image.Image], List[torch.Tensor], List[np.array]],
-                description="The generated images with the mask overlayed",
-            )
-        ]
-
    @torch.no_grad()
    def __call__(self, components, state: PipelineState) -> PipelineState:
        block_state = self.get_block_state(state)
@@ -91,7 +91,8 @@ class StableDiffusionXLModularPipeline(
        return num_channels_latents


-# YiYi Notes: not used yet, maintain a list of schema that can be used across all pipeline blocks
+# YiYi/Sayak TODO: not used yet, maintain a list of schema that can be used across all pipeline blocks
+# auto_docstring
 SDXL_INPUTS_SCHEMA = {
    "prompt": InputParam(
        "prompt", type_hint=Union[str, List[str]], description="The prompt or prompts to guide the image generation"
@@ -18,7 +18,6 @@ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
 import numpy as np
 import PIL.Image
 import torch
-import torch.nn.functional as F
 from transformers import (
    CLIPImageProcessor,
    CLIPTextModel,
@@ -35,7 +34,13 @@ from ...loaders import (
    StableDiffusionXLLoraLoaderMixin,
    TextualInversionLoaderMixin,
 )
-from ...models import AutoencoderKL, ControlNetModel, ControlNetUnionModel, ImageProjection, UNet2DConditionModel
+from ...models import (
+    AutoencoderKL,
+    ControlNetUnionModel,
+    ImageProjection,
+    MultiControlNetUnionModel,
+    UNet2DConditionModel,
+)
 from ...models.attention_processor import (
    AttnProcessor2_0,
    XFormersAttnProcessor,
@@ -230,7 +235,9 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        tokenizer: CLIPTokenizer,
        tokenizer_2: CLIPTokenizer,
        unet: UNet2DConditionModel,
-        controlnet: ControlNetUnionModel,
+        controlnet: Union[
+            ControlNetUnionModel, List[ControlNetUnionModel], Tuple[ControlNetUnionModel], MultiControlNetUnionModel
+        ],
        scheduler: KarrasDiffusionSchedulers,
        requires_aesthetics_score: bool = False,
        force_zeros_for_empty_prompt: bool = True,
@@ -240,8 +247,8 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
    ):
        super().__init__()

-        if not isinstance(controlnet, ControlNetUnionModel):
-            raise ValueError("Expected `controlnet` to be of type `ControlNetUnionModel`.")
+        if isinstance(controlnet, (list, tuple)):
+            controlnet = MultiControlNetUnionModel(controlnet)

        self.register_modules(
            vae=vae,
@@ -660,6 +667,7 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        controlnet_conditioning_scale=1.0,
        control_guidance_start=0.0,
        control_guidance_end=1.0,
+        control_mode=None,
        callback_on_step_end_tensor_inputs=None,
        padding_mask_crop=None,
    ):
@@ -747,25 +755,34 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
                "If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed. Make sure to generate `negative_pooled_prompt_embeds` from the same text encoder that was used to generate `negative_prompt_embeds`."
            )

-        # Check `image`
-        is_compiled = hasattr(F, "scaled_dot_product_attention") and isinstance(
-            self.controlnet, torch._dynamo.eval_frame.OptimizedModule
-        )
-        if (
-            isinstance(self.controlnet, ControlNetModel)
-            or is_compiled
-            and isinstance(self.controlnet._orig_mod, ControlNetModel)
-        ):
-            self.check_image(image, prompt, prompt_embeds)
-        elif (
-            isinstance(self.controlnet, ControlNetUnionModel)
-            or is_compiled
-            and isinstance(self.controlnet._orig_mod, ControlNetUnionModel)
-        ):
-            self.check_image(image, prompt, prompt_embeds)
+        # `prompt` needs more sophisticated handling when there are multiple
+        # conditionings.
+        if isinstance(self.controlnet, MultiControlNetUnionModel):
+            if isinstance(prompt, list):
+                logger.warning(
+                    f"You have {len(self.controlnet.nets)} ControlNets and you have passed {len(prompt)}"
+                    " prompts. The conditionings will be fixed across the prompts."
+                )

-        else:
-            assert False
+        # Check `image`
+        controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet
+
+        if isinstance(controlnet, ControlNetUnionModel):
+            for image_ in image:
+                self.check_image(image_, prompt, prompt_embeds)
+        elif isinstance(controlnet, MultiControlNetUnionModel):
+            if not isinstance(image, list):
+                raise TypeError("For multiple controlnets: `image` must be type `list`")
+            elif not all(isinstance(i, list) for i in image):
+                raise ValueError("For multiple controlnets: elements of `image` must be list of conditionings.")
+            elif len(image) != len(self.controlnet.nets):
+                raise ValueError(
+                    f"For multiple controlnets: `image` must have the same length as the number of controlnets, but got {len(image)} images and {len(self.controlnet.nets)} ControlNets."
+                )
+
+            for images_ in image:
+                for image_ in images_:
+                    self.check_image(image_, prompt, prompt_embeds)

        if not isinstance(control_guidance_start, (tuple, list)):
            control_guidance_start = [control_guidance_start]
@@ -778,6 +795,12 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
                f"`control_guidance_start` has {len(control_guidance_start)} elements, but `control_guidance_end` has {len(control_guidance_end)} elements. Make sure to provide the same number of elements to each list."
            )

+        if isinstance(controlnet, MultiControlNetUnionModel):
+            if len(control_guidance_start) != len(self.controlnet.nets):
+                raise ValueError(
+                    f"`control_guidance_start`: {control_guidance_start} has {len(control_guidance_start)} elements but there are {len(self.controlnet.nets)} controlnets available. Make sure to provide {len(self.controlnet.nets)}."
+                )
+
        for start, end in zip(control_guidance_start, control_guidance_end):
            if start >= end:
                raise ValueError(
@@ -788,6 +811,28 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
            if end > 1.0:
                raise ValueError(f"control guidance end: {end} can't be larger than 1.0.")

+        # Check `control_mode`
+        if isinstance(controlnet, ControlNetUnionModel):
+            if max(control_mode) >= controlnet.config.num_control_type:
+                raise ValueError(f"control_mode: must be lower than {controlnet.config.num_control_type}.")
+        elif isinstance(controlnet, MultiControlNetUnionModel):
+            for _control_mode, _controlnet in zip(control_mode, self.controlnet.nets):
+                if max(_control_mode) >= _controlnet.config.num_control_type:
+                    raise ValueError(f"control_mode: must be lower than {_controlnet.config.num_control_type}.")
+
+        # Equal number of `image` and `control_mode` elements
+        if isinstance(controlnet, ControlNetUnionModel):
+            if len(image) != len(control_mode):
+                raise ValueError("Expected len(control_image) == len(control_mode)")
+        elif isinstance(controlnet, MultiControlNetUnionModel):
+            if not all(isinstance(i, list) for i in control_mode):
+                raise ValueError(
+                    "For multiple controlnets: elements of control_mode must be lists representing conditioning mode."
+                )
+
+            elif sum(len(x) for x in image) != sum(len(x) for x in control_mode):
+                raise ValueError("Expected len(control_image) == len(control_mode)")
+
        if ip_adapter_image is not None and ip_adapter_image_embeds is not None:
            raise ValueError(
                "Provide either `ip_adapter_image` or `ip_adapter_image_embeds`. Cannot leave both `ip_adapter_image` and `ip_adapter_image_embeds` defined."
@@ -1117,7 +1162,7 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        prompt_2: Optional[Union[str, List[str]]] = None,
        image: PipelineImageInput = None,
        mask_image: PipelineImageInput = None,
-        control_image: PipelineImageInput = None,
+        control_image: Union[PipelineImageInput, List[PipelineImageInput]] = None,
        height: Optional[int] = None,
        width: Optional[int] = None,
        padding_mask_crop: Optional[int] = None,
@@ -1145,7 +1190,7 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        guess_mode: bool = False,
        control_guidance_start: Union[float, List[float]] = 0.0,
        control_guidance_end: Union[float, List[float]] = 1.0,
-        control_mode: Optional[Union[int, List[int]]] = None,
+        control_mode: Optional[Union[int, List[int], List[List[int]]]] = None,
        guidance_rescale: float = 0.0,
        original_size: Tuple[int, int] = None,
        crops_coords_top_left: Tuple[int, int] = (0, 0),
@@ -1177,6 +1222,13 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
                repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted
                to a single channel (luminance) before use. If it's a tensor, it should contain one color channel (L)
                instead of 3, so the expected shape would be `(B, H, W, 1)`.
+            control_image (`PipelineImageInput` or `List[PipelineImageInput]`, *optional*):
+                The ControlNet input condition to provide guidance to the `unet` for generation. If the type is
+                specified as `torch.Tensor`, it is passed to ControlNet as is. `PIL.Image.Image` can also be accepted
+                as an image. The dimensions of the output image default to `image`'s dimensions. If height and/or
+                width are passed, `image` is resized accordingly. If multiple ControlNets are specified in `init`,
+                images must be passed as a list such that each element of the list can be correctly batched for input
+                to a single ControlNet.
            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
                The height in pixels of the generated image.
            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
@@ -1269,6 +1321,22 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
+            controlnet_conditioning_scale (`float` or `List[float]`, *optional*, defaults to 1.0):
+                The outputs of the ControlNet are multiplied by `controlnet_conditioning_scale` before they are added
+                to the residual in the original `unet`. If multiple ControlNets are specified in `init`, you can set
+                the corresponding scale as a list.
+            guess_mode (`bool`, *optional*, defaults to `False`):
+                The ControlNet encoder tries to recognize the content of the input image even if you remove all
+                prompts. A `guidance_scale` value between 3.0 and 5.0 is recommended.
+            control_guidance_start (`float` or `List[float]`, *optional*, defaults to 0.0):
+                The percentage of total steps at which the ControlNet starts applying.
+            control_guidance_end (`float` or `List[float]`, *optional*, defaults to 1.0):
+                The percentage of total steps at which the ControlNet stops applying.
+            control_mode (`int` or `List[int]` or `List[List[int]]`, *optional*):
+                The control condition types for the ControlNet. See the ControlNet's model card for information on the
+                available control modes. If multiple ControlNets are specified in `init`, control_mode should be a list
+                where each ControlNet should have its corresponding control mode list. Should reflect the order of
+                conditions in control_image.
            original_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
                If `original_size` is not the same as `target_size` the image will appear to be down- or upsampled.
                `original_size` defaults to `(width, height)` if not specified. Part of SDXL's micro-conditioning as
@@ -1333,22 +1401,6 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(

        controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet

-        # align format for control guidance
-        if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
-            control_guidance_start = len(control_guidance_end) * [control_guidance_start]
-        elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
-            control_guidance_end = len(control_guidance_start) * [control_guidance_end]
-
-        # # 0.0 Default height and width to unet
-        # height = height or self.unet.config.sample_size * self.vae_scale_factor
-        # width = width or self.unet.config.sample_size * self.vae_scale_factor
-
-        # 0.1 align format for control guidance
-        if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
-            control_guidance_start = len(control_guidance_end) * [control_guidance_start]
-        elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
-            control_guidance_end = len(control_guidance_start) * [control_guidance_end]
-
        if not isinstance(control_image, list):
            control_image = [control_image]
        else:
@@ -1357,40 +1409,59 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        if not isinstance(control_mode, list):
            control_mode = [control_mode]

-        if len(control_image) != len(control_mode):
-            raise ValueError("Expected len(control_image) == len(control_type)")
+        if isinstance(controlnet, MultiControlNetUnionModel):
+            control_image = [[item] for item in control_image]
+            control_mode = [[item] for item in control_mode]

-        num_control_type = controlnet.config.num_control_type
-
-        # 1. Check inputs
-        control_type = [0 for _ in range(num_control_type)]
-        for _image, control_idx in zip(control_image, control_mode):
-            control_type[control_idx] = 1
-            self.check_inputs(
-                prompt,
-                prompt_2,
-                _image,
-                mask_image,
-                strength,
-                num_inference_steps,
-                callback_steps,
-                output_type,
-                negative_prompt,
-                negative_prompt_2,
-                prompt_embeds,
-                negative_prompt_embeds,
-                ip_adapter_image,
-                ip_adapter_image_embeds,
-                pooled_prompt_embeds,
-                negative_pooled_prompt_embeds,
-                controlnet_conditioning_scale,
-                control_guidance_start,
-                control_guidance_end,
-                callback_on_step_end_tensor_inputs,
-                padding_mask_crop,
+        # align format for control guidance
+        if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
+            control_guidance_start = len(control_guidance_end) * [control_guidance_start]
+        elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
+            control_guidance_end = len(control_guidance_start) * [control_guidance_end]
+        elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
+            mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetUnionModel) else len(control_mode)
+            control_guidance_start, control_guidance_end = (
+                mult * [control_guidance_start],
+                mult * [control_guidance_end],
            )

-        control_type = torch.Tensor(control_type)
+        if isinstance(controlnet_conditioning_scale, float):
+            mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetUnionModel) else len(control_mode)
+            controlnet_conditioning_scale = [controlnet_conditioning_scale] * mult
+
+        # 1. Check inputs
+        self.check_inputs(
+            prompt,
+            prompt_2,
+            control_image,
+            mask_image,
+            strength,
+            num_inference_steps,
+            callback_steps,
+            output_type,
+            negative_prompt,
+            negative_prompt_2,
+            prompt_embeds,
+            negative_prompt_embeds,
+            ip_adapter_image,
+            ip_adapter_image_embeds,
+            pooled_prompt_embeds,
+            negative_pooled_prompt_embeds,
+            controlnet_conditioning_scale,
+            control_guidance_start,
+            control_guidance_end,
+            control_mode,
+            callback_on_step_end_tensor_inputs,
+            padding_mask_crop,
+        )
+
+        if isinstance(controlnet, ControlNetUnionModel):
+            control_type = torch.zeros(controlnet.config.num_control_type).scatter_(0, torch.tensor(control_mode), 1)
+        elif isinstance(controlnet, MultiControlNetUnionModel):
+            control_type = [
+                torch.zeros(controlnet_.config.num_control_type).scatter_(0, torch.tensor(control_mode_), 1)
+                for control_mode_, controlnet_ in zip(control_mode, self.controlnet.nets)
+            ]

        self._guidance_scale = guidance_scale
        self._clip_skip = clip_skip
@@ -1483,21 +1554,55 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        init_image = init_image.to(dtype=torch.float32)

        # 5.2 Prepare control images
-        for idx, _ in enumerate(control_image):
-            control_image[idx] = self.prepare_control_image(
-                image=control_image[idx],
-                width=width,
-                height=height,
-                batch_size=batch_size * num_images_per_prompt,
-                num_images_per_prompt=num_images_per_prompt,
-                device=device,
-                dtype=controlnet.dtype,
-                crops_coords=crops_coords,
-                resize_mode=resize_mode,
-                do_classifier_free_guidance=self.do_classifier_free_guidance,
-                guess_mode=guess_mode,
-            )
-            height, width = control_image[idx].shape[-2:]
+        if isinstance(controlnet, ControlNetUnionModel):
+            control_images = []
+
+            for image_ in control_image:
+                image_ = self.prepare_control_image(
+                    image=image_,
+                    width=width,
+                    height=height,
+                    batch_size=batch_size * num_images_per_prompt,
+                    num_images_per_prompt=num_images_per_prompt,
+                    device=device,
+                    dtype=controlnet.dtype,
+                    crops_coords=crops_coords,
+                    resize_mode=resize_mode,
+                    do_classifier_free_guidance=self.do_classifier_free_guidance,
+                    guess_mode=guess_mode,
+                )
+
+                control_images.append(image_)
+
+            control_image = control_images
+            height, width = control_image[0].shape[-2:]
+
+        elif isinstance(controlnet, MultiControlNetUnionModel):
+            control_images = []
+
+            for control_image_ in control_image:
+                images = []
+
+                for image_ in control_image_:
+                    image_ = self.prepare_control_image(
+                        image=image_,
+                        width=width,
+                        height=height,
+                        batch_size=batch_size * num_images_per_prompt,
+                        num_images_per_prompt=num_images_per_prompt,
+                        device=device,
+                        dtype=controlnet.dtype,
+                        crops_coords=crops_coords,
+                        resize_mode=resize_mode,
+                        do_classifier_free_guidance=self.do_classifier_free_guidance,
+                        guess_mode=guess_mode,
+                    )
+
+                    images.append(image_)
+                control_images.append(images)
+
+            control_image = control_images
+            height, width = control_image[0][0].shape[-2:]

        # 5.3 Prepare mask
        mask = self.mask_processor.preprocess(
@@ -1559,10 +1664,11 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
        # 8.2 Create tensor stating which controlnets to keep
        controlnet_keep = []
        for i in range(len(timesteps)):
-            controlnet_keep.append(
-                1.0
-                - float(i / len(timesteps) < control_guidance_start or (i + 1) / len(timesteps) > control_guidance_end)
-            )
+            keeps = [
+                1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
+                for s, e in zip(control_guidance_start, control_guidance_end)
+            ]
+            controlnet_keep.append(keeps)

        # 9. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
        height, width = latents.shape[-2:]
@@ -1627,11 +1733,24 @@ class StableDiffusionXLControlNetUnionInpaintPipeline(
            num_inference_steps = len(list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps)))
            timesteps = timesteps[:num_inference_steps]

-        control_type = (
-            control_type.reshape(1, -1)
-            .to(device, dtype=prompt_embeds.dtype)
-            .repeat(batch_size * num_images_per_prompt * 2, 1)
+        control_type_repeat_factor = (
+            batch_size * num_images_per_prompt * (2 if self.do_classifier_free_guidance else 1)
        )
+
+        if isinstance(controlnet, ControlNetUnionModel):
+            control_type = (
+                control_type.reshape(1, -1)
+                .to(self._execution_device, dtype=prompt_embeds.dtype)
+                .repeat(control_type_repeat_factor, 1)
+            )
+        elif isinstance(controlnet, MultiControlNetUnionModel):
+            control_type = [
+                _control_type.reshape(1, -1)
+                .to(self._execution_device, dtype=prompt_embeds.dtype)
+                .repeat(control_type_repeat_factor, 1)
+                for _control_type in control_type
+            ]
+
        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                if self.interrupt:
@@ -1452,17 +1452,21 @@ class StableDiffusionXLControlNetUnionPipeline(
        is_controlnet_compiled = is_compiled_module(self.controlnet)
        is_torch_higher_equal_2_1 = is_torch_version(">=", "2.1")

+        control_type_repeat_factor = (
+            batch_size * num_images_per_prompt * (2 if self.do_classifier_free_guidance else 1)
+        )
+
        if isinstance(controlnet, ControlNetUnionModel):
            control_type = (
                control_type.reshape(1, -1)
                .to(self._execution_device, dtype=prompt_embeds.dtype)
-                .repeat(batch_size * num_images_per_prompt * 2, 1)
+                .repeat(control_type_repeat_factor, 1)
            )
-        if isinstance(controlnet, MultiControlNetUnionModel):
+        elif isinstance(controlnet, MultiControlNetUnionModel):
            control_type = [
                _control_type.reshape(1, -1)
                .to(self._execution_device, dtype=prompt_embeds.dtype)
-                .repeat(batch_size * num_images_per_prompt * 2, 1)
+                .repeat(control_type_repeat_factor, 1)
                for _control_type in control_type
            ]

@@ -1096,6 +1096,8 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin):
        model.register_to_config(_name_or_path=pretrained_model_name_or_path)
        if device_map is not None:
            setattr(model, "hf_device_map", final_device_map)
+        if quantization_config is not None:
+            setattr(model, "quantization_config", quantization_config)
        return model

    @property
@@ -25,6 +25,7 @@ from transformers import (
    T5TokenizerFast,
 )

+from ...callbacks import MultiPipelineCallbacks, PipelineCallback
 from ...image_processor import PipelineImageInput, VaeImageProcessor
 from ...loaders import FromSingleFileMixin, SD3IPAdapterMixin, SD3LoraLoaderMixin
 from ...models.autoencoders import AutoencoderKL
@@ -184,7 +185,7 @@ class StableDiffusion3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingle

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->image_encoder->transformer->vae"
    _optional_components = ["image_encoder", "feature_extractor"]
-    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]
+    _callback_tensor_inputs = ["latents", "prompt_embeds", "pooled_prompt_embeds"]

    def __init__(
        self,
@@ -923,6 +924,9 @@ class StableDiffusion3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingle
        height = height or self.default_sample_size * self.vae_scale_factor
        width = width or self.default_sample_size * self.vae_scale_factor

+        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
+            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
+
        # 1. Check inputs. Raise error if not correct
        self.check_inputs(
            prompt,
@@ -1109,10 +1113,7 @@ class StableDiffusion3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingle

                    latents = callback_outputs.pop("latents", latents)
                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop("negative_prompt_embeds", negative_prompt_embeds)
-                    negative_pooled_prompt_embeds = callback_outputs.pop(
-                        "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
-                    )
+                    pooled_prompt_embeds = callback_outputs.pop("pooled_prompt_embeds", pooled_prompt_embeds)

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
@@ -12,183 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import inspect
-from typing import Dict, List, Optional, Union

-from ..utils import is_transformers_available, logging
 from .auto import DiffusersAutoQuantizer
 from .base import DiffusersQuantizer
-from .quantization_config import QuantizationConfigMixin as DiffQuantConfigMixin
-
-
-try:
-    from transformers.utils.quantization_config import QuantizationConfigMixin as TransformersQuantConfigMixin
-except ImportError:
-
-    class TransformersQuantConfigMixin:
-        pass
-
-
-logger = logging.get_logger(__name__)
-
-
-class PipelineQuantizationConfig:
-    """
-    Configuration class to be used when applying quantization on-the-fly to [`~DiffusionPipeline.from_pretrained`].
-
-    Args:
-        quant_backend (`str`): Quantization backend to be used. When using this option, we assume that the backend
-            is available to both `diffusers` and `transformers`.
-        quant_kwargs (`dict`): Params to initialize the quantization backend class.
-        components_to_quantize (`list`): Components of a pipeline to be quantized.
-        quant_mapping (`dict`): Mapping defining the quantization specs to be used for the pipeline
-            components. When using this argument, users are not expected to provide `quant_backend`, `quant_kawargs`,
-            and `components_to_quantize`.
-    """
-
-    def __init__(
-        self,
-        quant_backend: str = None,
-        quant_kwargs: Dict[str, Union[str, float, int, dict]] = None,
-        components_to_quantize: Optional[List[str]] = None,
-        quant_mapping: Dict[str, Union[DiffQuantConfigMixin, "TransformersQuantConfigMixin"]] = None,
-    ):
-        self.quant_backend = quant_backend
-        # Initialize kwargs to be {} to set to the defaults.
-        self.quant_kwargs = quant_kwargs or {}
-        self.components_to_quantize = components_to_quantize
-        self.quant_mapping = quant_mapping
-
-        self.post_init()
-
-    def post_init(self):
-        quant_mapping = self.quant_mapping
-        self.is_granular = True if quant_mapping is not None else False
-
-        self._validate_init_args()
-
-    def _validate_init_args(self):
-        if self.quant_backend and self.quant_mapping:
-            raise ValueError("Both `quant_backend` and `quant_mapping` cannot be specified at the same time.")
-
-        if not self.quant_mapping and not self.quant_backend:
-            raise ValueError("Must provide a `quant_backend` when not providing a `quant_mapping`.")
-
-        if not self.quant_kwargs and not self.quant_mapping:
-            raise ValueError("Both `quant_kwargs` and `quant_mapping` cannot be None.")
-
-        if self.quant_backend is not None:
-            self._validate_init_kwargs_in_backends()
-
-        if self.quant_mapping is not None:
-            self._validate_quant_mapping_args()
-
-    def _validate_init_kwargs_in_backends(self):
-        quant_backend = self.quant_backend
-
-        self._check_backend_availability(quant_backend)
-
-        quant_config_mapping_transformers, quant_config_mapping_diffusers = self._get_quant_config_list()
-
-        if quant_config_mapping_transformers is not None:
-            init_kwargs_transformers = inspect.signature(quant_config_mapping_transformers[quant_backend].__init__)
-            init_kwargs_transformers = {name for name in init_kwargs_transformers.parameters if name != "self"}
-        else:
-            init_kwargs_transformers = None
-
-        init_kwargs_diffusers = inspect.signature(quant_config_mapping_diffusers[quant_backend].__init__)
-        init_kwargs_diffusers = {name for name in init_kwargs_diffusers.parameters if name != "self"}
-
-        if init_kwargs_transformers != init_kwargs_diffusers:
-            raise ValueError(
-                "The signatures of the __init__ methods of the quantization config classes in `diffusers` and `transformers` don't match. "
-                f"Please provide a `quant_mapping` instead, in the {self.__class__.__name__} class. Refer to [the docs](https://huggingface.co/docs/diffusers/main/en/quantization/overview#pipeline-level-quantization) to learn more about how "
-                "this mapping would look like."
-            )
-
-    def _validate_quant_mapping_args(self):
-        quant_mapping = self.quant_mapping
-        transformers_map, diffusers_map = self._get_quant_config_list()
-
-        available_transformers = list(transformers_map.values()) if transformers_map else None
-        available_diffusers = list(diffusers_map.values())
-
-        for module_name, config in quant_mapping.items():
-            if any(isinstance(config, cfg) for cfg in available_diffusers):
-                continue
-
-            if available_transformers and any(isinstance(config, cfg) for cfg in available_transformers):
-                continue
-
-            if available_transformers:
-                raise ValueError(
-                    f"Provided config for module_name={module_name} could not be found. "
-                    f"Available diffusers configs: {available_diffusers}; "
-                    f"Available transformers configs: {available_transformers}."
-                )
-            else:
-                raise ValueError(
-                    f"Provided config for module_name={module_name} could not be found. "
-                    f"Available diffusers configs: {available_diffusers}."
-                )
-
-    def _check_backend_availability(self, quant_backend: str):
-        quant_config_mapping_transformers, quant_config_mapping_diffusers = self._get_quant_config_list()
-
-        available_backends_transformers = (
-            list(quant_config_mapping_transformers.keys()) if quant_config_mapping_transformers else None
-        )
-        available_backends_diffusers = list(quant_config_mapping_diffusers.keys())
-
-        if (
-            available_backends_transformers and quant_backend not in available_backends_transformers
-        ) or quant_backend not in quant_config_mapping_diffusers:
-            error_message = f"Provided quant_backend={quant_backend} was not found."
-            if available_backends_transformers:
-                error_message += f"\nAvailable ones (transformers): {available_backends_transformers}."
-            error_message += f"\nAvailable ones (diffusers): {available_backends_diffusers}."
-            raise ValueError(error_message)
-
-    def _resolve_quant_config(self, is_diffusers: bool = True, module_name: str = None):
-        quant_config_mapping_transformers, quant_config_mapping_diffusers = self._get_quant_config_list()
-
-        quant_mapping = self.quant_mapping
-        components_to_quantize = self.components_to_quantize
-
-        # Granular case
-        if self.is_granular and module_name in quant_mapping:
-            logger.debug(f"Initializing quantization config class for {module_name}.")
-            config = quant_mapping[module_name]
-            return config
-
-        # Global config case
-        else:
-            should_quantize = False
-            # Only quantize the modules requested for.
-            if components_to_quantize and module_name in components_to_quantize:
-                should_quantize = True
-            # No specification for `components_to_quantize` means all modules should be quantized.
-            elif not self.is_granular and not components_to_quantize:
-                should_quantize = True
-
-            if should_quantize:
-                logger.debug(f"Initializing quantization config class for {module_name}.")
-                mapping_to_use = quant_config_mapping_diffusers if is_diffusers else quant_config_mapping_transformers
-                quant_config_cls = mapping_to_use[self.quant_backend]
-                quant_kwargs = self.quant_kwargs
-                return quant_config_cls(**quant_kwargs)
-
-        # Fallback: no applicable configuration found.
-        return None
-
-    def _get_quant_config_list(self):
-        if is_transformers_available():
-            from transformers.quantizers.auto import (
-                AUTO_QUANTIZATION_CONFIG_MAPPING as quant_config_mapping_transformers,
-            )
-        else:
-            quant_config_mapping_transformers = None
-
-        from ..quantizers.auto import AUTO_QUANTIZATION_CONFIG_MAPPING as quant_config_mapping_diffusers
-
-        return quant_config_mapping_transformers, quant_config_mapping_diffusers
+from .pipe_quant_config import PipelineQuantizationConfig
@@ -0,0 +1,202 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+from typing import Dict, List, Optional, Union
+
+from ..utils import is_transformers_available, logging
+from .quantization_config import QuantizationConfigMixin as DiffQuantConfigMixin
+
+
+try:
+    from transformers.utils.quantization_config import QuantizationConfigMixin as TransformersQuantConfigMixin
+except ImportError:
+
+    class TransformersQuantConfigMixin:
+        pass
+
+
+logger = logging.get_logger(__name__)
+
+
+class PipelineQuantizationConfig:
+    """
+    Configuration class to be used when applying quantization on-the-fly to [`~DiffusionPipeline.from_pretrained`].
+
+    Args:
+        quant_backend (`str`): Quantization backend to be used. When using this option, we assume that the backend
+            is available to both `diffusers` and `transformers`.
+        quant_kwargs (`dict`): Params to initialize the quantization backend class.
+        components_to_quantize (`list`): Components of a pipeline to be quantized.
+        quant_mapping (`dict`): Mapping defining the quantization specs to be used for the pipeline
+            components. When using this argument, users are not expected to provide `quant_backend`, `quant_kwargs`,
+            and `components_to_quantize`.
+    """
+
+    def __init__(
+        self,
+        quant_backend: Optional[str] = None,
+        quant_kwargs: Optional[Dict[str, Union[str, float, int, dict]]] = None,
+        components_to_quantize: Optional[List[str]] = None,
+        quant_mapping: Optional[Dict[str, Union[DiffQuantConfigMixin, "TransformersQuantConfigMixin"]]] = None,
+    ):
+        self.quant_backend = quant_backend
+        # Initialize kwargs to be {} to set to the defaults.
+        self.quant_kwargs = quant_kwargs or {}
+        self.components_to_quantize = components_to_quantize
+        self.quant_mapping = quant_mapping
+        self.config_mapping = {}  # Book-keeping, e.g. `{module_name: quant_config}`
+        self.post_init()
+
+    def post_init(self):
+        quant_mapping = self.quant_mapping
+        self.is_granular = quant_mapping is not None
+
+        self._validate_init_args()
+
+    def _validate_init_args(self):
+        if self.quant_backend and self.quant_mapping:
+            raise ValueError("Both `quant_backend` and `quant_mapping` cannot be specified at the same time.")
+
+        if not self.quant_mapping and not self.quant_backend:
+            raise ValueError("Must provide a `quant_backend` when not providing a `quant_mapping`.")
+
+        if not self.quant_kwargs and not self.quant_mapping:
+            raise ValueError("Both `quant_kwargs` and `quant_mapping` cannot be None.")
+
+        if self.quant_backend is not None:
+            self._validate_init_kwargs_in_backends()
+
+        if self.quant_mapping is not None:
+            self._validate_quant_mapping_args()
+
+    def _validate_init_kwargs_in_backends(self):
+        quant_backend = self.quant_backend
+
+        self._check_backend_availability(quant_backend)
+
+        quant_config_mapping_transformers, quant_config_mapping_diffusers = self._get_quant_config_list()
+
+        if quant_config_mapping_transformers is not None:
+            init_kwargs_transformers = inspect.signature(quant_config_mapping_transformers[quant_backend].__init__)
+            init_kwargs_transformers = {name for name in init_kwargs_transformers.parameters if name != "self"}
+        else:
+            init_kwargs_transformers = None
+
+        init_kwargs_diffusers = inspect.signature(quant_config_mapping_diffusers[quant_backend].__init__)
+        init_kwargs_diffusers = {name for name in init_kwargs_diffusers.parameters if name != "self"}
+
+        if init_kwargs_transformers != init_kwargs_diffusers:
+            raise ValueError(
+                "The signatures of the __init__ methods of the quantization config classes in `diffusers` and `transformers` don't match. "
+                f"Please provide a `quant_mapping` instead, in the {self.__class__.__name__} class. Refer to [the docs](https://huggingface.co/docs/diffusers/main/en/quantization/overview#pipeline-level-quantization) to learn more about what "
+                "this mapping looks like."
+            )
+
+    def _validate_quant_mapping_args(self):
+        quant_mapping = self.quant_mapping
+        transformers_map, diffusers_map = self._get_quant_config_list()
+
+        available_transformers = list(transformers_map.values()) if transformers_map else None
+        available_diffusers = list(diffusers_map.values())
+
+        for module_name, config in quant_mapping.items():
+            if any(isinstance(config, cfg) for cfg in available_diffusers):
+                continue
+
+            if available_transformers and any(isinstance(config, cfg) for cfg in available_transformers):
+                continue
+
+            if available_transformers:
+                raise ValueError(
+                    f"Provided config for module_name={module_name} could not be found. "
+                    f"Available diffusers configs: {available_diffusers}; "
+                    f"Available transformers configs: {available_transformers}."
+                )
+            else:
+                raise ValueError(
+                    f"Provided config for module_name={module_name} could not be found. "
+                    f"Available diffusers configs: {available_diffusers}."
+                )
+
+    def _check_backend_availability(self, quant_backend: str):
+        quant_config_mapping_transformers, quant_config_mapping_diffusers = self._get_quant_config_list()
+
+        available_backends_transformers = (
+            list(quant_config_mapping_transformers.keys()) if quant_config_mapping_transformers else None
+        )
+        available_backends_diffusers = list(quant_config_mapping_diffusers.keys())
+
+        if (
+            available_backends_transformers and quant_backend not in available_backends_transformers
+        ) or quant_backend not in quant_config_mapping_diffusers:
+            error_message = f"Provided quant_backend={quant_backend} was not found."
+            if available_backends_transformers:
+                error_message += f"\nAvailable ones (transformers): {available_backends_transformers}."
+            error_message += f"\nAvailable ones (diffusers): {available_backends_diffusers}."
+            raise ValueError(error_message)
+
+    def _resolve_quant_config(self, is_diffusers: bool = True, module_name: str = None):
+        quant_config_mapping_transformers, quant_config_mapping_diffusers = self._get_quant_config_list()
+
+        quant_mapping = self.quant_mapping
+        components_to_quantize = self.components_to_quantize
+
+        # Granular case
+        if self.is_granular and module_name in quant_mapping:
+            logger.debug(f"Initializing quantization config class for {module_name}.")
+            config = quant_mapping[module_name]
+            self.config_mapping.update({module_name: config})
+            return config
+
+        # Global config case
+        else:
+            should_quantize = False
+            # Only quantize the requested modules.
+            if components_to_quantize and module_name in components_to_quantize:
+                should_quantize = True
+            # No specification for `components_to_quantize` means all modules should be quantized.
+            elif not self.is_granular and not components_to_quantize:
+                should_quantize = True
+
+            if should_quantize:
+                logger.debug(f"Initializing quantization config class for {module_name}.")
+                mapping_to_use = quant_config_mapping_diffusers if is_diffusers else quant_config_mapping_transformers
+                quant_config_cls = mapping_to_use[self.quant_backend]
+                quant_kwargs = self.quant_kwargs
+                quant_obj = quant_config_cls(**quant_kwargs)
+                self.config_mapping.update({module_name: quant_obj})
+                return quant_obj
+
+        # Fallback: no applicable configuration found.
+        return None
+
+    def _get_quant_config_list(self):
+        if is_transformers_available():
+            from transformers.quantizers.auto import (
+                AUTO_QUANTIZATION_CONFIG_MAPPING as quant_config_mapping_transformers,
+            )
+        else:
+            quant_config_mapping_transformers = None
+
+        from ..quantizers.auto import AUTO_QUANTIZATION_CONFIG_MAPPING as quant_config_mapping_diffusers
+
+        return quant_config_mapping_transformers, quant_config_mapping_diffusers
+
+    def __repr__(self):
+        out = ""
+        config_mapping = dict(sorted(self.config_mapping.copy().items()))
+        for module_name, config in config_mapping.items():
+            out += f"{module_name} {config}"
+        return out
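
Review note: the precedence implemented by `_resolve_quant_config` above can be hard to read from the diff alone. The following is a minimal standalone sketch of that rule, not the library code. `resolve_config`, `global_config`, and the string stand-ins for config objects are hypothetical names introduced only for illustration, and the sketch simplifies the case where `quant_mapping` and `components_to_quantize` are both given.

```python
from typing import Any, Dict, List, Optional


def resolve_config(
    module_name: str,
    *,
    quant_mapping: Optional[Dict[str, Any]] = None,
    components_to_quantize: Optional[List[str]] = None,
    global_config: Any = None,
) -> Any:
    """Return the config for `module_name`, or None if the module stays unquantized.

    Hypothetical helper: `global_config` stands in for the object the real class
    builds from `quant_backend` and `quant_kwargs`.
    """
    # Granular case: an explicit per-module entry wins.
    if quant_mapping is not None and module_name in quant_mapping:
        return quant_mapping[module_name]

    # Global case: quantize the listed components, or every module when no list is given.
    if quant_mapping is None:
        if not components_to_quantize or module_name in components_to_quantize:
            return global_config

    # Fallback: no applicable configuration.
    return None


granular_hit = resolve_config("transformer", quant_mapping={"transformer": "cfg-a"})
granular_miss = resolve_config("vae", quant_mapping={"transformer": "cfg-a"})
listed = resolve_config("transformer", components_to_quantize=["transformer"], global_config="cfg-g")
unlisted = resolve_config("vae", components_to_quantize=["transformer"], global_config="cfg-g")
everything = resolve_config("vae", global_config="cfg-g")
```

Under these assumptions, a module missing from `quant_mapping` is left unquantized, while omitting `components_to_quantize` in the global case quantizes every module.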
@@ -1,4 +1,5 @@
 import functools
+import glob
 import importlib
 import importlib.metadata
 import inspect
@@ -18,7 +19,7 @@ from collections import UserDict
 from contextlib import contextmanager
 from io import BytesIO, StringIO
 from pathlib import Path
-from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union
+from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Set, Tuple, Union

 import numpy as np
 import PIL.Image
@@ -994,10 +995,10 @@ def pytest_terminal_summary_main(tr, id):
    config.option.tbstyle = orig_tbstyle


-# Copied from https://github.com/huggingface/transformers/blob/000e52aec8850d3fe2f360adc6fd256e5b47fe4c/src/transformers/testing_utils.py#L1905
+# Adapted from https://github.com/huggingface/transformers/blob/000e52aec8850d3fe2f360adc6fd256e5b47fe4c/src/transformers/testing_utils.py#L1905
 def is_flaky(max_attempts: int = 5, wait_before_retry: Optional[float] = None, description: Optional[str] = None):
    """
-    To decorate flaky tests. They will be retried on failures.
+    To decorate flaky tests (methods or entire classes). They will be retried on failures.

    Args:
        max_attempts (`int`, *optional*, defaults to 5):
@@ -1009,22 +1010,33 @@ def is_flaky(max_attempts: int = 5, wait_before_retry: Optional[float] = None, d
            etc.)
    """

-    def decorator(test_func_ref):
-        @functools.wraps(test_func_ref)
+    def decorator(obj):
+        # If decorating a class, wrap each test method on it
+        if inspect.isclass(obj):
+            for attr_name, attr_value in list(obj.__dict__.items()):
+                if callable(attr_value) and attr_name.startswith("test"):
+                    # recursively decorate the method
+                    setattr(obj, attr_name, decorator(attr_value))
+            return obj
+
+        # Otherwise we're decorating a single test function / method
+        @functools.wraps(obj)
        def wrapper(*args, **kwargs):
            retry_count = 1
-
            while retry_count < max_attempts:
                try:
-                    return test_func_ref(*args, **kwargs)
-
+                    return obj(*args, **kwargs)
                except Exception as err:
-                    print(f"Test failed with {err} at try {retry_count}/{max_attempts}.", file=sys.stderr)
+                    msg = (
+                        f"[FLAKY] {description or obj.__name__!r} "
+                        f"failed on attempt {retry_count}/{max_attempts}: {err}"
+                    )
+                    print(msg, file=sys.stderr)
                    if wait_before_retry is not None:
                        time.sleep(wait_before_retry)
                    retry_count += 1

-            return test_func_ref(*args, **kwargs)
+            return obj(*args, **kwargs)

        return wrapper
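
Review note: a minimal standalone sketch of the class-aware retry pattern introduced in this hunk, using only the standard library. `retry_flaky` and `DemoTests` are hypothetical names for illustration and are not part of `testing_utils`.

```python
import functools
import inspect
import sys
import time
from typing import Optional


def retry_flaky(max_attempts: int = 5, wait_before_retry: Optional[float] = None):
    """Retry a test function, or every `test*` method defined on a class, on failure."""

    def decorator(obj):
        if inspect.isclass(obj):
            # Only attributes in the class's own __dict__ are visited.
            for name, value in list(obj.__dict__.items()):
                if callable(value) and name.startswith("test"):
                    setattr(obj, name, decorator(value))
            return obj

        @functools.wraps(obj)
        def wrapper(*args, **kwargs):
            # The first max_attempts - 1 tries swallow the exception.
            for attempt in range(1, max_attempts):
                try:
                    return obj(*args, **kwargs)
                except Exception as err:
                    print(f"[FLAKY] {obj.__name__!r} failed on attempt {attempt}/{max_attempts}: {err}", file=sys.stderr)
                    if wait_before_retry is not None:
                        time.sleep(wait_before_retry)
            # The final try lets any exception propagate to the test runner.
            return obj(*args, **kwargs)

        return wrapper

    return decorator


@retry_flaky(max_attempts=3)
class DemoTests:
    calls = 0

    def test_passes_on_third_try(self):
        DemoTests.calls += 1
        if DemoTests.calls < 3:
            raise RuntimeError("transient failure")
        return "ok"

    def helper(self):
        return "not wrapped"


result = DemoTests().test_passes_on_third_try()
```

Because the class branch iterates over `obj.__dict__`, test methods inherited from a base class or mixin are not wrapped unless the decorated class overrides them. That may matter for test classes that take most of their tests from a shared mixin.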

@@ -1381,6 +1393,103 @@ if TYPE_CHECKING:
 else:
    DevicePropertiesUserDict = UserDict

+if is_torch_available():
+    from diffusers.hooks.group_offloading import (
+        _GROUP_ID_LAZY_LEAF,
+        _SUPPORTED_PYTORCH_LAYERS,
+        _compute_group_hash,
+        _find_parent_module_in_module_dict,
+        _gather_buffers_with_no_group_offloading_parent,
+        _gather_parameters_with_no_group_offloading_parent,
+    )
+
+    def _get_expected_safetensors_files(
+        module: torch.nn.Module,
+        offload_to_disk_path: str,
+        offload_type: str,
+        num_blocks_per_group: Optional[int] = None,
+    ) -> Set[str]:
+        expected_files = set()
+
+        def get_hashed_filename(group_id: str) -> str:
+            short_hash = _compute_group_hash(group_id)
+            return os.path.join(offload_to_disk_path, f"group_{short_hash}.safetensors")
+
+        if offload_type == "block_level":
+            if num_blocks_per_group is None:
+                raise ValueError("num_blocks_per_group must be provided for 'block_level' offloading.")
+
+            # Handle groups of ModuleList and Sequential blocks
+            unmatched_modules = []
+            for name, submodule in module.named_children():
+                if not isinstance(submodule, (torch.nn.ModuleList, torch.nn.Sequential)):
+                    unmatched_modules.append(module)
+                    continue
+
+                for i in range(0, len(submodule), num_blocks_per_group):
+                    current_modules = submodule[i : i + num_blocks_per_group]
+                    if not current_modules:
+                        continue
+                    group_id = f"{name}_{i}_{i + len(current_modules) - 1}"
+                    expected_files.add(get_hashed_filename(group_id))
+
+            # Handle the group for unmatched top-level modules and parameters
+            for module in unmatched_modules:
+                expected_files.add(get_hashed_filename(f"{module.__class__.__name__}_unmatched_group"))
+
+        elif offload_type == "leaf_level":
+            # Handle leaf-level module groups
+            for name, submodule in module.named_modules():
+                if isinstance(submodule, _SUPPORTED_PYTORCH_LAYERS):
+                    # These groups will always have parameters, so a file is expected
+                    expected_files.add(get_hashed_filename(name))
+
+            # Handle groups for non-leaf parameters/buffers
+            modules_with_group_offloading = {
+                name for name, sm in module.named_modules() if isinstance(sm, _SUPPORTED_PYTORCH_LAYERS)
+            }
+            parameters = _gather_parameters_with_no_group_offloading_parent(module, modules_with_group_offloading)
+            buffers = _gather_buffers_with_no_group_offloading_parent(module, modules_with_group_offloading)
+
+            all_orphans = parameters + buffers
+            if all_orphans:
+                parent_to_tensors = {}
+                module_dict = dict(module.named_modules())
+                for tensor_name, _ in all_orphans:
+                    parent_name = _find_parent_module_in_module_dict(tensor_name, module_dict)
+                    if parent_name not in parent_to_tensors:
+                        parent_to_tensors[parent_name] = []
+                    parent_to_tensors[parent_name].append(tensor_name)
+
+                for parent_name in parent_to_tensors:
+                    # A file is expected for each parent that gathers orphaned tensors
+                    expected_files.add(get_hashed_filename(parent_name))
+            expected_files.add(get_hashed_filename(_GROUP_ID_LAZY_LEAF))
+
+        else:
+            raise ValueError(f"Unsupported offload_type: {offload_type}")
+
+        return expected_files
+
+    def _check_safetensors_serialization(
+        module: torch.nn.Module,
+        offload_to_disk_path: str,
+        offload_type: str,
+        num_blocks_per_group: Optional[int] = None,
+    ) -> Tuple[bool, Optional[Set[str]], Optional[Set[str]]]:
+        if not os.path.isdir(offload_to_disk_path):
+            return False, None, None
+
+        expected_files = _get_expected_safetensors_files(
+            module, offload_to_disk_path, offload_type, num_blocks_per_group
+        )
+        actual_files = set(glob.glob(os.path.join(offload_to_disk_path, "*.safetensors")))
+        missing_files = expected_files - actual_files
+        extra_files = actual_files - expected_files
+
+        is_correct = not missing_files and not extra_files
+        return is_correct, extra_files, missing_files
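
Review note: the expected-versus-actual comparison in `_check_safetensors_serialization` reduces to two set differences. A minimal standalone sketch follows. `compare_files` and the file names are hypothetical (the real helper derives names of the form `group_<hash>.safetensors`).

```python
import glob
import os
import tempfile
from typing import Set, Tuple


def compare_files(directory: str, expected: Set[str]) -> Tuple[bool, Set[str], Set[str]]:
    """Return (is_correct, extra_files, missing_files) for `*.safetensors` in `directory`."""
    actual = set(glob.glob(os.path.join(directory, "*.safetensors")))
    missing = expected - actual  # expected on disk but absent
    extra = actual - expected  # present on disk but not expected
    is_correct = not missing and not extra
    return is_correct, extra, missing


with tempfile.TemporaryDirectory() as tmpdir:
    # Create two files on disk; only one of them is in the expected set.
    for name in ("group_a.safetensors", "group_b.safetensors"):
        open(os.path.join(tmpdir, name), "wb").close()
    expected = {
        os.path.join(tmpdir, "group_a.safetensors"),
        os.path.join(tmpdir, "group_c.safetensors"),
    }
    is_correct, extra, missing = compare_files(tmpdir, expected)
    extra_names = sorted(os.path.basename(p) for p in extra)
    missing_names = sorted(os.path.basename(p) for p in missing)
```

Returning both difference sets, rather than a single boolean, lets the caller report exactly which files are unexpected and which are absent.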
+

 class Expectations(DevicePropertiesUserDict):
    def get_expectation(self) -> Any:
@@ -175,6 +175,8 @@ def get_device():
        return "npu"
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
+    elif torch.backends.mps.is_available():
+        return "mps"
    else:
        return "cpu"

@@ -182,5 +184,14 @@ def get_device():
 def empty_device_cache(device_type: Optional[str] = None):
    if device_type is None:
        device_type = get_device()
+    if device_type == "cpu":
+        return
    device_mod = getattr(torch, device_type, torch.cuda)
    device_mod.empty_cache()
+
+
+def device_synchronize(device_type: Optional[str] = None):
+    if device_type is None:
+        device_type = get_device()
+    device_mod = getattr(torch, device_type, torch.cuda)
+    device_mod.synchronize()
@@ -46,6 +46,7 @@ from utils import PeftLoraLoaderMixinTests  # noqa: E402

@require_peft_backend
@skip_mps
+@is_flaky(max_attempts=10, description="very flaky class")
 class WanVACELoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
    pipeline_class = WanVACEPipeline
    scheduler_cls = FlowMatchEulerDiscreteScheduler
@@ -217,6 +218,5 @@ class WanVACELoRATests(unittest.TestCase, PeftLoraLoaderMixinTests):
                "Lora outputs should match.",
            )

-    @is_flaky
    def test_simple_inference_with_text_denoiser_lora_and_scale(self):
        super().test_simple_inference_with_text_denoiser_lora_and_scale()
@@ -2510,3 +2510,34 @@ class PeftLoraLoaderMixinTests:
                # materializes the test methods on invocation which cannot be overridden.
                return
        self._test_group_offloading_inference_denoiser(offload_type, use_stream)
+
+    @require_torch_accelerator
+    def test_lora_loading_model_cpu_offload(self):
+        components, _, denoiser_lora_config = self.get_dummy_components(self.scheduler_classes[0])
+        _, _, inputs = self.get_dummy_inputs(with_generator=False)
+        pipe = self.pipeline_class(**components)
+        pipe = pipe.to(torch_device)
+        pipe.set_progress_bar_config(disable=None)
+
+        denoiser = pipe.transformer if self.unet_kwargs is None else pipe.unet
+        denoiser.add_adapter(denoiser_lora_config)
+        self.assertTrue(check_if_lora_correctly_set(denoiser), "Lora not correctly set in denoiser.")
+
+        output_lora = pipe(**inputs, generator=torch.manual_seed(0))[0]
+
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            modules_to_save = self._get_modules_to_save(pipe, has_denoiser=True)
+            lora_state_dicts = self._get_lora_state_dicts(modules_to_save)
+            self.pipeline_class.save_lora_weights(
+                save_directory=tmpdirname, safe_serialization=True, **lora_state_dicts
+            )
+            # reinitialize the pipeline to mimic the inference workflow.
+            components, _, denoiser_lora_config = self.get_dummy_components(self.scheduler_classes[0])
+            pipe = self.pipeline_class(**components)
+            pipe.enable_model_cpu_offload(device=torch_device)
+            pipe.load_lora_weights(tmpdirname)
+            denoiser = pipe.transformer if self.unet_kwargs is None else pipe.unet
+            self.assertTrue(check_if_lora_correctly_set(denoiser), "Lora not correctly set in denoiser.")
+
+        output_lora_loaded = pipe(**inputs, generator=torch.manual_seed(0))[0]
+        self.assertTrue(np.allclose(output_lora, output_lora_loaded, atol=1e-3, rtol=1e-3))
@@ -61,6 +61,7 @@ from diffusers.utils import (
 from diffusers.utils.hub_utils import _add_variant
 from diffusers.utils.testing_utils import (
    CaptureLogger,
+    _check_safetensors_serialization,
    backend_empty_cache,
    backend_max_memory_allocated,
    backend_reset_peak_memory_stats,
@@ -1702,18 +1703,43 @@ class ModelTesterMixin:
        model.enable_layerwise_casting(storage_dtype=storage_dtype, compute_dtype=compute_dtype)
        _ = model(**inputs_dict)[0]

-    @parameterized.expand([(False, "block_level"), (True, "leaf_level")])
+    @parameterized.expand([("block_level", False), ("leaf_level", True)])
    @require_torch_accelerator
    @torch.no_grad()
-    def test_group_offloading_with_disk(self, record_stream, offload_type):
+    @torch.inference_mode()
+    def test_group_offloading_with_disk(self, offload_type, record_stream, atol=1e-5):
        if not self.model_class._supports_group_offloading:
            pytest.skip("Model does not support group offloading.")

-        torch.manual_seed(0)
+        def _has_generator_arg(model):
+            sig = inspect.signature(model.forward)
+            params = sig.parameters
+            return "generator" in params
+
+        def _run_forward(model, inputs_dict):
+            accepts_generator = _has_generator_arg(model)
+            if accepts_generator:
+                inputs_dict["generator"] = torch.manual_seed(0)
+            torch.manual_seed(0)
+            return model(**inputs_dict)[0]
+
+        if self.__class__.__name__ == "AutoencoderKLCosmosTests" and offload_type == "leaf_level":
+            pytest.skip("Fails with `leaf_level` as the offloading type. Needs investigation.")
+
        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()
+        torch.manual_seed(0)
+        model = self.model_class(**init_dict)
+
+        model.eval()
+        model.to(torch_device)
+        output_without_group_offloading = _run_forward(model, inputs_dict)
+
+        torch.manual_seed(0)
        model = self.model_class(**init_dict)
        model.eval()
-        additional_kwargs = {} if offload_type == "leaf_level" else {"num_blocks_per_group": 1}
+
+        num_blocks_per_group = None if offload_type == "leaf_level" else 1
+        additional_kwargs = {} if offload_type == "leaf_level" else {"num_blocks_per_group": num_blocks_per_group}
        with tempfile.TemporaryDirectory() as tmpdir:
            model.enable_group_offload(
                torch_device,
@@ -1724,8 +1750,25 @@ class ModelTesterMixin:
                **additional_kwargs,
            )
            has_safetensors = glob.glob(f"{tmpdir}/*.safetensors")
-            self.assertTrue(len(has_safetensors) > 0, "No safetensors found in the offload directory.")
-            _ = model(**inputs_dict)[0]
+            self.assertTrue(has_safetensors, "No safetensors found in the directory.")
+
+            # For "leaf-level", there is a prefetching hook which makes this check a bit non-deterministic
+            # in nature. So, skip it.
+            if offload_type != "leaf_level":
+                is_correct, extra_files, missing_files = _check_safetensors_serialization(
+                    module=model,
+                    offload_to_disk_path=tmpdir,
+                    offload_type=offload_type,
+                    num_blocks_per_group=num_blocks_per_group,
+                )
+                if not is_correct:
+                    if extra_files:
+                        raise ValueError(f"Found extra files: {', '.join(extra_files)}")
+                    elif missing_files:
+                        raise ValueError(f"Following files are missing: {', '.join(missing_files)}")
+
+            output_with_group_offloading = _run_forward(model, inputs_dict)
+            self.assertTrue(torch.allclose(output_without_group_offloading, output_with_group_offloading, atol=atol))

    def test_auto_model(self, expected_max_diff=5e-5):
        if self.forward_requires_fresh_args:
@@ -155,7 +155,7 @@ class FluxPipelineFastTests(

        # Outputs should be different here
        # For some reasons, they don't show large differences
-        assert max_diff > 1e-6
+        self.assertGreater(max_diff, 1e-6, "Outputs should be different for different prompts.")

    def test_fused_qkv_projections(self):
        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
@@ -187,14 +187,17 @@ class FluxPipelineFastTests(
        image = pipe(**inputs).images
        image_slice_disabled = image[0, -3:, -3:, -1]

-        assert np.allclose(original_image_slice, image_slice_fused, atol=1e-3, rtol=1e-3), (
-            "Fusion of QKV projections shouldn't affect the outputs."
+        self.assertTrue(
+            np.allclose(original_image_slice, image_slice_fused, atol=1e-3, rtol=1e-3),
+            "Fusion of QKV projections shouldn't affect the outputs.",
        )
-        assert np.allclose(image_slice_fused, image_slice_disabled, atol=1e-3, rtol=1e-3), (
-            "Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled."
+        self.assertTrue(
+            np.allclose(image_slice_fused, image_slice_disabled, atol=1e-3, rtol=1e-3),
+            "Outputs, with QKV projection fusion enabled, shouldn't change when fused QKV projections are disabled.",
        )
-        assert np.allclose(original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2), (
-            "Original outputs should match when fused QKV projections are disabled."
+        self.assertTrue(
+            np.allclose(original_image_slice, image_slice_disabled, atol=1e-2, rtol=1e-2),
+            "Original outputs should match when fused QKV projections are disabled.",
        )

    def test_flux_image_output_shape(self):
@@ -209,7 +212,11 @@ class FluxPipelineFastTests(
            inputs.update({"height": height, "width": width})
            image = pipe(**inputs).images[0]
            output_height, output_width, _ = image.shape
-            assert (output_height, output_width) == (expected_height, expected_width)
+            self.assertEqual(
+                (output_height, output_width),
+                (expected_height, expected_width),
+                f"Output shape {image.shape} does not match expected shape {(expected_height, expected_width)}",
+            )

    def test_flux_true_cfg(self):
        pipe = self.pipeline_class(**self.get_dummy_components()).to(torch_device)
@@ -220,7 +227,9 @@ class FluxPipelineFastTests(
        inputs["negative_prompt"] = "bad quality"
        inputs["true_cfg_scale"] = 2.0
        true_cfg_out = pipe(**inputs, generator=torch.manual_seed(0)).images[0]
-        assert not np.allclose(no_true_cfg_out, true_cfg_out)
+        self.assertFalse(
+            np.allclose(no_true_cfg_out, true_cfg_out), "Outputs should be different when true_cfg_scale is set."
+        )


@nightly
@@ -269,45 +278,17 @@ class FluxPipelineSlowTests(unittest.TestCase):

        image = pipe(**inputs).images[0]
        image_slice = image[0, :10, :10]
+        # fmt: off
        expected_slice = np.array(
-            [
-                0.3242,
-                0.3203,
-                0.3164,
-                0.3164,
-                0.3125,
-                0.3125,
-                0.3281,
-                0.3242,
-                0.3203,
-                0.3301,
-                0.3262,
-                0.3242,
-                0.3281,
-                0.3242,
-                0.3203,
-                0.3262,
-                0.3262,
-                0.3164,
-                0.3262,
-                0.3281,
-                0.3184,
-                0.3281,
-                0.3281,
-                0.3203,
-                0.3281,
-                0.3281,
-                0.3164,
-                0.3320,
-                0.3320,
-                0.3203,
-            ],
+            [0.3242, 0.3203, 0.3164, 0.3164, 0.3125, 0.3125, 0.3281, 0.3242, 0.3203, 0.3301, 0.3262, 0.3242, 0.3281, 0.3242, 0.3203, 0.3262, 0.3262, 0.3164, 0.3262, 0.3281, 0.3184, 0.3281, 0.3281, 0.3203, 0.3281, 0.3281, 0.3164, 0.3320, 0.3320, 0.3203],
            dtype=np.float32,
        )
+        # fmt: on

        max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), image_slice.flatten())
-
-        assert max_diff < 1e-4
+        self.assertLess(
+            max_diff, 1e-4, f"Image slice is different from expected slice: {image_slice} != {expected_slice}"
+        )


@slow
@@ -377,42 +358,14 @@ class FluxIPAdapterPipelineSlowTests(unittest.TestCase):
        image = pipe(**inputs).images[0]
        image_slice = image[0, :10, :10]

+        # fmt: off
        expected_slice = np.array(
-            [
-                0.1855,
-                0.1680,
-                0.1406,
-                0.1953,
-                0.1699,
-                0.1465,
-                0.2012,
-                0.1738,
-                0.1484,
-                0.2051,
-                0.1797,
-                0.1523,
-                0.2012,
-                0.1719,
-                0.1445,
-                0.2070,
-                0.1777,
-                0.1465,
-                0.2090,
-                0.1836,
-                0.1484,
-                0.2129,
-                0.1875,
-                0.1523,
-                0.2090,
-                0.1816,
-                0.1484,
-                0.2110,
-                0.1836,
-                0.1543,
-            ],
+            [0.1855, 0.1680, 0.1406, 0.1953, 0.1699, 0.1465, 0.2012, 0.1738, 0.1484, 0.2051, 0.1797, 0.1523, 0.2012, 0.1719, 0.1445, 0.2070, 0.1777, 0.1465, 0.2090, 0.1836, 0.1484, 0.2129, 0.1875, 0.1523, 0.2090, 0.1816, 0.1484, 0.2110, 0.1836, 0.1543],
            dtype=np.float32,
        )
+        # fmt: on

        max_diff = numpy_cosine_similarity_distance(expected_slice.flatten(), image_slice.flatten())
-
-        assert max_diff < 1e-4, f"{image_slice} != {expected_slice}"
+        self.assertLess(
+            max_diff, 1e-4, f"Image slice is different from expected slice: {image_slice} != {expected_slice}"
+        )
@@ -1,147 +0,0 @@
-# coding=utf-8
-# Copyright 2025 HuggingFace Inc.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import gc
-import unittest
-
-import numpy as np
-import torch
-
-from diffusers import StableDiffusionKDiffusionPipeline
-from diffusers.utils.testing_utils import (
-    backend_empty_cache,
-    enable_full_determinism,
-    nightly,
-    require_torch_accelerator,
-    torch_device,
-)
-
-
-enable_full_determinism()
-
-
-@nightly
-@require_torch_accelerator
-class StableDiffusionPipelineIntegrationTests(unittest.TestCase):
-    def setUp(self):
-        # clean up the VRAM before each test
-        super().setUp()
-        gc.collect()
-        backend_empty_cache(torch_device)
-
-    def tearDown(self):
-        # clean up the VRAM after each test
-        super().tearDown()
-        gc.collect()
-        backend_empty_cache(torch_device)
-
-    def test_stable_diffusion_1(self):
-        sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_euler")
-
-        prompt = "A painting of a squirrel eating a burger"
-        generator = torch.manual_seed(0)
-        output = sd_pipe([prompt], generator=generator, guidance_scale=9.0, num_inference_steps=20, output_type="np")
-
-        image = output.images
-
-        image_slice = image[0, -3:, -3:, -1]
-
-        assert image.shape == (1, 512, 512, 3)
-        expected_slice = np.array([0.0447, 0.0492, 0.0468, 0.0408, 0.0383, 0.0408, 0.0354, 0.0380, 0.0339])
-
-        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
-
-    def test_stable_diffusion_2(self):
-        sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_euler")
-
-        prompt = "A painting of a squirrel eating a burger"
-        generator = torch.manual_seed(0)
-        output = sd_pipe([prompt], generator=generator, guidance_scale=9.0, num_inference_steps=20, output_type="np")
-
-        image = output.images
-
-        image_slice = image[0, -3:, -3:, -1]
-
-        assert image.shape == (1, 512, 512, 3)
-        expected_slice = np.array([0.1237, 0.1320, 0.1438, 0.1359, 0.1390, 0.1132, 0.1277, 0.1175, 0.1112])
-
-        assert np.abs(image_slice.flatten() - expected_slice).max() < 5e-1
-
-    def test_stable_diffusion_karras_sigmas(self):
-        sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_dpmpp_2m")
-
-        prompt = "A painting of a squirrel eating a burger"
-        generator = torch.manual_seed(0)
-        output = sd_pipe(
-            [prompt],
-            generator=generator,
-            guidance_scale=7.5,
-            num_inference_steps=15,
-            output_type="np",
-            use_karras_sigmas=True,
-        )
-
-        image = output.images
-
-        image_slice = image[0, -3:, -3:, -1]
-
-        assert image.shape == (1, 512, 512, 3)
-        expected_slice = np.array(
-            [0.11381689, 0.12112921, 0.1389457, 0.12549606, 0.1244964, 0.10831517, 0.11562866, 0.10867816, 0.10499048]
-        )
-
-        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
-
-    def test_stable_diffusion_noise_sampler_seed(self):
-        sd_pipe = StableDiffusionKDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_dpmpp_sde")
-
-        prompt = "A painting of a squirrel eating a burger"
-        seed = 0
-        images1 = sd_pipe(
-            [prompt],
-            generator=torch.manual_seed(seed),
-            noise_sampler_seed=seed,
-            guidance_scale=9.0,
-            num_inference_steps=20,
-            output_type="np",
-        ).images
-        images2 = sd_pipe(
-            [prompt],
-            generator=torch.manual_seed(seed),
-            noise_sampler_seed=seed,
-            guidance_scale=9.0,
-            num_inference_steps=20,
-            output_type="np",
-        ).images
-
-        assert images1.shape == (1, 512, 512, 3)
-        assert images2.shape == (1, 512, 512, 3)
-        assert np.abs(images1.flatten() - images2.flatten()).max() < 1e-2
@@ -1,178 +0,0 @@
-# coding=utf-8
-# Copyright 2025 HuggingFace Inc.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import gc
-import unittest
-
-import numpy as np
-import torch
-
-from diffusers import StableDiffusionXLKDiffusionPipeline
-from diffusers.utils.testing_utils import (
-    Expectations,
-    backend_empty_cache,
-    enable_full_determinism,
-    require_torch_accelerator,
-    slow,
-    torch_device,
-)
-
-
-enable_full_determinism()
-
-
-@slow
-@require_torch_accelerator
-class StableDiffusionXLKPipelineIntegrationTests(unittest.TestCase):
-    dtype = torch.float16
-
-    def setUp(self):
-        # clean up the VRAM before each test
-        super().setUp()
-        gc.collect()
-        backend_empty_cache(torch_device)
-
-    def tearDown(self):
-        # clean up the VRAM after each test
-        super().tearDown()
-        gc.collect()
-        backend_empty_cache(torch_device)
-
-    def test_stable_diffusion_xl(self):
-        sd_pipe = StableDiffusionXLKDiffusionPipeline.from_pretrained(
-            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=self.dtype
-        )
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_euler")
-
-        prompt = "A painting of a squirrel eating a burger"
-        generator = torch.manual_seed(0)
-        output = sd_pipe(
-            [prompt],
-            generator=generator,
-            guidance_scale=9.0,
-            num_inference_steps=2,
-            height=512,
-            width=512,
-            output_type="np",
-        )
-
-        image = output.images
-
-        image_slice = image[0, -3:, -3:, -1]
-
-        assert image.shape == (1, 512, 512, 3)
-        expected_slice = np.array([0.5420, 0.5038, 0.2439, 0.5371, 0.4660, 0.1906, 0.5221, 0.4290, 0.2566])
-
-        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
-
-    def test_stable_diffusion_karras_sigmas(self):
-        sd_pipe = StableDiffusionXLKDiffusionPipeline.from_pretrained(
-            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=self.dtype
-        )
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_dpmpp_2m")
-
-        prompt = "A painting of a squirrel eating a burger"
-        generator = torch.manual_seed(0)
-        output = sd_pipe(
-            [prompt],
-            generator=generator,
-            guidance_scale=7.5,
-            num_inference_steps=2,
-            output_type="np",
-            use_karras_sigmas=True,
-            height=512,
-            width=512,
-        )
-
-        image = output.images
-
-        image_slice = image[0, -3:, -3:, -1]
-
-        assert image.shape == (1, 512, 512, 3)
-        expected_slices = Expectations(
-            {
-                ("xpu", 3): np.array(
-                    [
-                        0.6128,
-                        0.6108,
-                        0.6109,
-                        0.5997,
-                        0.5988,
-                        0.5948,
-                        0.5903,
-                        0.597,
-                        0.5973,
-                    ]
-                ),
-                ("cuda", 7): np.array(
-                    [
-                        0.6418,
-                        0.6424,
-                        0.6462,
-                        0.6271,
-                        0.6314,
-                        0.6295,
-                        0.6249,
-                        0.6339,
-                        0.6335,
-                    ]
-                ),
-            }
-        )
-
-        expected_slice = expected_slices.get_expectation()
-
-        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
-
-    def test_stable_diffusion_noise_sampler_seed(self):
-        sd_pipe = StableDiffusionXLKDiffusionPipeline.from_pretrained(
-            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=self.dtype
-        )
-        sd_pipe = sd_pipe.to(torch_device)
-        sd_pipe.set_progress_bar_config(disable=None)
-
-        sd_pipe.set_scheduler("sample_dpmpp_sde")
-
-        prompt = "A painting of a squirrel eating a burger"
-        seed = 0
-        images1 = sd_pipe(
-            [prompt],
-            generator=torch.manual_seed(seed),
-            noise_sampler_seed=seed,
-            guidance_scale=9.0,
-            num_inference_steps=2,
-            output_type="np",
-            height=512,
-            width=512,
-        ).images
-        images2 = sd_pipe(
-            [prompt],
-            generator=torch.manual_seed(seed),
-            noise_sampler_seed=seed,
-            guidance_scale=9.0,
-            num_inference_steps=2,
-            output_type="np",
-            height=512,
-            width=512,
-        ).images
-        assert images1.shape == (1, 512, 512, 3)
-        assert images2.shape == (1, 512, 512, 3)
-        assert np.abs(images1.flatten() - images2.flatten()).max() < 1e-2
@@ -872,11 +872,12 @@ class ExtendedSerializationTest(BaseBnb4BitSerializationTests):


@require_torch_version_greater("2.7.1")
-class Bnb4BitCompileTests(QuantCompileTests):
+@require_bitsandbytes_version_greater("0.45.5")
+class Bnb4BitCompileTests(QuantCompileTests, unittest.TestCase):
    @property
    def quantization_config(self):
        return PipelineQuantizationConfig(
-            quant_backend="bitsandbytes_8bit",
+            quant_backend="bitsandbytes_4bit",
            quant_kwargs={
                "load_in_4bit": True,
                "bnb_4bit_quant_type": "nf4",
@@ -887,12 +888,7 @@ class Bnb4BitCompileTests(QuantCompileTests):

    def test_torch_compile(self):
        torch._dynamo.config.capture_dynamic_output_shape_ops = True
-        super()._test_torch_compile(quantization_config=self.quantization_config)
-
-    def test_torch_compile_with_cpu_offload(self):
-        super()._test_torch_compile_with_cpu_offload(quantization_config=self.quantization_config)
+        super().test_torch_compile()

    def test_torch_compile_with_group_offload_leaf(self):
-        super()._test_torch_compile_with_group_offload_leaf(
-            quantization_config=self.quantization_config, use_stream=True
-        )
+        super()._test_torch_compile_with_group_offload_leaf(use_stream=True)
@@ -837,7 +837,8 @@ class BaseBnb8bitSerializationTests(Base8bitTests):


@require_torch_version_greater_equal("2.6.0")
-class Bnb8BitCompileTests(QuantCompileTests):
+@require_bitsandbytes_version_greater("0.45.5")
+class Bnb8BitCompileTests(QuantCompileTests, unittest.TestCase):
    @property
    def quantization_config(self):
        return PipelineQuantizationConfig(
@@ -848,15 +849,11 @@ class Bnb8BitCompileTests(QuantCompileTests):

    def test_torch_compile(self):
        torch._dynamo.config.capture_dynamic_output_shape_ops = True
-        super()._test_torch_compile(quantization_config=self.quantization_config, torch_dtype=torch.float16)
+        super()._test_torch_compile(torch_dtype=torch.float16)

    def test_torch_compile_with_cpu_offload(self):
-        super()._test_torch_compile_with_cpu_offload(
-            quantization_config=self.quantization_config, torch_dtype=torch.float16
-        )
+        super()._test_torch_compile_with_cpu_offload(torch_dtype=torch.float16)

    @pytest.mark.xfail(reason="Test fails because of an offloading problem from Accelerate with confusion in hooks.")
    def test_torch_compile_with_group_offload_leaf(self):
-        super()._test_torch_compile_with_group_offload_leaf(
-            quantization_config=self.quantization_config, torch_dtype=torch.float16, use_stream=True
-        )
+        super()._test_torch_compile_with_group_offload_leaf(torch_dtype=torch.float16, use_stream=True)
@@ -8,6 +8,7 @@ import torch.nn as nn
 from diffusers import (
    AuraFlowPipeline,
    AuraFlowTransformer2DModel,
+    DiffusionPipeline,
    FluxControlPipeline,
    FluxPipeline,
    FluxTransformer2DModel,
@@ -32,9 +33,12 @@ from diffusers.utils.testing_utils import (
    require_big_accelerator,
    require_gguf_version_greater_or_equal,
    require_peft_backend,
+    require_torch_version_greater,
    torch_device,
 )

+from ..test_torch_compile_utils import QuantCompileTests
+

 if is_gguf_available():
    from diffusers.quantizers.gguf.utils import GGUFLinear, GGUFParameter
@@ -647,3 +651,22 @@ class WanVACEGGUFSingleFileTests(GGUFSingleFileTesterMixin, unittest.TestCase):
            ).to(torch_device, self.torch_dtype),
            "timestep": torch.tensor([1]).to(torch_device, self.torch_dtype),
        }
+
+
+@require_torch_version_greater("2.7.1")
+class GGUFCompileTests(QuantCompileTests, unittest.TestCase):
+    torch_dtype = torch.bfloat16
+    gguf_ckpt = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
+
+    @property
+    def quantization_config(self):
+        return GGUFQuantizationConfig(compute_dtype=self.torch_dtype)
+
+    def _init_pipeline(self, *args, **kwargs):
+        transformer = FluxTransformer2DModel.from_single_file(
+            self.gguf_ckpt, quantization_config=self.quantization_config, torch_dtype=self.torch_dtype
+        )
+        pipe = DiffusionPipeline.from_pretrained(
+            "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=self.torch_dtype
+        )
+        return pipe
@@ -12,13 +12,14 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import json
 import tempfile
 import unittest

 import torch
 from parameterized import parameterized

-from diffusers import DiffusionPipeline, QuantoConfig
+from diffusers import BitsAndBytesConfig, DiffusionPipeline, QuantoConfig
 from diffusers.quantizers import PipelineQuantizationConfig
 from diffusers.utils import logging
 from diffusers.utils.testing_utils import (
@@ -243,3 +244,57 @@ class PipelineQuantizationTests(unittest.TestCase):
        for name, component in pipe.components.items():
            if isinstance(component, torch.nn.Module):
                self.assertTrue(not hasattr(component.config, "quantization_config"))
+
+    @parameterized.expand(["quant_kwargs", "quant_mapping"])
+    def test_quant_config_repr(self, method):
+        component_name = "transformer"
+        if method == "quant_kwargs":
+            components_to_quantize = [component_name]
+            quant_config = PipelineQuantizationConfig(
+                quant_backend="bitsandbytes_8bit",
+                quant_kwargs={"load_in_8bit": True},
+                components_to_quantize=components_to_quantize,
+            )
+        else:
+            quant_config = PipelineQuantizationConfig(
+                quant_mapping={component_name: BitsAndBytesConfig(load_in_8bit=True)}
+            )
+
+        pipe = DiffusionPipeline.from_pretrained(
+            self.model_name,
+            quantization_config=quant_config,
+            torch_dtype=torch.bfloat16,
+        )
+        self.assertTrue(getattr(pipe, "quantization_config", None) is not None)
+        retrieved_config = pipe.quantization_config
+        expected_config = """
+transformer BitsAndBytesConfig {
+  "_load_in_4bit": false,
+  "_load_in_8bit": true,
+  "bnb_4bit_compute_dtype": "float32",
+  "bnb_4bit_quant_storage": "uint8",
+  "bnb_4bit_quant_type": "fp4",
+  "bnb_4bit_use_double_quant": false,
+  "llm_int8_enable_fp32_cpu_offload": false,
+  "llm_int8_has_fp16_weight": false,
+  "llm_int8_skip_modules": null,
+  "llm_int8_threshold": 6.0,
+  "load_in_4bit": false,
+  "load_in_8bit": true,
+  "quant_method": "bitsandbytes"
+}
+
+"""
+        expected_data = self._parse_config_string(expected_config)
+        actual_data = self._parse_config_string(str(retrieved_config))
+        self.assertTrue(actual_data == expected_data)
+
+    def _parse_config_string(self, config_string: str) -> dict:
+        first_brace = config_string.find("{")
+        if first_brace == -1:
+            raise ValueError("Could not find opening brace '{' in the string.")
+
+        json_part = config_string[first_brace:]
+        data = json.loads(json_part)
+
+        return data
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import gc
-import unittest
+import inspect

 import torch

@@ -23,7 +23,7 @@ from diffusers.utils.testing_utils import backend_empty_cache, require_torch_gpu

@require_torch_gpu
@slow
-class QuantCompileTests(unittest.TestCase):
+class QuantCompileTests:
    @property
    def quantization_config(self):
        raise NotImplementedError(
@@ -50,30 +50,26 @@ class QuantCompileTests(unittest.TestCase):
        )
        return pipe

-    def _test_torch_compile(self, quantization_config, torch_dtype=torch.bfloat16):
-        pipe = self._init_pipeline(quantization_config, torch_dtype).to("cuda")
-        # import to ensure fullgraph True
+    def _test_torch_compile(self, torch_dtype=torch.bfloat16):
+        pipe = self._init_pipeline(self.quantization_config, torch_dtype).to("cuda")
+        # `fullgraph=True` ensures no graph breaks
        pipe.transformer.compile(fullgraph=True)

-        for _ in range(2):
-            # small resolutions to ensure speedy execution.
-            pipe("a dog", num_inference_steps=3, max_sequence_length=16, height=256, width=256)
+        # small resolutions to ensure speedy execution.
+        pipe("a dog", num_inference_steps=2, max_sequence_length=16, height=256, width=256)

-    def _test_torch_compile_with_cpu_offload(self, quantization_config, torch_dtype=torch.bfloat16):
-        pipe = self._init_pipeline(quantization_config, torch_dtype)
+    def _test_torch_compile_with_cpu_offload(self, torch_dtype=torch.bfloat16):
+        pipe = self._init_pipeline(self.quantization_config, torch_dtype)
        pipe.enable_model_cpu_offload()
        pipe.transformer.compile()

-        for _ in range(2):
-            # small resolutions to ensure speedy execution.
-            pipe("a dog", num_inference_steps=3, max_sequence_length=16, height=256, width=256)
+        # small resolutions to ensure speedy execution.
+        pipe("a dog", num_inference_steps=2, max_sequence_length=16, height=256, width=256)

-    def _test_torch_compile_with_group_offload_leaf(
-        self, quantization_config, torch_dtype=torch.bfloat16, *, use_stream: bool = False
-    ):
-        torch._dynamo.config.cache_size_limit = 10000
+    def _test_torch_compile_with_group_offload_leaf(self, torch_dtype=torch.bfloat16, *, use_stream: bool = False):
+        torch._dynamo.config.cache_size_limit = 1000

-        pipe = self._init_pipeline(quantization_config, torch_dtype)
+        pipe = self._init_pipeline(self.quantization_config, torch_dtype)
        group_offload_kwargs = {
            "onload_device": torch.device("cuda"),
            "offload_device": torch.device("cpu"),
@@ -87,6 +83,17 @@ class QuantCompileTests(unittest.TestCase):
                if torch.device(component.device).type == "cpu":
                    component.to("cuda")

-        for _ in range(2):
-            # small resolutions to ensure speedy execution.
-            pipe("a dog", num_inference_steps=3, max_sequence_length=16, height=256, width=256)
+        # small resolutions to ensure speedy execution.
+        pipe("a dog", num_inference_steps=2, max_sequence_length=16, height=256, width=256)
+
+    def test_torch_compile(self):
+        self._test_torch_compile()
+
+    def test_torch_compile_with_cpu_offload(self):
+        self._test_torch_compile_with_cpu_offload()
+
+    def test_torch_compile_with_group_offload_leaf(self, use_stream=False):
+        for cls in inspect.getmro(self.__class__):
+            if "test_torch_compile_with_group_offload_leaf" in cls.__dict__ and cls is not QuantCompileTests:
+                return
+        self._test_torch_compile_with_group_offload_leaf(use_stream=use_stream)
@@ -630,7 +630,7 @@ class TorchAoSerializationTest(unittest.TestCase):


@require_torchao_version_greater_or_equal("0.7.0")
-class TorchAoCompileTest(QuantCompileTests):
+class TorchAoCompileTest(QuantCompileTests, unittest.TestCase):
    @property
    def quantization_config(self):
        return PipelineQuantizationConfig(
@@ -639,17 +639,15 @@ class TorchAoCompileTest(QuantCompileTests):
            },
        )

-    def test_torch_compile(self):
-        super()._test_torch_compile(quantization_config=self.quantization_config)
-
    @unittest.skip(
        "Changing the device of AQT tensor with module._apply (called from doing module.to() in accelerate) does not work "
        "when compiling."
    )
    def test_torch_compile_with_cpu_offload(self):
        # RuntimeError: _apply(): Couldn't swap Linear.weight
-        super()._test_torch_compile_with_cpu_offload(quantization_config=self.quantization_config)
+        super().test_torch_compile_with_cpu_offload()

+    @parameterized.expand([False, True])
    @unittest.skip(
        """
        For `use_stream=False`:
@@ -659,8 +657,7 @@ class TorchAoCompileTest(QuantCompileTests):
            Using non-default stream requires ability to pin tensors. AQT does not seem to support this yet in TorchAO.
        """
    )
-    @parameterized.expand([False, True])
-    def test_torch_compile_with_group_offload_leaf(self):
+    def test_torch_compile_with_group_offload_leaf(self, use_stream):
        # For use_stream=False:
        # If we run group offloading without compilation, we will see:
        #   RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "cuda:0".  This is no longer allowed; the devices must match.
@@ -673,7 +670,7 @@ class TorchAoCompileTest(QuantCompileTests):

        # For use_stream=True:
        # NotImplementedError: AffineQuantizedTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.is_pinned', overload='default')>, types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), arg_types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), kwarg_types={}
-        super()._test_torch_compile_with_group_offload_leaf(quantization_config=self.quantization_config)
+        super()._test_torch_compile_with_group_offload_leaf(use_stream=use_stream)


 # Slices for these tests have been obtained on our aws-g6e-xlarge-plus runners
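
The hunks above turn `QuantCompileTests` into a plain class and have each backend test class inherit from both it and `unittest.TestCase`. A minimal sketch of that mixin pattern follows. The names `CompileChecksMixin`, `DummyBackendCompileTests`, and `run_case`, along with the placeholder config dict, are hypothetical and not part of diffusers; the sketch only illustrates why the shared base is not itself collected as a test case.

```python
import io
import unittest


class CompileChecksMixin:
    """Shared checks. Not a TestCase, so a test runner never collects it directly."""

    @property
    def quantization_config(self):
        raise NotImplementedError("Subclasses must define `quantization_config`.")

    def test_config_is_defined(self):
        # `assertIsNotNone` is supplied by unittest.TestCase in the concrete subclass.
        self.assertIsNotNone(self.quantization_config)


class DummyBackendCompileTests(CompileChecksMixin, unittest.TestCase):
    @property
    def quantization_config(self):
        # Placeholder value for illustration, not a real diffusers config object.
        return {"backend": "dummy"}


def run_case(case_cls):
    """Load and run every test method of `case_cls`, keeping runner output off stdout."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case_cls)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_case(DummyBackendCompileTests)
print(result.testsRun, result.wasSuccessful())
```

Because the mixin does not subclass `unittest.TestCase`, its inherited test method runs only through a concrete subclass, which is presumably why the diff adds `unittest.TestCase` to each backend class.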
Author	SHA1	Message	Date
DN6	532b395718	update	2025-07-17 21:56:48 +05:30
DN6	5c43924ac2	update	2025-07-17 19:57:45 +05:30
DN6	a633289e10	update	2025-07-16 19:41:48 +05:30
Aryan	06fd427797	[tests] Improve Flux tests (#11919) update	2025-07-15 10:47:41 +05:30
dependabot[bot]	48a551251d	Bump aiohttp from 3.10.10 to 3.12.14 in /examples/server (#11924) Bumps [aiohttp](https://github.com/aio-libs/aiohttp) from 3.10.10 to 3.12.14. - [Release notes](https://github.com/aio-libs/aiohttp/releases) - [Changelog](https://github.com/aio-libs/aiohttp/blob/master/CHANGES.rst) - [Commits](https://github.com/aio-libs/aiohttp/compare/v3.10.10...v3.12.14) --- updated-dependencies: - dependency-name: aiohttp dependency-version: 3.12.14 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-15 09:15:57 +05:30
Hengyue-Bi	6398fbc391	Fix: Align VAE processing in ControlNet SD3 training with inference (#11909) Fix: Apply vae_shift_factor in ControlNet SD3 training	2025-07-14 14:54:38 -04:00
Colle	3c8b67b371	Flux: pass joint_attention_kwargs when using gradient_checkpointing (#11814) Flux: pass joint_attention_kwargs when gradient_checkpointing	2025-07-11 08:35:18 -10:00
Steven Liu	9feb946432	[docs] torch.compile blog post (#11837) * add blog post * feedback * feedback	2025-07-11 10:29:40 -07:00
Aryan	c90352754a	Speedup model loading by 4-5x ⚡ (#11904) * update * update * update * pin accelerate version * add comment explanations * update docstring * make style * non_blocking does not matter for dtype cast * _empty_cache -> clear_cache * update * Update src/diffusers/models/model_loading_utils.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Update src/diffusers/models/model_loading_utils.py --------- Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>	2025-07-11 21:43:53 +05:30
Sayak Paul	7a935a0bbe	[tests] Unify compilation + offloading tests in quantization (#11910) * unify the quant compile + offloading tests. * fix * update	2025-07-11 17:02:29 +05:30
chenxiao	941b7fc084	Avoid creating tensor in CosmosAttnProcessor2_0 (#11761) (#11763) * Avoid creating tensor in CosmosAttnProcessor2_0 (#11761) * up --------- Co-authored-by: yiyixuxu <yixu310@gmail.com>	2025-07-10 11:51:05 -10:00
Álvaro Somoza	76a62ac9cc	[ControlnetUnion] Multiple Fixes (#11888) fixes --------- Co-authored-by: hlky <hlky@hlky.ac>	2025-07-10 14:35:28 -04:00
Sayak Paul	1c6ab9e900	[utils] account for MPS when available in get_device(). (#11905) * account for MPS when available in get_device(). * fix	2025-07-10 13:30:54 +05:30
Sayak Paul	265840a098	[LoRA] fix: disabling hooks when loading loras. (#11896) fix: disabling hooks when loading loras.	2025-07-10 10:30:10 +05:30
dependabot[bot]	9f4d997d8f	Bump torch from 2.4.1 to 2.7.0 in /examples/server (#11429) Bumps [torch](https://github.com/pytorch/pytorch) from 2.4.1 to 2.7.0. - [Release notes](https://github.com/pytorch/pytorch/releases) - [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md) - [Commits](https://github.com/pytorch/pytorch/compare/v2.4.1...v2.7.0) --- updated-dependencies: - dependency-name: torch dependency-version: 2.7.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>	2025-07-10 09:24:10 +05:30
Sayak Paul	b41abb2230	[quant] QoL improvements for pipeline-level quant config (#11876) * add repr for pipelinequantconfig. * update	2025-07-10 08:53:01 +05:30
YiYi Xu	f33b89bafb	The Modular Diffusers (#9672) adding modular diffusers as experimental feature --------- Co-authored-by: hlky <hlky@hlky.ac> Co-authored-by: Álvaro Somoza <asomoza@users.noreply.github.com> Co-authored-by: Aryan <aryan@huggingface.co> Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2025-07-09 16:00:28 -10:00
Álvaro Somoza	48a6d29550	[SD3] CFG Cutoff fix and official callback (#11890) fix and official callback Co-authored-by: YiYi Xu <yixu310@gmail.com>	2025-07-09 14:31:11 -04:00
Sayak Paul	2d3d376bc0	Fix unique memory address when doing group-offloading with disk (#11767) * fix memory address problem * add more tests * updates * updates * update * _group_id = group_id * update * Apply suggestions from code review Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com> * update * update * update * fix --------- Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>	2025-07-09 21:29:34 +05:30
Sébastien Iooss	db715e2c8c	feat: add multiple input image support in Flux Kontext (#11880) * feat: add multiple input image support in Flux Kontext * move model to community * fix linter	2025-07-09 11:09:59 -04:00
Sayak Paul	754fe85cac	[tests] add compile + offload tests for GGUF. (#11740) * add compile + offload tests for GGUF. * quality * add init. * prop. * change to flux. --------- Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>	2025-07-09 13:42:13 +05:30
Sayak Paul	cc1f9a2ce3	[tests] mark the wanvace lora tester flaky (#11883) * mark wanvace lora tests as flaky * ability to apply is_flaky at a class-level * update * increase max_attempt. * increase attemtp.	2025-07-09 13:27:15 +05:30
Sayak Paul	737d7fc3b0	[tests] Remove more deprecated tests (#11895) * remove k diffusion tests * remove script	2025-07-09 13:10:44 +05:30
Sayak Paul	be23f7df00	[Docker] update doc builder dockerfile to include quant libs. (#11728) update doc builder dockerfile to include quant libs.	2025-07-09 12:27:22 +05:30
Sayak Paul	86becea77f	Pin k-diffusion for CI (#11894) * remove k-diffusion as we don't use it from the core. * Revert "remove k-diffusion as we don't use it from the core." This reverts commit `8bc86925a0`. * pin k-diffusion	2025-07-09 12:17:45 +05:30
Dhruv Nair	7e3bf4aff6	[CI] Speed up GPU PR Tests (#11887) update Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>	2025-07-09 11:00:23 +05:30
shm4r7	de043c6044	Update chroma.md (#11891) Fix typo in Inference example code	2025-07-09 09:58:38 +05:30
Sayak Paul	4c20624cc6	[tests] annotate compilation test classes with bnb (#11715) annotate compilation test classes with bnb	2025-07-09 09:24:52 +05:30
Aryan	0454fbb30b	First Block Cache (#11180) * update * modify flux single blocks to make compatible with cache techniques (without too much model-specific intrusion code) * remove debug logs * update * cache context for different batches of data * fix hs residual bug for single return outputs; support ltx * fix controlnet flux * support flux, ltx i2v, ltx condition * update * update * Update docs/source/en/api/cache.md * Update src/diffusers/hooks/hooks.py Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com> * address review comments pt. 1 * address review comments pt. 2 * cache context refacotr; address review pt. 3 * address review comments * metadata registration with decorators instead of centralized * support cogvideox * support mochi * fix * remove unused function * remove central registry based on review * update --------- Co-authored-by: Dhruv Nair <dhruv.nair@gmail.com>	2025-07-09 03:27:15 +05:30
Dhruv Nair	cbc8ced20f	[CI] Fix big GPU test marker (#11786) * update * update	2025-07-08 22:09:09 +05:30