mirror of
https://github.com/NVIDIA/TensorRT-LLM.git
synced 2026-02-11 05:23:38 +08:00
Merge branch 'main' into fix_spec_gate
Signed-off-by: Zheyu Fu <zheyuf@nvidia.com>
This commit is contained in:
commit
6412f6f933
BIN
docs/source/blogs/media/tech_blog15_ds32_wide_ep.png
Normal file
BIN
docs/source/blogs/media/tech_blog15_ds32_wide_ep.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 220 KiB |
BIN
docs/source/blogs/media/tech_blog15_dsa_architecture.png
Normal file
BIN
docs/source/blogs/media/tech_blog15_dsa_architecture.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 512 KiB |
BIN
docs/source/blogs/media/tech_blog15_indexer_topk.png
Normal file
BIN
docs/source/blogs/media/tech_blog15_indexer_topk.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 329 KiB |
BIN
docs/source/blogs/media/tech_blog15_radix_select_topk.png
Normal file
BIN
docs/source/blogs/media/tech_blog15_radix_select_topk.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 598 KiB |
@ -0,0 +1,423 @@
|
||||
# Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs
|
||||
By NVIDIA TensorRT LLM team
|
||||
|
||||
## Table of Contents
|
||||
- [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](#optimizing-deepseek-v32-on-nvidia-blackwell-gpus)
|
||||
- [Table of Contents](#table-of-contents)
|
||||
- [Introduction](#introduction)
|
||||
- [DeepSeek Sparse Attention (DSA)](#deepseek-sparse-attention-dsa)
|
||||
- [Precision Strategy](#precision-strategy)
|
||||
- [Parallel Strategy](#parallel-strategy)
|
||||
- [Key Features](#key-features)
|
||||
- [MTP](#mtp)
|
||||
- [Disaggregated Serving](#disaggregated-serving)
|
||||
- [Chunked Prefill and KV Cache Reuse](#chunked-prefill-and-kv-cache-reuse)
|
||||
- [Wide Expert Parallelism (Wide-EP)](#wide-expert-parallelism-wide-ep)
|
||||
- [Chat Template and Tool Parser](#chat-template-and-tool-parser)
|
||||
- [Key Optimizations](#key-optimizations)
|
||||
- [Kernel Optimizations](#kernel-optimizations)
|
||||
- [Sparse MLA Kernel](#sparse-mla-kernel)
|
||||
- [Indexer Top-K Kernel](#indexer-top-k-kernel)
|
||||
- [DeepGEMM MQA Kernel](#deepgemm-mqa-kernel)
|
||||
- [Kernel Fusion](#kernel-fusion)
|
||||
- [System Optimizations](#system-optimizations)
|
||||
- [Multi-steams](#multi-steams)
|
||||
- [A Fast Path for Short Sequences](#a-fast-path-for-short-sequences)
|
||||
- [How to Reproduce](#how-to-reproduce)
|
||||
- [Accuracy Evaluation](#accuracy-evaluation)
|
||||
- [Benchmark on B200](#benchmark-on-b200)
|
||||
- [Min-latency](#min-latency)
|
||||
- [Max-throughput](#max-throughput)
|
||||
- [Benchmark with Wide-EP on GB200](#benchmark-with-wide-ep-on-gb200)
|
||||
- [Future Works](#future-works)
|
||||
- [Acknowledgement](#acknowledgement)
|
||||
|
||||
## Introduction
|
||||
The open-sourced [DeepSeek-V3.2](https://api-docs.deepseek.com/news/news251201) series models proposed a new architecture with a fine-grained sparse attention mechanism, called DeepSeek Sparse Attention (DSA). It can help the DeepSeek-V3.2 model achieve better efficiency, especially in long sequence scenarios. Although DSA uses a lightweight indexer for prediction, realizing actual speedup from attention sparsity is still challenging. This blog introduces how TensorRT LLM supports key LLM inference features for DeepSeek-v3.2 and optimizes its performance on NVIDIA Blackwell GPUs.
|
||||
|
||||
## DeepSeek Sparse Attention (DSA)
|
||||
DSA serves as a core component of the DeepSeek-v3.2 model, and it is the only architectural modification compared to its predecessors (DeepSeek-V3/R1/V3.1). It is a fine-grained sparse attention mechanism that only selects the important key-value entries for attention computation.
|
||||
|
||||
<div align="center">
|
||||
<figure>
|
||||
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog15_dsa_architecture.png" alt="tech_blog15_dsa_architecture" width="700" height="auto">
|
||||
</figure>
|
||||
</div>
|
||||
<p align="center"><sub><em>Figure 1. The architecture of DSA. The green part illustrates how DSA selects the Top-K key-value entries according to the indexer.</em></sub></p>
|
||||
|
||||
Figure 1 illustrates the overall architecture: a lightning indexer first determines the importance of all key-value entries for each query token. Subsequently, the Top-K Selector retains only the top-$k$ entries (typically $k=2048$) based on the index scores. Finally, attention is computed exclusively between the query token and these selected entries.
|
||||
|
||||
<div align="center">
|
||||
<figure>
|
||||
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog15_indexer_topk.png" alt="tech_blog15_indexer_topk" width="900" height="auto">
|
||||
</figure>
|
||||
</div>
|
||||
<p align="center"><sub><em>Figure 2. The architecture of the DSA indexer and Top-K logics.</em></sub></p>
|
||||
|
||||
Figure 2 illustrates the DSA indexer and the Top-K selection mechanism. Firstly, two low-rank linear layers project $c_t^Q$ and the input $h_t$ into lower-dimensional tensors. Following operations of LayerNorm to the K tensor and RoPE to both Q and K, we obtain the tensors $Q_t^I$ and $K_t^I$. Simultaneously, a separate weight projection layer processes $h_t$ to generate the weights $W_t^I$. These tensors are then used to compute the index scores (labeled as MQA Logits in Figure 2):
|
||||
|
||||
$$I_{t} = \sum_{j=1}^{h}W_j^I \cdot \text{ReLU}(Q_{t, j}^I (K_t^I)^T)$$
|
||||
|
||||
Finally, a Top-K operation is applied to the index scores to identify the most relevant indices, which are subsequently used for the sparse MLA computation. To reduce computational overhead, the K tensor $K_t^I$ is stored in the indexer K cache, allowing for reuse in subsequent iterations.
|
||||
|
||||
Regarding implementation, DSA diverges from the MLA used in DeepSeek-V3/R1/V3.1 models, which alternates between MHA mode (prefill) and MQA mode (decoding) as discussed in [Tech Blog 3](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md). Instead, our current DSA implementation operates only in MQA mode for both prefill and decoding phases to maximize kernel efficiency. We are continuing to explore further optimizations, including potential support for MHA mode in future iterations.
|
||||
|
||||
The DSA implementation is built upon the TensorRT LLM sparse attention framework, which is designed to provide flexible and extensible support for various sparse attention methods. For more information, please refer to the [sparse attention documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/sparse-attention.md), and a technical blog providing further details will be released soon.
|
||||
|
||||
## Precision Strategy
|
||||
Because the DSA is the only architectural modification of DeepSeek-V3.2 from the DeepSeek-R1 model, the mixed precision recipe for other modules is the same as what is used for the DeepSeek-R1. This is the NVFP4 precision strategy used in the DSA module:
|
||||
- Indexer
|
||||
- Low-rank linear layers: BF16
|
||||
- Weight projection layer: FP32, for model accuracy
|
||||
- MQA:
|
||||
- Indexer K cache: Blockwise FP8
|
||||
- Math: Blockwise FP8
|
||||
- Top-K: FP32
|
||||
- QKV projection layer: BF16
|
||||
- Output projection layer: NVFP4
|
||||
- Sparse MLA
|
||||
- KV cache: Per-tensor FP8
|
||||
- Math: Per-tensor FP8
|
||||
|
||||
The MoE layers use NVFP4, which is the same as the DeepSeek-R1. Please refer to [Tech Blog 1](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md) and [Tech Blog 3](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md) for the MoE precision strategy. In addition to the NVFP4 version of DeepSeek-V3.2, TensorRT-LLM also supports the original FP8 model, as well as both BF16 and per-tensor FP8 KV caches.
|
||||
|
||||
We evaluated the accuracy of this NVFP4 checkpoint on the same datasets:
|
||||
| | GSM8k | MMLU | GPQA-Diamond |
|
||||
| :----------------- | :---- | :---- | :----------- |
|
||||
| [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) | 95.91 | 87.84 | 84.34 |
|
||||
| nvidia/DeepSeek-V3.2-NVFP4<sup>*</sup> | 95.26 | 87.54 | 84.85 |
|
||||
|
||||
<sub><em>\* Currently, the NVFP4 checkpoint has not yet been published on Hugging Face. Please stay tuned, or refer to the [How to reproduce](#how-to-reproduce) section to learn how to quantize the model to NVFP4.
|
||||
** Note there are some run-to-run variance for these evaluations. Our experiments indicate that the NVFP4 recipe delivers accuracy on par with FP8 on these datasets.</em></sub>
|
||||
|
||||
## Parallel Strategy
|
||||
To achieve optimal throughput, DeepSeek-V3.2 adopts the same parallel strategy as DeepSeek-R1. Please refer to [Tech Blog 3](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md) for a detailed explanation of the performance benefits:
|
||||
| Components | Parallel Patterns |
|
||||
| :----------------- | :-------------------------- |
|
||||
| Attention Modules | Data Parallelism 8 (DP8) |
|
||||
| MoE Sparse Experts | Expert Parallelism 8 (EP8) |
|
||||
| MoE Shared Experts | DP8 |
|
||||
| Router GEMM | DP8 |
|
||||
|
||||
To scale DeepSeek-V3.2 inference on high-performance systems such as the GB200 NVL72, the model also leverages the parallel strategy from DeepSeek-R1. Please refer to [Tech Blog 4](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md), [Tech Blog 8](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md), and [Tech Blog 14](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog14_Scaling_Expert_Parallelism_in_TensorRT-LLM_part3.md) for more details.
|
||||
|
||||
The difference lies in the DSA indexer. When utilizing Tensor Parallelism (TP) for attention modules, typically in latency-oriented scenarios, TP is not applied to the indexer layers. Instead, it is applied exclusively to the MLA components (i.e., the remaining layers of the attention module).
|
||||
|
||||
## Key Features
|
||||
In TensorRT LLM, there are many advanced features that are crucial for maximizing LLM inference performance, such as CUDA Graph, Overlap Scheduler, Speculative Decoding, etc. Given the architectural innovations in DeepSeek-V3.2, ensuring its compatibility with these features is important.
|
||||
|
||||
As illustrated in [Tech Blog 3](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md), both CUDA Graph and the Overlap Scheduler offer significant throughput improvements. For CUDA Graph support, which is typically enabled during decoding-only iterations where all requests are in the decoding phase, we must ensure that kernels in the DSA module support graph capture and that input/output tensor shapes remain consistent for a given batch size. Regarding the Overlap Scheduler, it is critical to eliminate any CPU-GPU synchronization within the DSA forward, as this would disrupt the execution pipeline. Other key features are discussed in the following subsections.
|
||||
|
||||
### MTP
|
||||
Multi-Token Prediction (MTP) is a speculative decoding method used in DeepSeek series models. It verifies and accepts multiple draft tokens in a single iteration, significantly improving inference performance in both low-latency and high-throughput scenarios. The DeepSeek-V3.2 also supports MTP. For latency-critical scenarios, as detailed in [Tech Blog 1](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md), MTP-3 is recommended to maximize GPU utilization and achieve optimal performance. For other scenarios, MTP-1 typically offers performance gains as well.
|
||||
|
||||
However, the decoding indexer MQA kernel supports sequence lengths of only 1 or 2, limiting native support to MTP-off or MTP-1. To enable MTP > 1, we offer two solutions. The long-term solution involves updating the MQA kernel to support larger sequence lengths, which will be introduced in the MQA kernel optimization section. The immediate workaround (in [PR-9045](https://github.com/NVIDIA/TensorRT-LLM/pull/9045)) uses the existing kernel by flattening the sequence length dimension into the batch dimension, treating the input as a tensor with a sequence length of 1. While this approach ignores the causal mask during the indexer MQA forward, causing discrepancies in the diagonal regions compared to ground truth, the subsequent Top-K kernel handles causal masking correctly. Therefore, the final Top-K indices remain unaffected, allowing this workaround to support MTP-N for any N.
|
||||
|
||||
### Disaggregated Serving
|
||||
Disaggregated serving decouples the prefill and decoding phases, allowing them to run on separate GPU pools with optimized parallel strategies. This feature is crucial for deploying LLMs on high-performance systems like GB200 NVIDIA GPU HWs. However, it requires transferring KV cache blocks from the prefill to the decoding GPUs. DeepSeek-V3.2 introduces an additional 'indexer K cache,' which presents unique challenges for cache management and transmission in a disaggregated setup.
|
||||
|
||||
To address this, [PR-8699](https://github.com/NVIDIA/TensorRT-LLM/pull/8699) integrated indexer K cache support into the existing kvCacheManager, enabling it to inherit existing cache features. Subsequently, [PR-8735](https://github.com/NVIDIA/TensorRT-LLM/pull/8735) extended disaggregated serving capabilities to DeepSeek-V3.2, allowing TensorRT LLM to handle the transmission of the indexer K cache. Currently, the implementation specifically targets the indexer K cache, but we plan to generalize this support in future updates.
|
||||
|
||||
### Chunked Prefill and KV Cache Reuse
|
||||
Two additional critical features are chunked prefill and KV cache reuse. Chunked prefill removes input length constraints for long prompts and enables prefill chunks to be batched alongside more decoding requests, boosting throughput. KV cache reuse allows requests sharing common prefixes (e.g., system prompts or multi-turn conversations) to share cached blocks, drastically reducing time-to-first-token (TTFT).
|
||||
|
||||
On the implementation side, kvCacheManager already supports the newly introduced indexer K cache, extending compatibility to both chunked prefill and KV cache reuse. Then [PR-9376](https://github.com/NVIDIA/TensorRT-LLM/pull/9376) enabled DSA to perform prefill computation with past tokens saved in the cache, thereby unlocking chunked prefill support. Building on this, [PR-9383](https://github.com/NVIDIA/TensorRT-LLM/pull/9383) implemented KV cache reuse for DeepSeek-V3.2 by reusing the chunked prefill changes.
|
||||
|
||||
### Wide Expert Parallelism (Wide-EP)
|
||||
The Wide-EP is an important feature for boosting inference throughput in large-scale Mixture-of-Experts (MoE) models. For the DeepSeek-V3.2 model, after supporting the disaggregated serving, [PR-9245](https://github.com/NVIDIA/TensorRT-LLM/pull/9245) simply registered the model with the Expert Parallelism Load Balancer (EPLB). This integration allows Wide-EP and EPLB to be enabled, significantly enhancing performance.
|
||||
|
||||
### Chat Template and Tool Parser
|
||||
DeepSeek-V3.2 introduces a new chat template compared to prior versions. This update incorporates support for tool calling and the 'thinking with tools' capability. These enhancements, along with the necessary tool parser, were implemented in [PR-9814](https://github.com/NVIDIA/TensorRT-LLM/pull/9814) and [PR-10126](https://github.com/NVIDIA/TensorRT-LLM/pull/10126). To enable this new chat template when deploying with `trtllm-serve` or `trtllm-eval`, please specify the argument `--custom_tokenizer deepseek_v32`.
|
||||
|
||||
## Key Optimizations
|
||||
DeepSeek-V3.2 can inherit the MoE optimizations from DeepSeek-R1. Consequently, this section focuses exclusively on the DSA part, covering both kernel and system-level optimizations.
|
||||
|
||||
### Kernel Optimizations
|
||||
|
||||
#### Sparse MLA Kernel
|
||||
Sparse MLA serves as the core kernel of DSA, enabling attention computation with fine-grained token sparsity. To efficiently support this sparsity pattern, we leverage the new TMALDG.Gather4 instruction on Blackwell GPUs. This instruction loads four rows from a source 2D tensor and coalesces them into a single destination tensor, making it ideal for fine-grained sparse attention operations.
|
||||
|
||||
Similar to the dense MLA kernel, FP8 KV cache optimization is crucial for reducing KV cache size and improving E2E throughput. For DSA, we employ per-tensor FP8 quantization: both Query (Q) and Key-Value (KV) tensors are quantized, and FP8 arithmetic is utilized for the sparse MLA computation. To validate the model accuracy under this configuration, the table below presents the GPQA-Diamond accuracy comparison between BF16 and per-tensor FP8 KV cache for the DeepSeek-V3.2-Exp model. PR-8692 introduced this FP8 sparse MLA support, yielding up to a 47.03% improvement in throughput (TPS/GPU).
|
||||
|
||||
| KV Cache Type | FP8 checkpoint | NVFP4 checkpoint |
|
||||
| :--------------------------- | :------------- | :--------------- |
|
||||
| BF16 Sparse MLA and KV cache | 80.30 | 79.29 |
|
||||
| FP8 Sparse MLA and KV cache | 78.28 | 80.30 |
|
||||
|
||||
Another important optimization is SwapsMmaAb, designed specifically for Tensor Parallelism (TP) scenarios. When TP is enabled for sparse MLA, input tensors are partitioned along the Q head dimension. Consequently, each rank processes a reduced number of Q heads ($128 / \text{TP}$), leading to Tensor Core underutilization. SwapsMmaAb addresses this bottleneck by swapping the A and B operands during matrix multiplication to improve hardware utilization.
|
||||
|
||||
#### Indexer Top-K Kernel
|
||||
DSA contains a module called Top-K Selector. It is a fine-grained token selection mechanism that retrieves only the key-value entries corresponding to the Top-K index scores. The index scores are from Lightning Indexer. This part will select the top 2048 tokens for each query.
|
||||
|
||||
##### Deterministic Top-K vs Non-deterministic Top-K
|
||||
The Top‑K problem aims to find the largest (or smallest) K elements from a set of N candidates. Because some of the N candidates may have identical values, there can be more than K elements that are tied with the K‑th element. In such cases, deciding which of the tied elements are included in the final Top‑K set affects whether the output is deterministic. If the tied elements are selected randomly, the results will be non‑deterministic. Conversely, if we always prioritize elements with smaller indices, the results will be deterministic.
|
||||
|
||||
Obtaining deterministic results generally requires a more complex algorithm and incurs higher latency than a non‑deterministic version. In DeepSeek V3.2, we first need to determine whether such determinism is actually necessary. We compare the accuracy between the deterministic (DE) and non‑deterministic versions of Top‑K with the GPQA-Diamond dataset. The scores are pretty close:
|
||||
| GPQA-Diamond | DE Top-K | Non-DE Top-K |
|
||||
| :----------- | :------ | :---------- |
|
||||
| FP8 model | 79.8 | 79.9 |
|
||||
| NVFP4 model | 80.3 | 79.4 |
|
||||
|
||||
So we decided to use the non‑DE parallel Top‑K algorithm for DeepSeek V3.2.
|
||||
|
||||
##### Radix-select-based Top-K Parallel Algorithm
|
||||
<div align="center">
|
||||
<figure>
|
||||
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog15_radix_select_topk.png" alt="tech_blog15_radix_select_topk" width="1280" height="auto">
|
||||
</figure>
|
||||
</div>
|
||||
<p align="center"><sub><em>Figure 3. Radix-select-based Top-K.</em></sub></p>
|
||||
|
||||
In general, there are two kinds of parallel Top‑K algorithms: partition‑based methods and priority‑queue‑based methods. The runtime of existing priority‑queue approaches grows rapidly as K increases, and the K value is as large as 2048 for the indexer Top-K in deepseek v3.2, so we choose partition‑based methods instead. Specifically, we adopt radix‑select as our baseline.
|
||||
For 32‑bit values with 8‑bit digits, a naïve radix Top‑K algorithm runs 4 iterations, with 4 kernel launches per iteration. In each iteration, it (1) Histogram: counts how many elements fall into each digit bucket based on the current bits; (2) Prefix Sum: builds a prefix sum over these bucket counts; (3) Find target digits: identifies which bucket contains the K‑th element; and (4) Filtering: keeps all elements in smaller buckets as definite Top‑K, discards elements in larger buckets, and passes elements in the target bucket to the next iteration as new candidates.
|
||||
|
||||
##### Optimizations for Indexer Top-K
|
||||
**Skip iterations with parallel sorting.** In addition to the basic radix‑select method, we introduce further optimizations to speed up the Top‑K computation. In practice, with either 8‑bit radix select (four iterations) or 11‑bit radix select (three iterations), the number of candidates typically drops sharply after the first one or two iterations on real datasets.
|
||||
Our key optimization is to bypass the remaining radix‑select iterations and switch to a parallel sort once the candidate set becomes sufficiently small (smaller than 2048 in the current implementation). When the number of candidates is relatively small, we use a low-overhead naive O(N²) comparison-based ranking algorithm. For each element, we compare it against all others to determine its final position, and if this position is smaller than K, we keep it as part of the Top‑K output. Otherwise, we use the parallel sort from CUB to get the results. The basic implementation and this optimization were added in [PR-8882](https://github.com/NVIDIA/TensorRT-LLM/pull/8882).
|
||||
|
||||
**Specialization for different cases.** When running with real datasets, we found that the number of candidates reaching the final sorting stage was larger than expected, which resulted in higher runtime overhead. To address this issue, [PR-9255](https://github.com/NVIDIA/TensorRT-LLM/pull/9255) introduced an additional preliminary bin-distribution step to reduce the number of candidates more efficiently before the final sort. This preprocessing step halves the candidate set and uses the leading 11 bits of each value to compute its bin index.
|
||||
|
||||
##### Performance Results
|
||||
|
||||
<sub><em>Table1: Compare the performance of torch.topk and our customized Top-K op on B200.</em></sub>
|
||||
| File | torch.topk(us) | TopKPerRow(us) | Speedup |
|
||||
|-------------------------------|----------------|----------------|---------|
|
||||
| topk_inputs_layer0_rank0.npy | 106.877 | 14.069 | 7.596 |
|
||||
| topk_inputs_layer0_rank1.npy | 109.501 | 14.217 | 7.702 |
|
||||
| topk_inputs_layer0_rank2.npy | 104.616 | 14.079 | 7.431 |
|
||||
| topk_inputs_layer0_rank3.npy | 105.049 | 14.016 | 7.495 |
|
||||
| topk_inputs_layer0_rank4.npy | 105.526 | 14.073 | 7.498 |
|
||||
| topk_inputs_layer0_rank5.npy | 105.034 | 13.986 | 7.510 |
|
||||
| topk_inputs_layer0_rank6.npy | 104.516 | 14.079 | 7.423 |
|
||||
| topk_inputs_layer0_rank7.npy | 105.099 | 14.189 | 7.407 |
|
||||
| topk_inputs_layer10_rank0.npy | 109.614 | 15.281 | 7.173 |
|
||||
| topk_inputs_layer10_rank1.npy | 104.838 | 15.284 | 6.859 |
|
||||
| Average | 106.067 | 14.327 | 7.410 |
|
||||
|
||||
We use the data that is exported from real datasets across different layers. The input tensor size for each case is [64, 9295]. We select the top 2048 from the valid candidates for each query. As shown in Table 1, compared to the native torch.topk implementation, our implementation achieves an average speedup of 7.41x. This significantly optimizes the duration of the indexer module.
|
||||
|
||||
Overall, by replacing the DE-version Top-K from PyTorch with our customized non-DE Top-K kernel, which brings 25%~40% and 14%~24% e2e speedup for the low latency and throughput scenarios.
|
||||
|
||||
#### DeepGEMM MQA Kernel
|
||||
The DeepGEMM MQA kernel computes logits for the Top-K selection process. To enhance efficiency on Blackwell GPUs, several optimizations were implemented targeting both performance and ease of use:
|
||||
|
||||
- Larger MMA Tile Size: We increased the MMA tile size for both the prefill and decoding MQA kernels, yielding up to a 10% performance improvement. This optimization was implemented in commit [2f9d878](https://github.com/deepseek-ai/DeepGEMM/commit/2f9d87877ed691a62796c25f2e9496a5e0b7123a) and [fc97232](https://github.com/deepseek-ai/DeepGEMM/commit/fc97232c6f23bf5b4be5bdef52af8ce5dc499460).
|
||||
- Flexible Paged KV Cache Configurations: The decoding MQA kernel now supports a wider range of configurations. While the initial version was restricted to a block size of 64 tokens, commit [c5d4d74](https://github.com/deepseek-ai/DeepGEMM/commit/c5d4d7448665ae90a81d9d31d60d445010da50f0) enabled support for any block size $B$ satisfying the condition $64 \% B = 0$.
|
||||
- MTP-3 Support: Previously, the kernel was limited to MTP-0 or MTP-1 (predicting at most one draft token). Since MTP-3 typically delivers superior performance in low-latency scenarios, optimizations were introduced (see commit [2be3f36](https://github.com/deepseek-ai/DeepGEMM/commit/2be3f367854702e3887ff5b28b274cb16b441af9)) to enable native MTP-3 support.
|
||||
|
||||
#### Kernel Fusion
|
||||
Kernel fusion is a standard optimization technique for improving performance. For DeepSeek-V3.2, we implemented specific fusion strategies:
|
||||
|
||||
- Custom Kernels for Indexer K Cache Population: The indexer MQA utilizes blockwise FP8 for both Q and K inputs, requiring the indexer K cache to store data in a specific blockwise FP8 format. During the forward pass, the indexer K tensor must be quantized, and both the values and scaling factors are saved to the cache. To optimize this, [PR-8701](https://github.com/NVIDIA/TensorRT-LLM/pull/8701) fused the blockwise FP8 quantization logic into a single kernel. Since the original PyTorch operations were a bottleneck, this resulted in a significant 32.64%–64.20% improvement in inference throughput. Subsequently, [PR-8960](https://github.com/NVIDIA/TensorRT-LLM/pull/8960) fused indexer K tensor storage operations into a custom kernel, delivering an additional 3.5%–13.4% end-to-end (E2E) performance gain.
|
||||
- Fusing Small Kernels via torch.compile(): Beyond the major kernels, DSA involves numerous small kernels with low latencies. To reduce kernel launch overhead, we leverage torch.compile() to fuse these smaller operations:
|
||||
- [PR-8988](https://github.com/NVIDIA/TensorRT-LLM/pull/8988) consolidated indexer weight scaling for blockwise FP8 quantization.
|
||||
- [PR-9052](https://github.com/NVIDIA/TensorRT-LLM/pull/9052) fused LayerNorm operations, yielding around 1.42% speedup for low-latency scenarios and 1.90% for throughput-oriented scenarios.
|
||||
|
||||
### System Optimizations
|
||||
|
||||
#### Multi-steams
|
||||
Multi-stream execution is leveraged in the following optimizations:
|
||||
|
||||
- [PR-8988](https://github.com/NVIDIA/TensorRT-LLM/pull/8988) employs multi-stream to overlap indexer weight scaling with the indexer K cache update. Combined with torch.compile() optimization for the indexer weight scaling, this yields approximately 2.53% speedup in low-latency scenarios.
|
||||
- When improving the blockwise FP8 quantization in [PR-8701](https://github.com/NVIDIA/TensorRT-LLM/pull/8701), multi-stream is also used to enable concurrent quantization of the indexer Q and K tensors.
|
||||
- [PR-9243](https://github.com/NVIDIA/TensorRT-LLM/pull/9243) changed the indexer weight projection GEMM to FP32 to improve accuracy. However, this introduced a performance regression compared to the low-precision implementation. To mitigate this, multi-stream is utilized to overlap the FP32 weight projection GEMM with the indexer low-rank Q projection GEMM, LayerNorm, and Q/K RoPE operations.
|
||||
|
||||
#### A Fast Path for Short Sequences
|
||||
DeepSeek-V3.2 employs K=2048 for the Top-K selector. For sequences with length $N \le 2048$, all past KV tokens are inherently selected, rendering the MQA and Top-K operations redundant. [PR-9524](https://github.com/NVIDIA/TensorRT-LLM/pull/9524) implements a "fast path" to bypass these unnecessary operations for short sequences.
|
||||
|
||||
For the implementation, we can simply generate dense indices during DSA preparation, and directly change to use these dense indices in the indexer forward for prefill requests. However, decoding requests present a challenge due to CUDA Graph integration since the CUDA graph is usually enabled for decoding-only iterations. To ensure compatibility, we capture separate CUDA graphs for short and long sequences. At the start of each iteration, the system checks the sequence lengths: if any request in the batch exceeds the threshold, the long-sequence graph is triggered; otherwise, the short-sequence graph is utilized. This optimization yields approximately 1.03x speedup for 1K/1K scenarios.
|
||||
|
||||
## How to Reproduce
|
||||
This section provides the reproducing steps for NVIDIA Blackwell B200 GPUs, for both model accuracy evaluation and performance benchmark.
|
||||
|
||||
The DeepSeek-V3.2 FP4 model is used for evaluation and benchmarking. You can follow [the command of the Model-Optimizer](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/deepseek#experimental-deepseek-v32) to quantize the original DeepSeek-V3.2 model to FP4.
|
||||
|
||||
### Accuracy Evaluation
|
||||
Evaluate the model accuracy using trtllm-eval.
|
||||
|
||||
1. Prepare an advanced configuration file:
|
||||
```
|
||||
cat >./config.yml <<EOF
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128]
|
||||
enable_attention_dp: true
|
||||
kv_cache_config:
|
||||
free_gpu_memory_fraction: 0.8
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
speculative_config:
|
||||
decoding_type: MTP
|
||||
num_nextn_predict_layers: 1
|
||||
EOF
|
||||
```
|
||||
2. Evaluate accuracy on the [MMLU](https://people.eecs.berkeley.edu/~hendrycks/data.tar) dataset:
|
||||
```
|
||||
model_path=<your model path>
|
||||
trtllm-eval --model ${model_path} \
|
||||
--tp_size 8 \
|
||||
--ep_size 8 \
|
||||
--kv_cache_free_gpu_memory_fraction 0.8 \
|
||||
--config ./config.yml \
|
||||
--custom_tokenizer deepseek_v32 \
|
||||
mmlu
|
||||
```
|
||||
3. Evaluate accuracy on the [GSM8K](https://huggingface.co/datasets/openai/gsm8k) dataset:
|
||||
```
|
||||
trtllm-eval --model ${model_path} \
|
||||
--tp_size 8 \
|
||||
--ep_size 8 \
|
||||
--kv_cache_free_gpu_memory_fraction 0.8 \
|
||||
--config ./config.yml \
|
||||
--custom_tokenizer deepseek_v32 \
|
||||
gsm8k
|
||||
```
|
||||
4. Evaluate accuracy on the [GPQA-Diamond](https://huggingface.co/datasets/Idavidrein/gpqa) dataset:
|
||||
```
|
||||
trtllm-eval --model ${model_path} \
|
||||
--tp_size 8 \
|
||||
--ep_size 8 \
|
||||
--kv_cache_free_gpu_memory_fraction 0.8 \
|
||||
--config ./config.yml \
|
||||
--custom_tokenizer deepseek_v32 \
|
||||
gpqa_diamond \
|
||||
--apply_chat_template \
|
||||
--chat_template_kwargs '{"thinking": true}' \
|
||||
--max_output_length 120000
|
||||
```
|
||||
|
||||
### Benchmark on B200
|
||||
|
||||
#### Min-latency
|
||||
Our benchmark results are based on Batch = 1, ISL = 8K, OSL = 1K, num_requests = 10 from a synthetic dataset.
|
||||
To do the benchmark, run the following command:
|
||||
```
|
||||
data_path=<your dataset file following the format>
|
||||
model_path=<your model path>
|
||||
|
||||
cat <<EOF > ./config.yml
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128]
|
||||
kv_cache_config:
|
||||
free_gpu_memory_fraction: 0.8
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
speculative_config:
|
||||
decoding_type: MTP
|
||||
num_nextn_predict_layers: 3
|
||||
EOF
|
||||
|
||||
trtllm-bench -m deepseek-ai/DeepSeek-V3.2-Exp \
|
||||
--model_path ${model_path} throughput \
|
||||
--tp 4 \
|
||||
--warmup 1 \
|
||||
--dataset ${data_path} \
|
||||
--backend pytorch \
|
||||
--max_batch_size 1 \
|
||||
--max_num_tokens 8384 \
|
||||
--kv_cache_free_gpu_mem_fraction 0.8 \
|
||||
--concurrency 1 \
|
||||
--config ./config.yml \
|
||||
--num_requests 10 \
|
||||
--streaming
|
||||
```
|
||||
The expected results:
|
||||
```
|
||||
===========================================================
|
||||
= PERFORMANCE OVERVIEW
|
||||
===========================================================
|
||||
Request Throughput (req/sec): 0.2678
|
||||
Total Output Throughput (tokens/sec): 274.1786
|
||||
Total Token Throughput (tokens/sec): 2467.6070
|
||||
Total Latency (ms): 37347.9238
|
||||
Average request latency (ms): 3734.7334
|
||||
Per User Output Throughput [w/ ctx] (tps/user): 276.2231
|
||||
Per GPU Output Throughput (tps/gpu): 68.5446
|
||||
Average time-to-first-token [TTFT] (ms): 425.9885
|
||||
Average time-per-output-token [TPOT] (ms): 3.2344
|
||||
Per User Output Speed (tps/user): 312.0708
|
||||
```
|
||||
<sub><em>\* Note that `max_num_tokens` is set to a large value to cover the maximum sequence length. Please refer to the [Best Performance Practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#wip-enable-more-features-by-default) for more details on `max_num_tokens` configuration.</em></sub>
|
||||
|
||||
#### Max-throughput
|
||||
Our benchmark results are based on Batch = 256, ISL = 8K, OSL = 1K, num_requests = 768 from a synthetic dataset.
|
||||
To do the benchmark, run the following command:
|
||||
```
|
||||
data_path=<your dataset file following the format>
|
||||
model_path=<your model path>
|
||||
|
||||
cat <<EOF > ./config.yml
|
||||
cuda_graph_config:
|
||||
enable_padding: true
|
||||
batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64,128]
|
||||
enable_attention_dp: true
|
||||
kv_cache_config:
|
||||
free_gpu_memory_fraction: 0.8
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
speculative_config:
|
||||
decoding_type: MTP
|
||||
num_nextn_predict_layers: 1
|
||||
EOF
|
||||
|
||||
trtllm-bench -m deepseek-ai/DeepSeek-V3.2-Exp \
|
||||
--model_path ${model_path} throughput \
|
||||
--tp 8 \
|
||||
--ep 8 \
|
||||
--warmup 1 \
|
||||
--dataset ${data_path} \
|
||||
--backend pytorch \
|
||||
--max_batch_size 256 \
|
||||
--max_num_tokens 8576 \
|
||||
--kv_cache_free_gpu_mem_fraction 0.8 \
|
||||
--concurrency 256 \
|
||||
--config ./config.yml \
|
||||
--num_requests 768 \
|
||||
--streaming
|
||||
```
|
||||
The expected results:
|
||||
```
|
||||
===========================================================
|
||||
= PERFORMANCE OVERVIEW
|
||||
===========================================================
|
||||
Request Throughput (req/sec): 8.4162
|
||||
Total Output Throughput (tokens/sec): 8618.2158
|
||||
Total Token Throughput (tokens/sec): 77563.9425
|
||||
Total Latency (ms): 365009.1921
|
||||
Average request latency (ms): 120325.7013
|
||||
Per User Output Throughput [w/ ctx] (tps/user): 9.8876
|
||||
Per GPU Output Throughput (tps/gpu): 1077.2770
|
||||
Average time-to-first-token [TTFT] (ms): 19537.7776
|
||||
Average time-per-output-token [TPOT] (ms): 98.5219
|
||||
Per User Output Speed (tps/user): 11.2591
|
||||
```
|
||||
|
||||
### Benchmark with Wide-EP on GB200
|
||||
To validate the efficacy of Wide-EP on DeepSeek-V3.2, we evaluated performance using the NVFP4 model on a GB200 NVL72 system. We compared EP16 and EP32 configurations against EP4 and EP8 baselines, with benchmarks conducted at ISL=8K and OSL=1K using the [Rate Matching](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md#measurement-methodology) methodology.
|
||||
|
||||
<div align="center">
|
||||
<figure>
|
||||
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog15_ds32_wide_ep.png" alt="tech_blog15_ds32_wide_ep" width="700" height="auto">
|
||||
</figure>
|
||||
</div>
|
||||
<p align="center"><sub><em>Figure 4. DeepSeek-V3.2 throughput on ISL/OSL 8k/1k. Note that the numbers were collected on November 20th, and more optimizations are still on-going.</em></sub></p>
|
||||
|
||||
As illustrated in Figure 4, Wide-EP yields up to a 2.28x improvement in per-GPU output throughput. To reproduce these results, please refer to the [examples/wide_ep/slurm_scripts](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts) directory. These scripts demonstrate how to launch disaggregated serving with large-scale EP and associated features on a SLURM cluster.
|
||||
|
||||
## Future Works
|
||||
|
||||
- Optimize performance for long-sequence scenarios (e.g., ISL=32K, OSL=4K).
|
||||
- Optimize performance for large Expert Parallelism (EP) configurations.
|
||||
- Evaluate dense MHA versus MQA modes for context sparse MLA to determine the optimal configuration for processing short sequences.
|
||||
- Explore more aggressive quantization strategies for DSA.
|
||||
- Optimize the implementation of the indexer Top-K kernel.
|
||||
- Investigate KV cache offloading mechanisms for DSA.
|
||||
|
||||
## Acknowledgement
|
||||
Achieving these remarkable performance gains since the release of DeepSeek-V3.2-Exp was truly a collaborative triumph. We extend our deepest gratitude to everyone who contributed to the functional implementation and performance optimization of the DeepSeek-V3.2 model.
|
||||
|
||||
This work serves as a testament to TensorRT LLM's flexibility and effectiveness in supporting architectural innovations and novel sparse attention mechanisms. We hope this work paves the way for further advancements in sparse attention support.
|
||||
@ -37,3 +37,4 @@ opentelemetry-exporter-otlp>=1.26.0
|
||||
opentelemetry-semantic-conventions-ai>=0.4.1
|
||||
fuzzywuzzy==0.18.0
|
||||
aiperf==0.3.0
|
||||
nanobind>=2.9.0
|
||||
|
||||
1
setup.py
1
setup.py
@ -128,7 +128,6 @@ else:
|
||||
]
|
||||
|
||||
package_data += [
|
||||
'bindings.pyi',
|
||||
'bindings/*.pyi',
|
||||
'tools/plugin_gen/templates/*',
|
||||
'bench/build/benchmark_config.yml',
|
||||
|
||||
@ -17,7 +17,7 @@ from tensorrt_llm._torch.modules.multi_stream_utils import \
|
||||
maybe_execute_in_parallel
|
||||
from tensorrt_llm._torch.modules.rotary_embedding import RotaryEmbedding
|
||||
from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
|
||||
from tensorrt_llm._torch.utils import maybe_compile
|
||||
from tensorrt_llm._torch.utils import maybe_compile, maybe_compiled_cat
|
||||
from tensorrt_llm._utils import get_size_in_bytes, get_sm_version
|
||||
from tensorrt_llm.bindings import DataType
|
||||
from tensorrt_llm.bindings.executor import KvCacheConfig
|
||||
@ -1047,12 +1047,11 @@ class Indexer(nn.Module):
|
||||
# Indexer should just process the current MLA chunk as a single chunk
|
||||
has_mla_chunked_prefill = (
|
||||
metadata.enable_context_mla_with_cached_kv
|
||||
and host_cached_tokens.sum().item() > 0
|
||||
and metadata.runtime_features.chunked_prefill)
|
||||
|
||||
if has_mla_chunked_prefill:
|
||||
# The MLA has already split the sequence, here just process what's given (as a single chunk)
|
||||
# Cached token info is derived from metadata.host_ctx_cached_token_indptr in prepare_one_prefill_chunk
|
||||
# MLA chunked prefill is active - use single-chunk pattern for
|
||||
# indexer prefill chunks.
|
||||
chunk_specs = [(i, 0, host_seq_lens[i].item(),
|
||||
host_seq_lens[:i].sum().item() if i > 0 else 0)
|
||||
for i in range(num_contexts)]
|
||||
@ -1063,7 +1062,8 @@ class Indexer(nn.Module):
|
||||
)
|
||||
]
|
||||
else:
|
||||
# Normal mode: use indexer's own chunking logic to prevent L^2 complexity when long-sequence is used.
|
||||
# Use indexer's own chunking logic to prevent L^2 complexity of indexer MQA logits computation for long sequences.
|
||||
# This is only used when MLA chunked prefill is not enabled.
|
||||
chunk_groups = split_prefill_chunks(
|
||||
host_seq_lens,
|
||||
metadata.indexer_max_chunk_size,
|
||||
@ -1541,7 +1541,7 @@ class Indexer(nn.Module):
|
||||
|
||||
def _prep_q_or_k(self, qk_pe: torch.Tensor, qk_nope: torch.Tensor):
|
||||
"""Concatenate, rotate, and FP8 quantize for Q or K"""
|
||||
q_or_k = torch.cat([qk_pe, qk_nope], dim=-1)
|
||||
q_or_k = maybe_compiled_cat([qk_pe, qk_nope], dim=-1)
|
||||
q_or_k = rotate_activation(q_or_k)
|
||||
q_or_k = q_or_k.view(-1, self.head_dim)
|
||||
q_or_k = fp8_utils.fp8_quantize_1x128_sf_transpose(
|
||||
|
||||
@ -502,8 +502,10 @@ class Qwen2VisionModelBase(nn.Module):
|
||||
|
||||
class Qwen2_5_VLVisionAttention(Attention):
|
||||
|
||||
def __init__(self, model_config: ModelConfig[PretrainedConfig],
|
||||
layer_idx: int) -> None:
|
||||
def __init__(self,
|
||||
model_config: ModelConfig[PretrainedConfig],
|
||||
layer_idx: int,
|
||||
reduce_output: bool = True) -> None:
|
||||
|
||||
config = model_config.pretrained_config.vision_config
|
||||
super().__init__(
|
||||
@ -518,6 +520,7 @@ class Qwen2_5_VLVisionAttention(Attention):
|
||||
layer_idx=layer_idx,
|
||||
dtype=config.torch_dtype,
|
||||
config=model_config,
|
||||
reduce_output=reduce_output,
|
||||
)
|
||||
|
||||
def forward(
|
||||
|
||||
@ -15,6 +15,7 @@ from transformers.models.qwen3_vl.modeling_qwen3_vl import (
|
||||
|
||||
from tensorrt_llm._torch.models.modeling_multimodal_utils import _is_disagg
|
||||
from tensorrt_llm.functional import PositionEmbeddingType
|
||||
from tensorrt_llm.mapping import Mapping
|
||||
|
||||
from ..._utils import nvtx_range, nvtx_range_debug
|
||||
from ...inputs import (
|
||||
@ -439,7 +440,13 @@ class Qwen3VLVisionAttention(Qwen2_5_VLVisionAttention):
|
||||
model_config.pretrained_config.vision_config.torch_dtype = (
|
||||
model_config.pretrained_config.text_config.dtype
|
||||
)
|
||||
super().__init__(model_config, layer_idx)
|
||||
super().__init__(
|
||||
model_config,
|
||||
layer_idx=layer_idx,
|
||||
reduce_output=(
|
||||
not model_config.mapping.enable_attention_dp and model_config.mapping.tp_size > 1
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
class Qwen3VLVisionMLP(MLP):
|
||||
@ -453,12 +460,14 @@ class Qwen3VLVisionMLP(MLP):
|
||||
dtype=model_config.pretrained_config.text_config.dtype,
|
||||
config=model_config,
|
||||
layer_idx=layer_idx,
|
||||
overridden_tp_size=1 if model_config.mapping.enable_attention_dp else None,
|
||||
)
|
||||
|
||||
|
||||
class Qwen3VLVisionBlock(torch.nn.Module):
|
||||
def __init__(self, model_config: ModelConfig[PretrainedConfig], layer_idx: int):
|
||||
super().__init__()
|
||||
self.model_config = model_config
|
||||
config = model_config.pretrained_config.vision_config
|
||||
|
||||
self.norm1 = LayerNorm(
|
||||
@ -510,11 +519,29 @@ class Qwen3VLVisionPatchMerger(torch.nn.Module):
|
||||
eps=model_config.pretrained_config.text_config.rms_norm_eps,
|
||||
dtype=model_config.pretrained_config.text_config.dtype,
|
||||
)
|
||||
|
||||
self.mapping = model_config.mapping
|
||||
overridden_tp_size = 1 if model_config.mapping.enable_attention_dp else None
|
||||
if overridden_tp_size is not None:
|
||||
assert self.mapping.tp_size % overridden_tp_size == 0
|
||||
tp_size = overridden_tp_size
|
||||
# "Misuse" pp_size here to perform all-reduce within smaller groups
|
||||
pp_size = self.mapping.pp_size * self.mapping.tp_size // overridden_tp_size
|
||||
mapping = Mapping(
|
||||
world_size=tp_size * pp_size,
|
||||
rank=self.mapping.rank,
|
||||
gpus_per_node=self.mapping.gpus_per_node,
|
||||
tp_size=tp_size,
|
||||
pp_size=pp_size,
|
||||
)
|
||||
else:
|
||||
mapping = self.mapping
|
||||
|
||||
self.linear_fc1 = Linear(
|
||||
in_features=self.hidden_size,
|
||||
out_features=self.hidden_size,
|
||||
bias=True,
|
||||
mapping=model_config.mapping,
|
||||
mapping=mapping,
|
||||
tensor_parallel_mode=TensorParallelMode.COLUMN,
|
||||
allreduce_strategy=model_config.allreduce_strategy,
|
||||
)
|
||||
@ -523,7 +550,7 @@ class Qwen3VLVisionPatchMerger(torch.nn.Module):
|
||||
in_features=self.hidden_size,
|
||||
out_features=config.out_hidden_size,
|
||||
bias=True,
|
||||
mapping=model_config.mapping,
|
||||
mapping=mapping,
|
||||
tensor_parallel_mode=TensorParallelMode.ROW,
|
||||
allreduce_strategy=model_config.allreduce_strategy,
|
||||
)
|
||||
@ -705,8 +732,8 @@ class Qwen3VisionModel(torch.nn.Module):
|
||||
|
||||
@torch.inference_mode()
|
||||
def forward(
|
||||
self, hidden_states: torch.Tensor, grid_thw: torch.Tensor, **kwargs
|
||||
) -> torch.Tensor:
|
||||
self, pixel_values: torch.Tensor, grid_thw: torch.Tensor, **kwargs
|
||||
) -> Tuple[torch.Tensor, List[torch.Tensor]]:
|
||||
seq_lens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).tolist()
|
||||
attn_metadata = self.prepare_attn_metadata(seq_lens, self.attn_metadata)
|
||||
|
||||
@ -714,7 +741,7 @@ class Qwen3VisionModel(torch.nn.Module):
|
||||
rotary_pos_emb = self.rot_pos_emb(grid_thw)
|
||||
|
||||
# From this point, pure GPU operation
|
||||
hidden_states = self.patch_embed(hidden_states)
|
||||
hidden_states = self.patch_embed(pixel_values)
|
||||
seq_len, _ = hidden_states.size()
|
||||
hidden_states = hidden_states.reshape(seq_len, -1)
|
||||
|
||||
|
||||
@ -25,7 +25,8 @@ from ..distributed import AllReduceParams, HelixAllToAllNative, alltoall_helix
|
||||
from ..model_config import ModelConfig
|
||||
from ..peft.lora.layer import LoraLayer, LoraModuleType
|
||||
from ..utils import (Fp4QuantizedTensor, get_model_extra_attrs,
|
||||
is_torch_compiling, maybe_compile)
|
||||
is_torch_compiling, maybe_compiled_cat,
|
||||
maybe_compiled_copy_)
|
||||
from .linear import Linear, TensorParallelMode, WeightMode, WeightsLoadingConfig
|
||||
from .multi_stream_utils import maybe_execute_in_parallel
|
||||
from .rms_norm import RMSNorm
|
||||
@ -78,16 +79,6 @@ def extract_extra_attrs(layer_idx: str, attn_type: str):
|
||||
return metadata, attn_layer
|
||||
|
||||
|
||||
@maybe_compile
|
||||
def maybe_compiled_copy_(dst, src):
|
||||
dst.copy_(src)
|
||||
|
||||
|
||||
@maybe_compile
|
||||
def maybe_compiled_cat(tensors, dim):
|
||||
return torch.cat(tensors, dim)
|
||||
|
||||
|
||||
def create_attn_outputs_impl(q: torch.Tensor, attention_mask: str,
|
||||
layer_idx: str) -> List[torch.Tensor]:
|
||||
metadata, attn_layer = extract_extra_attrs(layer_idx, "attn")
|
||||
|
||||
@ -4,6 +4,8 @@ from typing import Optional
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from tensorrt_llm.mapping import Mapping
|
||||
|
||||
from ..model_config import ModelConfig
|
||||
from ..peft.lora.layer import LoraLayer, LoraModuleType
|
||||
from .linear import Linear, TensorParallelMode, WeightMode, WeightsLoadingConfig
|
||||
@ -20,7 +22,8 @@ class MLP(nn.Module):
|
||||
dtype: Optional[torch.dtype] = None,
|
||||
config: Optional[ModelConfig] = None,
|
||||
layer_idx: Optional[int] = None,
|
||||
reduce_output: bool = True):
|
||||
reduce_output: bool = True,
|
||||
overridden_tp_size: Optional[int] = None):
|
||||
|
||||
super().__init__()
|
||||
self.layer_idx = layer_idx
|
||||
@ -29,6 +32,22 @@ class MLP(nn.Module):
|
||||
self.activation = activation
|
||||
|
||||
config = config or ModelConfig()
|
||||
self.mapping = config.mapping
|
||||
if overridden_tp_size is not None:
|
||||
assert config.mapping.tp_size % overridden_tp_size == 0
|
||||
tp_size = overridden_tp_size
|
||||
# "Misuse" pp_size here to perform all-reduce within smaller groups
|
||||
pp_size = config.mapping.pp_size * config.mapping.tp_size // overridden_tp_size
|
||||
mapping = Mapping(
|
||||
world_size=tp_size * pp_size,
|
||||
rank=self.mapping.rank,
|
||||
gpus_per_node=self.mapping.gpus_per_node,
|
||||
tp_size=tp_size,
|
||||
pp_size=pp_size,
|
||||
)
|
||||
else:
|
||||
mapping = config.mapping
|
||||
|
||||
self.up_lora = LoraLayer(
|
||||
[LoraModuleType.MLP_H_TO_4H],
|
||||
[self.intermediate_size // config.mapping.tp_size])
|
||||
@ -38,7 +57,7 @@ class MLP(nn.Module):
|
||||
self.intermediate_size,
|
||||
bias=bias,
|
||||
dtype=dtype,
|
||||
mapping=config.mapping,
|
||||
mapping=mapping,
|
||||
tensor_parallel_mode=TensorParallelMode.COLUMN,
|
||||
weights_loading_config=WeightsLoadingConfig(
|
||||
weight_mode=WeightMode.VANILLA),
|
||||
@ -55,7 +74,7 @@ class MLP(nn.Module):
|
||||
self.hidden_size,
|
||||
bias=bias,
|
||||
dtype=dtype,
|
||||
mapping=config.mapping,
|
||||
mapping=mapping,
|
||||
tensor_parallel_mode=TensorParallelMode.ROW,
|
||||
quant_config=config.get_quant_config(),
|
||||
skip_create_weights_in_init=config.skip_create_weights_in_init,
|
||||
|
||||
@ -18,6 +18,7 @@ from ..speculative.eagle3 import Eagle3ResourceManager
|
||||
from ..speculative.mtp import SampleStateTensorsMTP
|
||||
from ..utils import make_weak_ref, piecewise_cuda_graph
|
||||
from .llm_request import get_draft_token_length
|
||||
from .mamba_cache_manager import MambaCacheManager
|
||||
from .resource_manager import (BaseResourceManager, ResourceManager,
|
||||
ResourceManagerType)
|
||||
from .sampler import SampleStateTensors
|
||||
@ -450,6 +451,11 @@ class CUDAGraphRunner:
|
||||
if spec_res_mgr:
|
||||
spec_res_mgr.add_dummy_requests([CUDA_GRAPH_DUMMY_REQUEST_ID])
|
||||
|
||||
# handle special cases of padding requests + MambaCacheManager or MambaHybridCacheManager
|
||||
if isinstance(kv_cache_manager, MambaCacheManager):
|
||||
kv_cache_manager.reorder_state_indices_when_padding_requests(
|
||||
batch_size, padding_size)
|
||||
|
||||
self.padding_dummy_request.py_draft_tokens = [0] * runtime_draft_len
|
||||
batch.generation_requests.extend([self.padding_dummy_request] *
|
||||
padding_size)
|
||||
|
||||
@ -109,23 +109,59 @@ class MambaCacheManager(BaseResourceManager):
|
||||
self.state_indices: torch.Tensor = torch.arange(max_batch_size,
|
||||
device=device,
|
||||
dtype=torch.int32)
|
||||
# save mamba state indices for requests
|
||||
self.state_indices_list: List[int] = []
|
||||
|
||||
def _prepare_mamba_cache_blocks(self, request_ids: List[int]):
|
||||
state_indices = []
|
||||
self.state_indices_list.clear()
|
||||
for r in request_ids:
|
||||
# cache hit
|
||||
if r in self.mamba_cache_index:
|
||||
state_indices.append(self.mamba_cache_index[r])
|
||||
self.state_indices_list.append(self.mamba_cache_index[r])
|
||||
# cache miss
|
||||
else:
|
||||
if len(self.mamba_cache_free_blocks) == 0:
|
||||
raise Exception("run out of mamba cache blocks")
|
||||
block = self.mamba_cache_free_blocks.pop()
|
||||
self.mamba_cache_index[r] = block
|
||||
state_indices.append(block)
|
||||
self.state_indices[:len(state_indices)].copy_(torch.tensor(
|
||||
state_indices, dtype=torch.int32, pin_memory=True),
|
||||
non_blocking=True)
|
||||
self.state_indices_list.append(block)
|
||||
self.state_indices[:len(self.state_indices_list)].copy_(
|
||||
torch.tensor(self.state_indices_list,
|
||||
dtype=torch.int32,
|
||||
pin_memory=True),
|
||||
non_blocking=True)
|
||||
|
||||
# When there exists padded requests, the state indices should not be repeated.
|
||||
def reorder_state_indices_when_padding_requests(self, request_size,
|
||||
padding_size):
|
||||
if padding_size == 0:
|
||||
return
|
||||
|
||||
assert request_size + padding_size <= self.state_indices.numel(
|
||||
), "Padding requests run out of available mamba cache blocks"
|
||||
# we can use mamba_cache_free_blocks for padding_requests
|
||||
if padding_size <= len(self.mamba_cache_free_blocks):
|
||||
self.state_indices[request_size:request_size +
|
||||
padding_size] = torch.tensor(
|
||||
self.mamba_cache_free_blocks[:padding_size],
|
||||
dtype=self.state_indices.dtype,
|
||||
pin_memory=True).to(
|
||||
self.state_indices.device,
|
||||
non_blocking=True)
|
||||
# But just finished requests won't free their used resources immediately
|
||||
# In explicit, the running order is self.scheduler.schedule_request, self._forward_step() and self._process_previous_batch() in the PyExecutor.
|
||||
# In this way, the current forward step will remove finished requests but will not remove mamba_cache immediately.
|
||||
else:
|
||||
all_mamba_cache_indices = set(range(self.state_indices.numel()))
|
||||
allocated_indices = set(self.state_indices_list)
|
||||
free_indices = list(all_mamba_cache_indices - allocated_indices)
|
||||
self.state_indices[request_size:request_size +
|
||||
padding_size] = torch.tensor(
|
||||
free_indices[:padding_size],
|
||||
dtype=self.state_indices.dtype,
|
||||
pin_memory=True).to(
|
||||
self.state_indices.device,
|
||||
non_blocking=True)
|
||||
|
||||
def prepare_resources(self, scheduled_batch: ScheduledRequests):
|
||||
context_ids = [
|
||||
|
||||
@ -250,6 +250,9 @@ class PyExecutor:
|
||||
self.send_schedule_handler = None
|
||||
self.pp_scheduler_max_retry_count = int(
|
||||
os.environ.get("TLLM_PP_SCHEDULER_MAX_RETRY_COUNT", 10))
|
||||
self.sample_stream = torch.cuda.Stream()
|
||||
self.start_sample_event = torch.cuda.Event()
|
||||
self.finish_sample_event = torch.cuda.Event()
|
||||
|
||||
# Set of request IDs that are currently in flight across all micro batches.
|
||||
# The scheduler will avoid scheduling requests that are already in flight.
|
||||
@ -1068,8 +1071,25 @@ class PyExecutor:
|
||||
guided_decoder_failed_requests = self.guided_decoder.execute(
|
||||
batch_outputs['logits'])
|
||||
|
||||
sample_state = self._sample_async(
|
||||
scheduled_batch, batch_outputs)
|
||||
if os.environ.get("TRTLLM_PP_MULTI_STREAM_SAMPLE",
|
||||
"1") == "1":
|
||||
# Wait for the previous sample to finish.
|
||||
self.finish_sample_event.wait()
|
||||
# Copy the batch outputs as sampler inputs
|
||||
# to avoid next forward step overwriting them.
|
||||
batch_outputs_copy = {
|
||||
name: tensor.clone()
|
||||
for name, tensor in batch_outputs.items()
|
||||
}
|
||||
self.start_sample_event.record()
|
||||
with torch.cuda.stream(self.sample_stream):
|
||||
self.start_sample_event.wait()
|
||||
sample_state = self._sample_async(
|
||||
scheduled_batch, batch_outputs_copy)
|
||||
self.finish_sample_event.record()
|
||||
else:
|
||||
sample_state = self._sample_async(
|
||||
scheduled_batch, batch_outputs)
|
||||
assert sample_state is not None, "Sampling failed"
|
||||
|
||||
# Handle guided decoder errors after _sample_async to avoid state conflicts.
|
||||
|
||||
@ -404,3 +404,13 @@ def split(x: torch.Tensor,
|
||||
|
||||
def relu2(x: torch.Tensor) -> torch.Tensor:
|
||||
return torch.square(F.relu(x))
|
||||
|
||||
|
||||
@maybe_compile
|
||||
def maybe_compiled_copy_(dst, src):
|
||||
dst.copy_(src)
|
||||
|
||||
|
||||
@maybe_compile
|
||||
def maybe_compiled_cat(tensors, dim):
|
||||
return torch.cat(tensors, dim)
|
||||
|
||||
@@ -356,6 +356,20 @@ class BaseMultimodalDummyInputsBuilder(ABC):
    def get_dummy_prompt(self, input_seq_len: int):
        # TODO(yechank): We use the max resolution as starting point and keep reducing the resolution until the prompt length is less than the input sequence length.
        # Need to find better way to calculate the dummy prompt length as this iteration may not be efficient.

        # Use the registered model_type from the decorator if available,
        # otherwise fall back to HuggingFace config's model_type.
        # This ensures consistency between placeholder registration and lookup.
        registered_model_type = getattr(self.__class__,
                                        '_registered_model_type', None)
        config_model_type = self.config.model_type
        model_type = registered_model_type or config_model_type

        logger.debug(
            f"[get_dummy_prompt] registered_model_type={registered_model_type}, "
            f"config.model_type={config_model_type}, using model_type={model_type}"
        )

        while self.image_max_dim >= self.image_min_dim:
            image = self.get_dummy_image(max_width=self.image_max_dim,
                                         max_height=self.image_max_dim)
@@ -363,7 +377,7 @@ class BaseMultimodalDummyInputsBuilder(ABC):
            test_mm_prompt = tensorrt_llm.inputs.utils.default_multimodal_input_loader(
                tokenizer=self.tokenizer,
                model_dir=self.model_path,
                model_type=self.config.model_type,
                model_type=model_type,
                modality="image",
                prompts=[""],
                media=[[image]],
@@ -565,6 +579,9 @@ def register_input_processor(
        MULTIMODAL_PLACEHOLDER_REGISTRY.set_placeholder_metadata(
            model_type, placeholder_metadata)

        # Store model_type on processor class for use in get_dummy_prompt
        processor_cls._registered_model_type = model_type

        return model_cls

    return wrapper

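The two hunks above record the model type on the processor class at registration time and prefer it over the HuggingFace config when building dummy prompts, so placeholder registration and lookup stay consistent. A stripped-down sketch of that lookup order is shown below; the attribute name mirrors the diff, but the surrounding class and scaffolding are invented for illustration.

```python
from types import SimpleNamespace


class DummyInputsBuilderSketch:
    """Minimal illustration of the registered-model_type fallback."""

    # Set by a registration decorator elsewhere; None on unregistered classes.
    _registered_model_type = None

    def __init__(self, config):
        self.config = config

    def resolve_model_type(self) -> str:
        registered = getattr(self.__class__, "_registered_model_type", None)
        # Prefer the type used at placeholder registration; otherwise fall back
        # to the HuggingFace config's model_type.
        return registered or self.config.model_type


if __name__ == "__main__":
    builder = DummyInputsBuilderSketch(SimpleNamespace(model_type="qwen3_vl"))
    print(builder.resolve_model_type())  # falls back to the config value
    DummyInputsBuilderSketch._registered_model_type = "qwen3_vl_moe"
    print(builder.resolve_model_type())  # uses the registered value
```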
@@ -27,3 +27,5 @@ Qwen/Qwen3-VL-30B-A3B-Instruct:
mistral/Mistral-Large-3-675B:
  # Mistral Large 3 675B only supports single image input, so accuracy is lower.
  - accuracy: 47
Qwen/Qwen3-VL-8B-Instruct:
  - accuracy: 55.11

@@ -512,6 +512,7 @@ class TestEagle2Vicuna_7B_v1_3(LlmapiAccuracyTestHarness):
        task.evaluate(llm)


@pytest.mark.skip_device_not_contain(["A100", "H100"])
class TestStarCoder2_7B(LlmapiAccuracyTestHarness):
    MODEL_NAME = "bigcode/starcoder2-7b"
    MODEL_PATH = f"{llm_models_root()}/starcoder2-7b"

@@ -4846,10 +4846,12 @@ class TestQwen3NextInstruct(LlmapiAccuracyTestHarness):
        model_path = f"{self.MODEL_PATH}/Qwen3-Next-80B-A3B-Instruct"
        kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6,
                                        enable_block_reuse=False)
        pytorch_config = dict(disable_overlap_scheduler=not overlap_scheduler,
                              cuda_graph_config=CudaGraphConfig(
                                  max_batch_size=512, enable_padding=True)
                              if cuda_graph else None)
        pytorch_config = dict(
            disable_overlap_scheduler=not overlap_scheduler,
            cuda_graph_config=CudaGraphConfig(
                enable_padding=True,
                batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048])
            if cuda_graph else None)

        with LLM(
            model_path,
@@ -4864,6 +4866,7 @@ class TestQwen3NextInstruct(LlmapiAccuracyTestHarness):
        task.evaluate(llm)
        mocker.patch.object(GSM8K, "MAX_OUTPUT_LEN",
                            self.GSM8K_MAX_OUTPUT_LEN)
        mocker.patch.object(GSM8K, "NUM_SAMPLES", 1319)
        task = GSM8K(self.MODEL_NAME)
        task.evaluate(llm)


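For reference, the Qwen3-Next change above swaps a single `max_batch_size` CUDA-graph setting for an explicit list of capture batch sizes. The hedged sketch below shows the two forms side by side using only the fields that appear in the hunk; the `tensorrt_llm.llmapi` import path is an assumption.

```python
# Assumed import path for the LLM-API config class used in the hunk above.
from tensorrt_llm.llmapi import CudaGraphConfig

cuda_graph = True
overlap_scheduler = True

# Old form: cap CUDA graphs at a single maximum batch size.
old_config = dict(
    disable_overlap_scheduler=not overlap_scheduler,
    cuda_graph_config=CudaGraphConfig(max_batch_size=512, enable_padding=True)
    if cuda_graph else None)

# New form: enumerate the batch sizes to capture; padding rounds a request
# batch up to the nearest captured size.
new_config = dict(
    disable_overlap_scheduler=not overlap_scheduler,
    cuda_graph_config=CudaGraphConfig(
        enable_padding=True,
        batch_sizes=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048])
    if cuda_graph else None)
```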
@@ -327,3 +327,21 @@ class TestMistralLarge3_675B(LlmapiAccuracyTestHarness):
        ) as llm:
            task = MMMU(self.MODEL_NAME)
            task.evaluate(llm, sampling_params=self.sampling_params)


class TestQwen3VL(LlmapiAccuracyTestHarness):
    MODEL_NAME = "Qwen/Qwen3-VL-8B-Instruct"
    MODEL_PATH = f"{llm_models_root()}/Qwen3/Qwen3-VL-8B-Instruct"
    MAX_NUM_TOKENS = 16384

    sampling_params = SamplingParams(
        max_tokens=MAX_NUM_TOKENS, truncate_prompt_tokens=MMMU.MAX_INPUT_LEN, stop="<|endoftext|>"
    )

    def test_auto_dtype(self):
        with LLM(
                self.MODEL_PATH,
                max_num_tokens=self.MAX_NUM_TOKENS,
        ) as llm:
            task = MMMU(self.MODEL_NAME)
            task.evaluate(llm, sampling_params=self.sampling_params)

@@ -3203,6 +3203,7 @@ def test_llm_llama_v3_2_smoothquant_1node_single_gpu(


@pytest.mark.timeout(7200)
@pytest.mark.skip_device_not_contain(["A100", "H100"])
@pytest.mark.skip_less_device_memory(80000)
@pytest.mark.skip_less_device(4)
@skip_post_blackwell_ultra

|
||||
profiling:
|
||||
nsys_on: false
|
||||
accuracy:
|
||||
enable_accuracy_test: false
|
||||
enable_accuracy_test: false # Set to true to enable accuracy evaluation
|
||||
model: local-completions
|
||||
tasks: gsm8k
|
||||
model_args_extra: num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
|
||||
worker_config:
|
||||
gen:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
tensor_parallel_size: 32
|
||||
moe_expert_parallel_size: 32
|
||||
enable_attention_dp: true
|
||||
@ -80,17 +79,20 @@ worker_config:
|
||||
free_gpu_memory_fraction: 0.9
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: WIDEEP
|
||||
load_balancer:
|
||||
num_slots: 288
|
||||
layer_updates_per_iter: 1
|
||||
backend: CUTEDSL
|
||||
use_low_precision_moe_combine: true
|
||||
nvfp4_gemm_config:
|
||||
allowed_backends:
|
||||
- cutlass
|
||||
- cublaslt
|
||||
- cutedsl
|
||||
- cuda_core
|
||||
cache_transceiver_config:
|
||||
max_tokens_in_buffer: 4608
|
||||
backend: NIXL
|
||||
backend: NIXLf
|
||||
stream_interval: 20
|
||||
num_postprocess_workers: 4
|
||||
ctx:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
max_batch_size: 4
|
||||
max_num_tokens: 4608
|
||||
max_seq_len: 2251
|
||||
@ -101,6 +103,8 @@ worker_config:
|
||||
print_iter_log: true
|
||||
cuda_graph_config: null
|
||||
disable_overlap_scheduler: true
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
kv_cache_config:
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
|
||||
@ -49,7 +49,6 @@ accuracy:
|
||||
model_args_extra: num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
|
||||
worker_config:
|
||||
gen:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
tensor_parallel_size: 16
|
||||
moe_expert_parallel_size: 16
|
||||
enable_attention_dp: true
|
||||
@ -80,10 +79,14 @@ worker_config:
|
||||
free_gpu_memory_fraction: 0.9
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: WIDEEP
|
||||
load_balancer:
|
||||
num_slots: 288
|
||||
layer_updates_per_iter: 1
|
||||
backend: CUTEDSL
|
||||
use_low_precision_moe_combine: true
|
||||
nvfp4_gemm_config:
|
||||
allowed_backends:
|
||||
- cutlass
|
||||
- cublaslt
|
||||
- cutedsl
|
||||
- cuda_core
|
||||
cache_transceiver_config:
|
||||
max_tokens_in_buffer: 4608
|
||||
backend: NIXL
|
||||
@ -93,7 +96,6 @@ worker_config:
|
||||
decoding_type: MTP
|
||||
num_nextn_predict_layers: 3
|
||||
ctx:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
max_batch_size: 4
|
||||
max_num_tokens: 4608
|
||||
max_seq_len: 2251
|
||||
@ -104,6 +106,8 @@ worker_config:
|
||||
print_iter_log: true
|
||||
cuda_graph_config: null
|
||||
disable_overlap_scheduler: true
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
kv_cache_config:
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
|
||||
@ -50,7 +50,6 @@ accuracy:
|
||||
model_args_extra: num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
|
||||
worker_config:
|
||||
gen:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
tensor_parallel_size: 48
|
||||
moe_expert_parallel_size: 48
|
||||
enable_attention_dp: true
|
||||
@ -81,16 +80,19 @@ worker_config:
|
||||
free_gpu_memory_fraction: 0.7
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: WIDEEP
|
||||
load_balancer:
|
||||
num_slots: 288
|
||||
layer_updates_per_iter: 1
|
||||
backend: CUTEDSL
|
||||
use_low_precision_moe_combine: true
|
||||
nvfp4_gemm_config:
|
||||
allowed_backends:
|
||||
- cutlass
|
||||
- cublaslt
|
||||
- cutedsl
|
||||
- cuda_core
|
||||
cache_transceiver_config:
|
||||
max_tokens_in_buffer: 8320
|
||||
backend: DEFAULT
|
||||
stream_interval: 20
|
||||
ctx:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
max_batch_size: 4
|
||||
max_num_tokens: 4480
|
||||
max_seq_len: 2176
|
||||
@ -101,6 +103,8 @@ worker_config:
|
||||
print_iter_log: true
|
||||
cuda_graph_config: null
|
||||
disable_overlap_scheduler: true
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
kv_cache_config:
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.85
|
||||
|
||||
@ -49,7 +49,6 @@ accuracy:
|
||||
model_args_extra: num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
|
||||
worker_config:
|
||||
gen:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
tensor_parallel_size: 32
|
||||
moe_expert_parallel_size: 32
|
||||
enable_attention_dp: true
|
||||
@ -81,10 +80,14 @@ worker_config:
|
||||
free_gpu_memory_fraction: 0.6
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: WIDEEP
|
||||
load_balancer:
|
||||
num_slots: 288
|
||||
layer_updates_per_iter: 1
|
||||
backend: CUTEDSL
|
||||
use_low_precision_moe_combine: true
|
||||
nvfp4_gemm_config:
|
||||
allowed_backends:
|
||||
- cutlass
|
||||
- cublaslt
|
||||
- cutedsl
|
||||
- cuda_core
|
||||
cache_transceiver_config:
|
||||
max_tokens_in_buffer: 8448
|
||||
backend: DEFAULT
|
||||
@ -94,7 +97,6 @@ worker_config:
|
||||
decoding_type: MTP
|
||||
num_nextn_predict_layers: 3
|
||||
ctx:
|
||||
enable_layerwise_nvtx_marker: true
|
||||
max_batch_size: 1
|
||||
max_num_tokens: 8448
|
||||
max_seq_len: 9423
|
||||
@ -109,6 +111,8 @@ worker_config:
|
||||
enable_block_reuse: false
|
||||
free_gpu_memory_fraction: 0.75
|
||||
dtype: fp8
|
||||
moe_config:
|
||||
backend: TRTLLM
|
||||
cache_transceiver_config:
|
||||
max_tokens_in_buffer: 8448
|
||||
backend: DEFAULT
|
||||
|
||||
@@ -49,7 +49,6 @@ accuracy:
  model_args_extra: num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
worker_config:
  gen:
    enable_layerwise_nvtx_marker: true
    tensor_parallel_size: 16
    moe_expert_parallel_size: 16
    enable_attention_dp: true
@@ -80,17 +79,20 @@ worker_config:
    free_gpu_memory_fraction: 0.7
    dtype: fp8
    moe_config:
      backend: WIDEEP
      load_balancer:
        num_slots: 288
        layer_updates_per_iter: 1
      backend: CUTEDSL
      use_low_precision_moe_combine: true
    nvfp4_gemm_config:
      allowed_backends:
        - cutlass
        - cublaslt
        - cutedsl
        - cuda_core
    cache_transceiver_config:
      max_tokens_in_buffer: 8448
      backend: NIXL
    stream_interval: 20
    num_postprocess_workers: 4
  ctx:
    enable_layerwise_nvtx_marker: true
    max_batch_size: 1
    max_num_tokens: 8448
    max_seq_len: 9419
@@ -105,6 +107,8 @@ worker_config:
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    moe_config:
      backend: TRTLLM
    cache_transceiver_config:
      max_tokens_in_buffer: 8448
      backend: NIXL

@@ -49,7 +49,6 @@ accuracy:
  model_args_extra: num_concurrent=512,max_retries=3,tokenized_requests=false,timeout=1200,max_gen_toks=256,max_length=4096
worker_config:
  gen:
    enable_layerwise_nvtx_marker: true
    tensor_parallel_size: 32
    moe_expert_parallel_size: 32
    enable_attention_dp: true
@@ -80,10 +79,14 @@ worker_config:
    free_gpu_memory_fraction: 0.7
    dtype: fp8
    moe_config:
      backend: WIDEEP
      load_balancer:
        num_slots: 288
        layer_updates_per_iter: 1
      backend: CUTEDSL
      use_low_precision_moe_combine: true
    nvfp4_gemm_config:
      allowed_backends:
        - cutlass
        - cublaslt
        - cutedsl
        - cuda_core
    cache_transceiver_config:
      max_tokens_in_buffer: 8448
      backend: NIXL
@@ -93,7 +96,6 @@ worker_config:
    decoding_type: MTP
    num_nextn_predict_layers: 3
  ctx:
    enable_layerwise_nvtx_marker: true
    max_batch_size: 1
    max_num_tokens: 8448
    max_seq_len: 9419
@@ -108,6 +110,8 @@ worker_config:
      enable_block_reuse: false
      free_gpu_memory_fraction: 0.75
      dtype: fp8
    moe_config:
      backend: TRTLLM
    cache_transceiver_config:
      max_tokens_in_buffer: 8448
      backend: NIXL

@@ -59,7 +59,7 @@ def get_model_yaml_config(model_label: str,
    pattern_configs = [
        # Deepseek default cases
        {
            'patterns': 'deepseek_r1',
            'patterns': ['deepseek_r1', 'kimi_k2_nvfp4'],
            'config': {
                'enable_attention_dp': True,
            }

@@ -144,6 +144,7 @@ MODEL_PATH_DICT = {
    "gpt_oss_20b_fp4": "gpt_oss/gpt-oss-20b",
    "nemotron_nano_9b_v2": "NVIDIA-Nemotron-Nano-12B-v2",
    "starcoder2_7b": "starcoder2-7b",
    "kimi_k2_nvfp4": "Kimi-K2-Thinking-NVFP4",
}
# Model PATH of HuggingFace
HF_MODEL_PATH = {

@@ -1,9 +1,12 @@
# TRT Backend Tests
examples/test_llama.py::test_llama_3_x_fp8_with_bf16_lora[llama-3.1-8b]
# TRT Backend Tests (Llama 3.1/3.3 70B + StarCoder2-7B only)
examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[enable_gemm_allreduce_plugin-llama-3.1-70b-disable_fp8]
examples/test_llama.py::test_llm_llama_v3_1_1node_multi_gpus[disable_gemm_allreduce_plugin-llama-3.1-70b-enable_fp8]
examples/test_llama.py::test_llm_llama_1gpu_fp4[llama-3.1-70b-instruct-enable_norm_quant_fusion-enable_fused_quant-fp4_plugin-bfloat16]
examples/test_llama.py::test_llm_llama_2gpu_fp4[llama-3.1-70b-instruct-fp4_plugin]
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_nvfp4_prequantized_tp4
accuracy/test_llm_api.py::TestStarCoder2_7B::test_auto_dtype
accuracy/test_llm_api.py::TestStarCoder2_7B::test_fp8

# serve tests
examples/serve/test_serve.py::test_config_file_loading[--extra_llm_api_options]
examples/serve/test_serve.py::test_config_file_loading[--config]
@@ -24,24 +27,6 @@ examples/serve/test_serve_negative.py::test_malformed_json_request
examples/serve/test_serve_negative.py::test_missing_content_type_header
examples/serve/test_serve_negative.py::test_extremely_large_batch

# Accuracy test list
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_auto_dtype
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_fp8
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_tp4[disable_gemm_allreduce_plugin]
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_tp4[enable_gemm_allreduce_plugin]
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_fp8_rowwise_tp4[disable_gemm_allreduce_plugin]
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_fp8_rowwise_tp4[enable_gemm_allreduce_plugin]
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_autoq
accuracy/test_cli_flow.py::TestLlama3_1_8BInstruct::test_auto_dtype
accuracy/test_cli_flow.py::TestLlama3_1_8BInstruct::test_fp8_prequantized
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_nvfp4_prequantized_tp4

accuracy/test_llm_api.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
accuracy/test_llm_api.py::TestLlama3_1_8BInstruct::test_guided_decoding_4gpus[xgrammar]
accuracy/test_llm_api.py::TestLlama3_1_8BInstruct::test_gather_generation_logits_cuda_graph
accuracy/test_llm_api.py::TestLlama3_1_8B::test_fp8_rowwise
accuracy/test_llm_api.py::TestLlama3_1_8BInstruct::test_logprobs
# PyTorch Backend Tests
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_prequantized
@@ -225,9 +210,6 @@ accuracy/test_llm_api_pytorch.py::TestLlama3_1NemotronNano8Bv1::test_fp8_prequan
accuracy/test_llm_api_pytorch.py::TestNemotronH::test_auto_dtype[cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronH::test_auto_dtype[cuda_graph=False]
accuracy/test_llm_api_pytorch.py::TestNemotronH::test_reasoning_fp8_prequantized[cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronH_47B_Base::test_auto_dtype[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronH_47B_Base::test_reasoning_fp8_prequantized[tp8ep8-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronH_56B_Base::test_auto_dtype[tp8-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=True]
@@ -294,6 +276,7 @@ llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_type_tensorrt
llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_type_default

# keep test cases associated open bugs
examples/test_llama.py::test_llm_llama_1gpu_fp4[llama-3.1-70b-instruct-enable_norm_quant_fusion-enable_fused_quant-fp4_plugin-bfloat16]
examples/test_nemotron.py::test_llm_nemotron_4_15b_2gpus[bfloat16-full_prec]
examples/test_nemotron.py::test_llm_nemotron_4_15b_2gpus[bfloat16-fp8]
examples/test_nemotron.py::test_llm_nemotron_4_15b_2gpus[bfloat16-int4_awq]

|
||||
# 9: H100, H20, H200, GB200, B200, B300, GB300, RTX6000-D, RTX6000-Server test cases
|
||||
# 10: GB200, B200, B300, GB300, RTX6000-Server test cases
|
||||
# 11: B200, GB200, B300, GB300 test cases
|
||||
# 12: H100, H20, H200, B200, B300 test cases
|
||||
# 13: H100, H20, H200, B200, B300, RTX-6000 Server test cases
|
||||
# 14: RTX-6000D, RTX-6000 Server test cases
|
||||
# 15: RTX6000-Server test cases
|
||||
# 12: B200, B300 test cases
|
||||
# 13: H100, H20, H200, B200, B300 test cases
|
||||
# 14: H100, H20, H200, B200, B300, RTX-6000 Server test cases
|
||||
# 15: RTX-6000D, RTX-6000 Server test cases
|
||||
# 16: RTX6000-Server test cases
|
||||
# ===============================================================================
|
||||
|
||||
|
||||
@ -289,7 +290,21 @@ llm_perf_core:
|
||||
- perf/test_perf.py::test_perf[deepseek_r1_nvfp4-bench-pytorch-float4-maxbs:1000-maxnt:5000-kv_frac:0.85-input_output_len:5000,500-reqs:2000-ep:4-tp:4-gpus:4] TIMEOUT(120)
|
||||
- perf/test_perf.py::test_perf[deepseek_r1_nvfp4-bench-pytorch-float4-maxbs:32-maxnt:32768-input_output_len:8192,1024-reqs:20-con:1-ep:1-tp:4-gpus:4] TIMEOUT(120)
|
||||
|
||||
# 12: H100, H20, H200, B200, B300 test cases
|
||||
# 12: B200, B300 test cases
|
||||
- condition:
|
||||
ranges:
|
||||
system_gpu_count:
|
||||
gte: 8
|
||||
compute_capability:
|
||||
gte: 10.0
|
||||
lte: 10.3
|
||||
tests:
|
||||
- perf/test_perf.py::test_perf[kimi_k2_nvfp4-bench-pytorch-float4-maxbs:16-input_output_len:128,128-reqs:20-con:1-ep:8-tp:8-gpus:8]
|
||||
- perf/test_perf.py::test_perf[kimi_k2_nvfp4-bench-pytorch-float4-maxbs:256-input_output_len:2000,500-ep:8-tp:8-gpus:8]
|
||||
- perf/test_perf.py::test_perf[kimi_k2_nvfp4-bench-pytorch-float4-maxbs:512-maxnt:2048-kv_frac:0.6-input_output_len:1000,1000-ep:8-tp:8-gpus:8]
|
||||
- perf/test_perf.py::test_perf[kimi_k2_nvfp4-bench-pytorch-float4-maxbs:512-maxnt:2048-kv_frac:0.6-input_output_len:5000,500-reqs:2000-ep:8-tp:8-gpus:8] TIMEOUT(120)
|
||||
|
||||
# 13: H100, H20, H200, B200, B300 test cases
|
||||
- condition:
|
||||
ranges:
|
||||
system_gpu_count:
|
||||
@ -356,7 +371,7 @@ llm_perf_core:
|
||||
- perf/test_perf.py::test_perf[qwen3_235b_a22b_fp8-bench-pytorch-float8-input_output_len:1000,2000-con:256-ep:8-gpus:8] TIMEOUT(60)
|
||||
|
||||
|
||||
# 13: H100, H20, H200, B200, B300, RTX-6000 Server test cases
|
||||
# 14: H100, H20, H200, B200, B300, RTX-6000 Server test cases
|
||||
- condition:
|
||||
ranges:
|
||||
system_gpu_count:
|
||||
@ -368,7 +383,7 @@ llm_perf_core:
|
||||
- perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8]
|
||||
|
||||
|
||||
# 14: RTX-6000D, RTX-6000 Server test cases
|
||||
# 15: RTX-6000D, RTX-6000 Server test cases
|
||||
- condition:
|
||||
ranges:
|
||||
system_gpu_count:
|
||||
@ -402,7 +417,7 @@ llm_perf_core:
|
||||
- perf/test_perf.py::test_perf[mixtral_8x7b_v0.1_instruct_fp4-bench-pytorch-float4-input_output_len:128,128-kv_cache_dtype:fp8-tp:2-gpus:2]
|
||||
|
||||
|
||||
# 15: RTX6000-Server test cases
|
||||
# 16: RTX6000-Server test cases
|
||||
- condition:
|
||||
ranges:
|
||||
system_gpu_count:
|
||||
|
||||
@ -14,6 +14,7 @@ l0_l40s:
|
||||
backend: pytorch
|
||||
tests:
|
||||
# ------------- PyTorch tests ---------------
|
||||
# Multimodal modeling tests
|
||||
- unittest/_torch/modeling -k "modeling_mllama"
|
||||
- unittest/_torch/modeling -k "modeling_siglip"
|
||||
- unittest/_torch/modeling -k "modeling_vila"
|
||||
@ -22,6 +23,7 @@ l0_l40s:
|
||||
- unittest/_torch/modeling/test_modeling_llava_next.py::TestLlavaNext::test_all
|
||||
- unittest/_torch/modeling/test_modeling_qwen2_5vl.py::TestQwen2_5_VL::test_all
|
||||
- unittest/_torch/modeling/test_modeling_qwen3vl_moe.py::TestQwen3VLMoe::test_all
|
||||
- unittest/_torch/modeling/test_modeling_qwen3vl.py::TestQwen3VL::test_all
|
||||
- test_e2e.py::test_ptp_scaffolding[DeepSeek-R1-Distill-Qwen-7B-DeepSeek-R1/DeepSeek-R1-Distill-Qwen-7B]
|
||||
- test_e2e.py::test_ptp_quickstart_multimodal_phi4mm[phi4-multimodal-instruct-multimodals/Phi-4-multimodal-instruct-audio]
|
||||
- test_e2e.py::test_ptp_quickstart_multimodal_phi4mm[phi4-multimodal-instruct-multimodals/Phi-4-multimodal-instruct-image]
|
||||
|
||||
@ -316,8 +316,6 @@ accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16_4gpus[t
|
||||
examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3-small-128k-instruct-fp8-bfloat16] SKIP (https://nvbugs/5465143)
|
||||
examples/test_multimodal.py::test_llm_multimodal_general[Mistral-Small-3.1-24B-Instruct-2503-pp:1-tp:1-bfloat16-bs:8-cpp_e2e:False-nb:1] SKIP (https://nvbugs/5644684)
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_1NemotronNano8Bv1::test_fp8_prequantized SKIP (https://nvbugs/5640697)
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronH_47B_Base::test_auto_dtype[tp8ep4-cuda_graph=True] SKIP (https://nvbugs/5640697)
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronH_47B_Base::test_reasoning_fp8_prequantized[tp8ep8-cuda_graph=True] SKIP (https://nvbugs/5640697)
|
||||
accuracy/test_llm_api_pytorch.py::TestQwQ_32B::test_auto_dtype_tp4 SKIP (https://nvbugs/5640697)
|
||||
test_e2e.py::test_ptp_quickstart_multimodal[mistral-small-3.1-24b-instruct-Mistral-Small-3.1-24B-Instruct-2503-image-True] SKIP (https://nvbugs/5648560)
|
||||
test_e2e.py::test_ptp_quickstart_multimodal[mistral-small-3.1-24b-instruct-Mistral-Small-3.1-24B-Instruct-2503-image-False] SKIP (https://nvbugs/5648560)
|
||||
@ -371,7 +369,6 @@ accuracy/test_llm_api_pytorch_multimodal.py::TestPhi4MMFusedVisionLora::test_aut
|
||||
disaggregated/test_disaggregated.py::test_disaggregated_ctxtp2pp2_gentp2pp2[TinyLlama-1.1B-Chat-v1.0] SKIP (https://nvbugs/5705199)
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2 SKIP (https://nvbugs/5707145)
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2 SKIP (https://nvbugs/5707145)
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronH_56B_Base::test_auto_dtype[tp8-cuda_graph=True] SKIP (https://nvbugs/5640697)
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True] SKIP (https://nvbugs/5707145)
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=True] SKIP (https://nvbugs/5707145)
|
||||
accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_chunked_prefill[cutlass-auto] SKIP (https://nvbugs/5596343)
|
||||
@ -420,7 +417,6 @@ examples/test_qwen.py::test_llm_qwen_7b_int8_kv_1node_1gpus[qwen2.5_7b_chat-enab
|
||||
examples/test_qwenvl.py::test_llm_qwenvl_single_gpu_summary[qwen-vl-chat] SKIP (https://nvbugs/5754976)
|
||||
examples/test_whisper.py::test_llm_whisper_general[large-v3-disable_gemm_plugin-enable_attention_plugin-int8-float16-nb:1-use_cpp_runtime] SKIP (https://nvbugs/5568052)
|
||||
accuracy/test_llm_api_pytorch_multimodal.py::TestQwen3VL_MOE::test_auto_dtype SKIP (https://nvbugs/5588376)
|
||||
accuracy/test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_tp_pp_symmetric[MMLU-tp2pp2] SKIP (https://nvbugs/5756008)
|
||||
unittest/_torch/speculative/test_dynamic_spec_decode.py::test_dynamic_spec_decode SKIP (https://nvbugs/5758449)
|
||||
unittest/executor/test_base_worker.py::TestWorkerBase SKIP (https://nvbugs/5759698)
|
||||
triton_server/test_triton.py::test_gpt_disaggregated_serving_bls[gpt-disaggregated-serving-bls] SKIP (https://nvbugs/5582118)
|
||||
@ -498,7 +494,6 @@ unittest/_torch/ray_orchestrator/multi_gpu/test_multi_instance.py::test_multi_in
|
||||
disaggregated/test_auto_scaling.py::test_worker_restart[etcd-round_robin] SKIP (https://nvbugs/5776445)
|
||||
accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_eagle3_vswa_reuse_4gpus[one_model] SKIP (https://nvbugs/5756028)
|
||||
accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_eagle3_vswa_reuse_4gpus[two_model] SKIP (https://nvbugs/5756028)
|
||||
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8[latency-torch_compile=False] SKIP (https://nvbugs/5785206)
|
||||
examples/test_gpt.py::test_llm_gpt2_parallel_embedding_2gpu[float16-0] SKIP (https://nvbugs/5784518)
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_prequantized SKIP (https://nvbugs/5785465)
|
||||
accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_fp8 SKIP (https://nvbugs/5785485)
|
||||
|
||||
@ -102,7 +102,6 @@ def test_generate_only_with_slot_mapping_cuda(conv_env):
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="https://nvbugspro.nvidia.com/bug/5548861")
|
||||
def test_context_flattened_and_state_writeback_cuda(conv_env):
|
||||
device = conv_env["device"]
|
||||
dtype = conv_env["dtype"]
|
||||
@ -124,18 +123,26 @@ def test_context_flattened_and_state_writeback_cuda(conv_env):
|
||||
dtype=dtype,
|
||||
)
|
||||
|
||||
seq_len = torch.tensor(lens, device=device, dtype=torch.int32)
|
||||
seq_start = torch.tensor([0, lens[0]], device=device, dtype=torch.int32)
|
||||
# batch_info_host: [num_prefill, num_prefill_tokens, num_decode]
|
||||
num_prefill = len(lens)
|
||||
batch_info_host = torch.tensor([num_prefill, total, 0], device=device, dtype=torch.int32)
|
||||
cu_seqlen = torch.tensor([0, lens[0], total], device=device, dtype=torch.int32)
|
||||
use_initial_states = torch.zeros(num_prefill, device=device, dtype=torch.bool)
|
||||
|
||||
y = torch.ops.auto_deploy.cuda_cached_causal_conv1d(
|
||||
# Snapshot input for reference before running op (op mutates x)
|
||||
x_ref = x.clone()
|
||||
|
||||
# Run CUDA cached op (modifies x in-place and returns None)
|
||||
torch.ops.auto_deploy.cuda_cached_causal_conv1d(
|
||||
# INPUTS
|
||||
x,
|
||||
w,
|
||||
b,
|
||||
# METADATA
|
||||
seq_len,
|
||||
seq_start,
|
||||
# STANDARD METADATA
|
||||
batch_info_host,
|
||||
cu_seqlen,
|
||||
slot_idx,
|
||||
use_initial_states,
|
||||
# CACHES
|
||||
conv_state_cache,
|
||||
# CONSTANTS
|
||||
@ -144,7 +151,9 @@ def test_context_flattened_and_state_writeback_cuda(conv_env):
|
||||
d,
|
||||
g,
|
||||
pm,
|
||||
None,
|
||||
)
|
||||
y = x # The op modifies x in-place
|
||||
|
||||
assert y.shape == (batch, seq, c)
|
||||
assert torch.isfinite(y).all()
|
||||
@ -153,9 +162,9 @@ def test_context_flattened_and_state_writeback_cuda(conv_env):
|
||||
y_ref = torch.empty_like(y)
|
||||
for i, ln in enumerate(lens):
|
||||
st = 0 if i == 0 else lens[0]
|
||||
x_i = x[:, st : st + ln]
|
||||
x_i = x_ref[:, st : st + ln]
|
||||
y_i, _ = (
|
||||
tensorrt_llm._torch.auto_deploy.custom_ops.torch_backend_causal_conv._torch_causal_conv1d_prefill( # type: ignore # noqa: E501
|
||||
tensorrt_llm._torch.auto_deploy.custom_ops.mamba.torch_backend_causal_conv._torch_causal_conv1d_prefill( # type: ignore # noqa: E501
|
||||
x_i, w, b, s, p, d, g, pm
|
||||
)
|
||||
)
|
||||
|
||||
@ -5,13 +5,15 @@ import tensorrt_llm._torch.auto_deploy # noqa: F401
|
||||
|
||||
|
||||
def _random_params(device, dtype, batch, seq, num_heads, head_dim, n_groups, ssm_state_size):
|
||||
hidden_states = torch.randn(batch, seq, num_heads, head_dim, device=device, dtype=dtype)
|
||||
A = torch.randn(num_heads, device=device, dtype=torch.float32)
|
||||
B = torch.randn(batch, seq, n_groups, ssm_state_size, device=device, dtype=dtype)
|
||||
C = torch.randn(batch, seq, n_groups, ssm_state_size, device=device, dtype=dtype)
|
||||
D = torch.randn(num_heads, device=device, dtype=dtype)
|
||||
dt = torch.randn(batch, seq, num_heads, device=device, dtype=dtype)
|
||||
dt_bias = torch.randn(num_heads, device=device, dtype=dtype)
|
||||
# Use bounded random values to avoid numerical edge cases
|
||||
# torch.rand gives [0, 1), scale to [-0.5, 0.5] for stable values
|
||||
hidden_states = torch.rand(batch, seq, num_heads, head_dim, device=device, dtype=dtype) - 0.5
|
||||
A = torch.rand(num_heads, device=device, dtype=torch.float32) - 0.5
|
||||
B = torch.rand(batch, seq, n_groups, ssm_state_size, device=device, dtype=dtype) - 0.5
|
||||
C = torch.rand(batch, seq, n_groups, ssm_state_size, device=device, dtype=dtype) - 0.5
|
||||
D = torch.rand(num_heads, device=device, dtype=dtype) - 0.5
|
||||
dt = torch.rand(batch, seq, num_heads, device=device, dtype=dtype) - 0.5
|
||||
dt_bias = torch.rand(num_heads, device=device, dtype=dtype) - 0.5
|
||||
time_step_limit = [1e-6, 1.0]
|
||||
chunk_size = 4
|
||||
return hidden_states, A, B, C, D, dt, dt_bias, time_step_limit, chunk_size
|
||||
@ -28,29 +30,32 @@ def mamba_env():
|
||||
return {"device": device, "dtype": dtype, "atol": atol, "rtol": rtol}
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="https://nvbugspro.nvidia.com/bug/5548861")
|
||||
def test_triton_generate_only_with_slot_mapping(mamba_env):
|
||||
device = mamba_env["device"]
|
||||
dtype = mamba_env["dtype"]
|
||||
atol = mamba_env["atol"]
|
||||
rtol = mamba_env["rtol"]
|
||||
|
||||
batch, seq = 3, 1
|
||||
batch, seq = 1, 1
|
||||
num_heads, head_dim = 4, 8
|
||||
n_groups, ssm_state_size = 2, 4
|
||||
(hidden_states, A, B, C, D, dt, dt_bias, time_step_limit, chunk_size) = _random_params(
|
||||
device, dtype, batch, seq, num_heads, head_dim, n_groups, ssm_state_size
|
||||
)
|
||||
|
||||
max_batch_size = 6
|
||||
slot_idx = torch.tensor([4, 1, 3], device=device, dtype=torch.int32)
|
||||
max_batch_size = 2
|
||||
slot_idx = torch.tensor([0], device=device, dtype=torch.int32)
|
||||
ssm_state_cache_torch = torch.randn(
|
||||
max_batch_size, num_heads, head_dim, ssm_state_size, device=device, dtype=dtype
|
||||
)
|
||||
ssm_state_cache_triton = ssm_state_cache_torch.clone()
|
||||
|
||||
# batch_info_host: [num_prefill, num_prefill_tokens, num_decode]
|
||||
# For generate-only: num_decode = batch, num_prefill = 0
|
||||
batch_info_host = torch.tensor([0, 0, batch], device=device, dtype=torch.int32)
|
||||
seq_len = torch.ones(batch, device=device, dtype=torch.int32)
|
||||
seq_start = torch.zeros(batch, device=device, dtype=torch.int32)
|
||||
cu_seqlen = torch.zeros(batch + 1, device=device, dtype=torch.int32)
|
||||
use_initial_states = torch.zeros(batch, device=device, dtype=torch.bool)
|
||||
|
||||
# Torch reference
|
||||
y_torch = torch.ops.auto_deploy.torch_cached_ssm(
|
||||
@ -61,10 +66,15 @@ def test_triton_generate_only_with_slot_mapping(mamba_env):
|
||||
D,
|
||||
dt,
|
||||
dt_bias,
|
||||
# STANDARD METADATA
|
||||
batch_info_host,
|
||||
seq_len,
|
||||
seq_start,
|
||||
cu_seqlen,
|
||||
slot_idx,
|
||||
use_initial_states,
|
||||
# CACHES
|
||||
ssm_state_cache_torch,
|
||||
# CONSTANTS
|
||||
time_step_limit,
|
||||
chunk_size,
|
||||
)
|
||||
@ -78,10 +88,18 @@ def test_triton_generate_only_with_slot_mapping(mamba_env):
|
||||
D,
|
||||
dt,
|
||||
dt_bias,
|
||||
seq_len,
|
||||
seq_start,
|
||||
# STANDARD METADATA
|
||||
batch_info_host,
|
||||
cu_seqlen,
|
||||
slot_idx,
|
||||
use_initial_states,
|
||||
# EXTRA METADATA
|
||||
None, # chunk indices
|
||||
None, # chunk offsets
|
||||
None, # seq_idx_prefill
|
||||
# CACHES
|
||||
ssm_state_cache_triton,
|
||||
# CONSTANTS
|
||||
time_step_limit,
|
||||
chunk_size,
|
||||
)
|
||||
|
||||
@ -237,12 +237,6 @@ class TestQwen3VL(TestModelingMultimodal):
|
||||
chunked_prefill=False,
|
||||
kv_cache_reuse=False,
|
||||
),
|
||||
# ==== Disable fuse rope scenarios ====
|
||||
# TestQwen3VLScenario(modality="image",
|
||||
# use_cuda_graph=False,
|
||||
# disable_fuse_rope=True,
|
||||
# chunked_prefill=False,
|
||||
# kv_cache_reuse=False),
|
||||
# ==== Chunked Prefill Scenarios ====
|
||||
TestQwen3VLScenario(
|
||||
modality="image",
|
||||
@ -259,6 +253,14 @@ class TestQwen3VL(TestModelingMultimodal):
|
||||
chunked_prefill=False,
|
||||
kv_cache_reuse=True,
|
||||
),
|
||||
# ==== Disable fuse rope scenarios ====
|
||||
TestQwen3VLScenario(
|
||||
modality="image",
|
||||
use_cuda_graph=False,
|
||||
disable_fuse_rope=True,
|
||||
chunked_prefill=False,
|
||||
kv_cache_reuse=False,
|
||||
),
|
||||
]
|
||||
return scenarios
|
||||
|
||||
|
||||