Release 0.20 to main (#4577)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
amirkl94 2025-05-28 11:25:33 +03:00 committed by GitHub
parent b800adc65c
commit fbec0c3552
45 changed files with 1305 additions and 393 deletions

View File

@ -1,7 +1,7 @@
version: "3.9"
services:
tensorrt_llm-dev:
image: urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400
image: urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539
network_mode: host
ipc: host

View File

@ -1,2 +1,9 @@
# These vulnerabilities were inherited from the base image (pytorch:25.05-py3) and should be removed when the base image
# is updated.
# WAR against https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
h11>=0.16.0
# WAR against https://github.com/advisories/GHSA-7cx3-6m66-7c5m
tornado>=6.5.0
# WAR against https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
setuptools>=78.1.1

View File

@ -72,9 +72,14 @@ RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh
RUN pip3 uninstall -y opencv && rm -rf /usr/local/lib/python3*/dist-packages/cv2/
RUN pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
# WAR against https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
RUN pip3 install --upgrade h11>=0.16 --no-cache-dir
# WARs against security issues inherited from pytorch:25.04
# * https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
# * https://github.com/advisories/GHSA-7cx3-6m66-7c5m
# * https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
RUN pip3 install --upgrade --no-cache-dir \
"h11>=0.16" \
"tornado>=6.5.0" \
"setuptools>=78.1.1,<80"
FROM ${TRITON_IMAGE}:${TRITON_BASE_TAG} AS triton
@ -173,5 +178,9 @@ RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh
FROM release AS tritonrelease
WORKDIR /app/tensorrt_llm
COPY ./triton_backend/ ./triton_backend/
COPY ./triton_backend/all_models ./triton_backend/all_models
COPY ./triton_backend/scripts ./triton_backend/scripts
COPY ./triton_backend/tools ./triton_backend/tools
COPY ./triton_backend/inflight_batcher_llm/scripts ./triton_backend/inflight_batcher_llm/scripts
COPY ./triton_backend/inflight_batcher_llm/client ./triton_backend/inflight_batcher_llm/client
COPY --from=tritonbuild /opt/tritonserver/backends/tensorrtllm /opt/tritonserver/backends/tensorrtllm

View File

@ -0,0 +1,75 @@
(kv-cache-management)=
# KV Cache Management: Pools, Blocks, and Events
This document provides an overview of the internal hierarchy and event system for paged KV cache management, as implemented in the TensorRT-LLM codebase.
For more information on KV cache reuse, see [KV cache reuse](kv-cache-reuse.md).
---
## Hierarchy: Pool, Block, and Page
### **Block**
- **Definition:** The smallest unit of KV cache allocation. A `KVCacheBlock` holds metadata (not the actual data) for a chunk of KV cache.
- **Purpose:** Each block represents a fixed number of tokens' worth of KV data, configurable via the `tokens_per_block` parameter.
- **Usage:** Blocks are allocated, reused, or evicted as sequences are processed.
### **Page**
- **Definition:** In this codebase, "page" is often used interchangeably with "block" (as in "paged KV cache"); strictly speaking, a page refers to a hardware-level memory page, while a block is a logical unit of the cache.
- **In Practice:** The code uses "block" as the main unit; "page" is not a distinct class or struct.
### **Pool**
- **Definition:** A pool is a contiguous memory buffer (or set of buffers) that holds the actual KV data for one or more layers.
- **Types:** There are primary pools (fast GPU memory) and secondary pools (slower, e.g., CPU or offload memory).
- **Organization:** Each pool can serve multiple layers that share the same KV head configuration. Pools are managed by `KVCacheBlockPool` and tracked in vectors in `WindowBlockManager`.
- **Block ↔ Pool:** Each block is an index into a pool; the pool provides the actual storage, while the block is the metadata handle.
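To make this split concrete, here is a minimal sketch of the block/pool relationship. The class and field names are illustrative only (they are not the actual `KVCacheBlock`/`KVCacheBlockPool` definitions); the point is that a block is a cheap metadata handle, while the pool owns the storage.

```python
from dataclasses import dataclass

import torch


@dataclass
class Block:
    """Metadata handle for one chunk of KV cache (illustrative, not the real KVCacheBlock)."""
    block_id: int            # index into the backing pool
    pool_id: int             # which pool provides the storage
    is_primary: bool = True  # primary (GPU) vs. secondary (offload) residency
    ref_count: int = 0       # 0 means the block is eligible for reuse or eviction


@dataclass
class Pool:
    """Contiguous storage for the KV data of all layers sharing one KV-head configuration."""
    num_kv_heads: int
    tokens_per_block: int
    storage: torch.Tensor    # e.g. [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]

    def block_view(self, block: Block) -> torch.Tensor:
        # The block itself holds no KV data; it only indexes into the pool.
        return self.storage[block.block_id]
```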
### **WindowBlockManager/BlockManager**
TRT-LLM supports two advanced features related to KV cache management:
1. **Variable Group-Query Attention (VGQA)**, i.e., a different `num_kv_heads` value for different layers.
2. **Variable Sliding Window Attention (VSWA)**, i.e., a different `attention_window_size` value for different layers.
To support both of these features, pool management is organized as described below.
In the simple, *most common* case, however, where a model uses
1. [MHA/MQA/non-variable GQA](gpt-attention.md#multi-head-multi-query-and-group-query-attention), i.e., the same `num_kv_heads` value for all layers, and
2. global attention/[SWA](gpt-attention.md#sliding-window-attention-cyclic-rolling-buffer-kv-cache), i.e., the same `attention_window_size` value for all layers,
only a *single* pool will be created within the structure described below.
#### KV Cache Pool Management
- **WindowBlockManager:** Manages blocks and pools for a specific attention window size. Within a `WindowBlockManager`, there can be multiple pools, each corresponding to a unique number of KV heads; this is what supports VGQA.
- **BlockManager:** Manages all `WindowBlockManager` instances, one per unique window size.
**Hierarchy Summary:**
- **Pool** (memory buffer for KV data)
- Contains many blocks.
- **Blocks** (metadata for a chunk of the pool, each block = `tokens_per_block` tokens)
- (Optionally, blocks can be swapped between primary/secondary pools.)
- **BlockManager/WindowBlockManager**: Manage pools and blocks, handle allocation, reuse, and eviction.
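As a rough sketch of how these pieces nest (reusing the illustrative `Pool`/`Block` types from the sketch above; these are not the actual TensorRT-LLM classes):

```python
from typing import Dict, List


class WindowBlockManager:
    """Pools and free blocks for a single attention window size."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.pools: Dict[int, "Pool"] = {}    # one pool per unique num_kv_heads (VGQA)
        self.free_blocks: List["Block"] = []  # blocks available for allocation or reuse


class BlockManager:
    """One WindowBlockManager per unique attention window size (VSWA)."""

    def __init__(self, window_sizes: List[int]):
        self.window_managers = {w: WindowBlockManager(w) for w in set(window_sizes)}


# In the common case (same num_kv_heads and same window size for every layer),
# this collapses to a single WindowBlockManager holding a single pool.
```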
---
## Events in `KVCacheEventManager`
The `KVCacheEventManager` is responsible for tracking and reporting significant changes in the state of the KV cache. Events are used for logging and debugging, and can also feed external monitoring.
### **Types of Events**
- **Created Event:** When pools or blocks are created/allocated.
- **Updated Event:** When a block's state changes (e.g., moved between primary/secondary, priority updated).
- **Removed Event:** When a block is removed from the cache (evicted or released).
- **Stored Event:** When blocks are stored for potential reuse (e.g., after a sequence finishes and its blocks are reusable).
### **What Triggers an Event?**
- **Allocation/Deallocation:** Creating or freeing memory pools or blocks.
- **Eviction/Reuse:** When a block is evicted, reused, or its priority changes.
- **Block Movement:** When a block is moved between memory levels (primary ↔ secondary).
- **Block Storage:** When blocks are stored for future reuse (e.g., after a sequence completes).
**In summary:**
An "event" is any significant change in the lifecycle or state of a KV cache block or pool, tracked for monitoring, debugging, or optimization purposes.
---

View File

@ -104,6 +104,7 @@ Welcome to TensorRT-LLM's Documentation!
advanced/inference-request.md
advanced/lora.md
advanced/expert-parallelism.md
advanced/kv-cache-management.md
advanced/kv-cache-reuse.md
advanced/speculative-decoding.md
advanced/disaggregated-service.md

View File

@ -4,6 +4,8 @@ In Transformer-based models, the KV (Key-Value) Cache is a mechanism used to opt
Since KV Cache requires memory to store, it is also an important resource.
In TensorRT-LLM, KV Cache is managed by the `KVCacheManager`.
For details of the TensorRT-LLM `KVCacheManager` implementation see [KV Cache Management](../advanced/kv-cache-management.md).
## KV Cache Manager Introduction
`KVCacheManager` is a type of resource manager, inheriting from `BaseResourceManager`.
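For orientation, the relationship can be pictured roughly as below; the method names are an assumption for illustration and may not match the actual `BaseResourceManager` interface.

```python
from abc import ABC, abstractmethod


class BaseResourceManager(ABC):
    """Illustrative base class; the real TensorRT-LLM interface may differ."""

    @abstractmethod
    def prepare_resources(self, scheduled_batch):
        """Reserve whatever the scheduled batch needs before execution."""

    @abstractmethod
    def free_resources(self, request):
        """Release resources held by a finished request."""


class KVCacheManager(BaseResourceManager):
    """Sketch: the KV cache is one kind of managed resource."""

    def prepare_resources(self, scheduled_batch):
        ...  # allocate KV cache blocks for the sequences in the batch

    def free_resources(self, request):
        ...  # return the request's blocks to the free pool (or keep them for reuse)
```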

View File

@ -1,24 +0,0 @@
#!/bin/bash
dataset="template_trtllm_openai_completions.json"
output_folder="output_loadgen"
port=8000
host="localhost"
max_count=256
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
streaming="False"
input_tokens=128
output_tokens=128
concurrency=32
infserver_loadgen ${dataset} \
--output_dir "${output_folder}" \
--set dataset.input_tokens:int="${input_tokens}" \
--set dataset.output_tokens:int="${output_tokens}" \
--set dataset.max_count:int="${max_count}" \
--set dataset.model_name:str="${model_name}" \
--set dataset.max_concurrent_requests:int="${concurrency}" \
--set inference_server.host:str="${host}" \
--set inference_server.port:int="${port}" \
--set post_processors[0].model_name:str="${model_name}" \
--set timing_strategy.desired_rps:float="-1" \
--set inference_server.inference_server_config.stream:bool="${streaming}"

View File

@ -1,24 +0,0 @@
{
"dataset": {
"type": "fixed_isl_osl"
},
"inference_server": {
"type": "trtllm_openai_completions",
"host": "test",
"port": null,
"inference_server_config": {
"model_name": "test"
}
},
"timing_strategy": {
"type": "fixed",
"desired_rps": -1
},
"post_processors": [
{
"type": "infbench_summary",
"model_name": "test"
}
],
"timeout": null
}

View File

@ -128,7 +128,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
--output_dir ./tmp/llama/7B/trt_engines/weight_only/1-gpu/ \
--gemm_plugin auto
# Build LLaMA 7B using 2-way auto parallelism.
# Build LLaMA 7B using 2-way auto parallelism (deprecated).
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
--output_dir ./tllm_checkpoint_1gpu_fp16 \
--dtype float16

View File

@ -30,6 +30,9 @@ from utils import (DEFAULT_HF_MODEL_DIRS, add_common_args, get_beam_width_array,
import tensorrt_llm
import tensorrt_llm.profiler as profiler
from tensorrt_llm._utils import mpi_broadcast, str_dtype_to_torch
from tensorrt_llm.builder import EngineConfig
from tensorrt_llm.functional import RopeEmbeddingUtils, RotaryScalingType
from tensorrt_llm.layers import MropeParams
from tensorrt_llm.logger import logger
from tensorrt_llm.models.qwen.utils import make_context
from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner
@ -41,6 +44,42 @@ if PYTHON_BINDINGS:
from prompt_lookup.run_dtm_pld import run_dtm_pld
def ensemble_mrope_params(batch_input_ids, max_position_embeddings,
rotary_embedding_dim, theta):
mrope_params = MropeParams()
batch_size = len(batch_input_ids)
_, rotary_cos_sin = RopeEmbeddingUtils.create_sinusoidal_positions_for_attention_plugin(
num_pos=max_position_embeddings,
dim=rotary_embedding_dim,
theta=1000000.0,
scale_type=RotaryScalingType.mrope,
)
rotary_cos_sin = torch.tensor(rotary_cos_sin).to(batch_input_ids[0].device)
rotary_cos_sin = rotary_cos_sin.reshape(max_position_embeddings,
int(rotary_embedding_dim / 2), 2)
cos_ori = rotary_cos_sin[:, :, 0]
sin_ori = rotary_cos_sin[:, :, 1]
mrope_position_ids_padding = torch.zeros(
(batch_size, max_position_embeddings), dtype=torch.int32)
for i in range(batch_size):
seq_len = batch_input_ids[i].shape[-1]
mrope_position_ids_padding[i, :seq_len] = torch.arange(
seq_len, device=batch_input_ids[i].device)
cos = cos_ori[mrope_position_ids_padding].unsqueeze(-1)
sin = sin_ori[mrope_position_ids_padding].unsqueeze(-1)
mrope_params.mrope_rotary_cos_sin = torch.concatenate(
(cos, sin), axis=-1).reshape(batch_size, -1)
mrope_params.mrope_position_deltas = torch.zeros(
[batch_size, 1], device=batch_input_ids[0].device)
return mrope_params
def main(args):
is_integration_test = os.getenv('INTEGRATION_TEST', '0') == '1'
if is_integration_test:
@ -262,7 +301,19 @@ def main(args):
eval_task=eval_task,
add_special_tokens=add_special_tokens,
min_input_length=min_input_length)
batch_size = len(batch_input_ids)
# Generate mrope params for qwen model
engine_config = EngineConfig.from_json_file(
f"{args.engine_dir}/config.json")
pretrain_config = engine_config.pretrained_config
mrope_params = None
if 'qwen' in model_name.lower():
mrope_params = ensemble_mrope_params(
batch_input_ids,
max_position_embeddings=pretrain_config.max_position_embeddings,
rotary_embedding_dim=pretrain_config.rotary_embedding_dim,
theta=pretrain_config.rotary_base,
)
if batch_size == 0:
return [], [], [], {}
input_lengths = [x.size(0) for x in batch_input_ids]
@ -309,7 +360,8 @@ def main(args):
return_dict=True,
random_seed=random_seed,
medusa_choices=args.medusa_choices,
eagle_choices=args.eagle_choices)
eagle_choices=args.eagle_choices,
mrope_params=mrope_params)
torch.cuda.synchronize()
# Extract a list of tensors of shape beam_width x output_ids.

View File

@ -28,10 +28,10 @@ UPLOAD_PATH = env.uploadPath ? env.uploadPath : "sw-tensorrt-generic/llm-artifac
// Container configuration
// available tags can be found in: https://urm.nvidia.com/artifactory/sw-tensorrt-docker/tensorrt-llm/
// [base_image_name]-[arch]-[os](-[python_version])-[trt_version]-[torch_install_type]-[stage]-[date]-[mr_id]
LLM_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_SBSA_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-aarch64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py310-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py312-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
LLM_SBSA_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-aarch64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py310-trt10.10.0.31-skip-tritondevel-202505211401-4539"
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py312-trt10.10.0.31-skip-tritondevel-202505211401-4539"
// TODO: Move common variables to an unified location
BUILD_CORES_REQUEST = "8"

View File

@ -1,7 +1,7 @@
import java.lang.InterruptedException
DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
def createKubernetesPodConfig(image)
{

View File

@ -1,6 +1,7 @@
import math
import os
import threading
from itertools import accumulate
from typing import List, Optional, Tuple, Union
import torch
@ -116,6 +117,24 @@ def get_output_info(input: torch.Tensor, dim: int) -> List[int]:
return {'output_shape': output_shape, 'numel_base': numel_base}
def filter_valid_input(
input_list: List[torch.Tensor]
) -> Tuple[List[torch.Tensor], List[bool]]:
func_valid = lambda x: x is not None
valid_list = list(map(func_valid, input_list))
input_list = list(filter(func_valid, input_list))
return input_list, valid_list
def restore_full_output(output_list: List[torch.Tensor],
valid_list: List[bool]) -> List[torch.Tensor]:
index_list = list(accumulate(map(int, valid_list)))
output_list = list(
map(lambda valid, index: output_list[index - 1]
if valid else None, valid_list, index_list))
return output_list
def allgather(
input: Union[torch.Tensor, List[torch.Tensor]],
mapping: Mapping,
@ -155,8 +174,10 @@ def allgather(
if isinstance(input, torch.Tensor):
assert input.shape[dim] == sizes[mapping.tp_rank]
else:
assert all(
[val.shape[dim] == sizes[mapping.tp_rank] for val in input])
assert all([
val.shape[dim] == sizes[mapping.tp_rank] for val in input
if val is not None
])
# 'sizes' is not needed if all inputs in the same TP group have the same shape
for split_size in sizes[1:]:
if split_size != sizes[0]:
@ -170,6 +191,7 @@ def allgather(
output_info = get_output_info(input, dim)
input = input.contiguous().view(-1, output_info['numel_base'])
else:
input, valid = filter_valid_input(input)
torch_op = torch.ops.trtllm.allgather_list
output_info = [get_output_info(val, dim) for val in input]
input = [
@ -202,6 +224,7 @@ def allgather(
convert_output(val, val_info)
for val, val_info in zip(output, output_info)
]
output = restore_full_output(output, valid)
return output
@ -220,7 +243,10 @@ def reducescatter(
if isinstance(input, torch.Tensor):
assert input.shape[dim] == sum_split_size
else:
assert all([val.shape[dim] == sum_split_size for val in input])
assert all([
val.shape[dim] == sum_split_size for val in input
if val is not None
])
# 'sizes' is not needed if all outputs in the same TP group have the same shape
for split_size in sizes[1:]:
if split_size != sizes[0]:
@ -245,6 +271,7 @@ def reducescatter(
output_info = get_output_info(input, dim)
input = convert_input(input, output_info)
else:
input, valid = filter_valid_input(input)
torch_op = torch.ops.trtllm.reducescatter_list
output_info = [get_output_info(val, dim) for val in input]
input = [
@ -265,6 +292,7 @@ def reducescatter(
val.view(val_info['output_shape'])
for val, val_info in zip(output, output_info)
]
output = restore_full_output(output, valid)
return output
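# Illustration of the new filter_valid_input / restore_full_output helpers above:
# they drop None entries before the collective op and re-insert them afterwards.
# A standalone sketch of that round-trip with plain lists (the values are made up;
# only the helper semantics matter, and the simplified definitions are a sketch,
# not the exact code from the diff):
from itertools import accumulate

def filter_valid_input(input_list):
    # Remember which positions held a real value, and keep only those values.
    valid_list = [x is not None for x in input_list]
    return [x for x in input_list if x is not None], valid_list

def restore_full_output(output_list, valid_list):
    # Re-insert None placeholders at the positions that were filtered out.
    index_list = list(accumulate(int(v) for v in valid_list))
    return [output_list[i - 1] if v else None
            for v, i in zip(valid_list, index_list)]

filtered, valid = filter_valid_input(["a", None, "b"])
assert filtered == ["a", "b"]
assert restore_full_output([x.upper() for x in filtered], valid) == ["A", None, "B"]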

View File

@ -1124,19 +1124,13 @@ class FusedMoE(nn.Module):
if self.use_dp and self.parallel_size > 1 and not disable_fp4_allgather(
) and not self.enable_alltoall:
if x_sf is None:
x, token_selected_slots, token_final_scales = allgather(
[x, token_selected_slots, token_final_scales],
self.mapping,
dim=0,
sizes=None if use_dp_padding else all_rank_num_tokens)
else:
# Fp4 gemm has extra scaling factor
x, x_sf, token_selected_slots, token_final_scales = allgather(
[x, x_sf, token_selected_slots, token_final_scales],
self.mapping,
dim=0,
sizes=None if use_dp_padding else all_rank_num_tokens)
x, x_sf, token_selected_slots, token_final_scales = allgather(
[x, x_sf, token_selected_slots, token_final_scales],
self.mapping,
dim=0,
sizes=None if use_dp_padding else all_rank_num_tokens)
# Fp4 gemm has extra scaling factor
if x_sf is not None:
x_sf = reswizzle_sf(x_sf, x_row, x_col,
self.scaling_vector_size)

View File

@ -149,6 +149,9 @@ def infer_builder_flags(network):
def auto_parallel(network: Network, config: AutoParallelConfig):
logger.warning(
"auto_parallel is deprecated, "
"please use explicit parallelism like tp_size/pp_size instead.")
debug_mode = config.debug_mode
memory_budget = config.get_cluster_info(
).memory_budget_per_device * 1024 * 1024 * 1024

View File

@ -1359,11 +1359,19 @@ class BaseLlmArgs(BaseModel):
class TrtLlmArgs(BaseLlmArgs):
auto_parallel: bool = Field(default=False,
description="Enable auto parallel mode.")
auto_parallel: bool = Field(
default=False,
description="Enable auto parallel mode.",
deprecated=
"Use tensor_parallel_size/pipeline_parallel_size/xxx_parallel_size instead.",
)
auto_parallel_world_size: Optional[int] = Field(
default=None, description="The world size for auto parallel mode.")
default=None,
description="The world size for auto parallel mode.",
deprecated=
"Use tensor_parallel_size/pipeline_parallel_size/xxx_parallel_size instead.",
)
enable_tqdm: bool = Field(default=False,
description="Enable tqdm for progress bar.")

View File

@ -434,6 +434,9 @@ class CliFlowAccuracyTestHarness:
f"--dtype={self.dtype}",
]
if "nemotron_nas" in self.EXAMPLE_FOLDER:
convert_cmd.append("--trust_remote_code")
if self.MODEL_FORMAT == "NEMO":
convert_cmd.append(f"--nemo_ckpt_path={self.MODEL_PATH}")
else:

View File

@ -137,6 +137,8 @@ meta-llama/Llama-3.2-1B:
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 27.029
- quant_algo: FP8
accuracy: 27.029
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
accuracy: 27.257
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
@ -310,5 +312,3 @@ Qwen3/Qwen3-8B:
accuracy: 30
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 34.003
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 27.810

View File

@ -16,3 +16,12 @@ deepseek-ai/DeepSeek-R1:
accuracy: 70.45
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 44.95
- quant_algo: FP8
accuracy: 49.49
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 40.40
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
- accuracy: 58.08
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 57.07

View File

@ -72,5 +72,14 @@ Qwen3/Qwen3-235B-A22B:
accuracy: 85.78
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 92.57
- quant_algo: FP8
accuracy: 92.42
nvidia/Nemotron-H-8B-Base-8K:
- accuracy: 46.20
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 37.15
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
- accuracy: 94.43
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 94.16

View File

@ -28,6 +28,26 @@ meta-llama/Llama-3.1-8B-Instruct:
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 67.87
meta-llama/Llama-3.2-1B:
- quant_algo: W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN
accuracy: 32.72
- quant_algo: W8A8_SQ_PER_CHANNEL
accuracy: 32.07
- quant_algo: W4A16_AWQ
accuracy: 30.56
- quant_algo: W4A16_AWQ
kv_cache_quant_algo: INT8
accuracy: 31.29
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 31.02
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
accuracy: 33.97
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
extra_acc_spec: meta_recipe
accuracy: 33.87
- extra_acc_spec: max_attention_window_size=960
accuracy: 32.82
meta-llama/Llama-3.3-70B-Instruct:
- accuracy: 81.31
- quant_algo: NVFP4
@ -128,9 +148,16 @@ Qwen3/Qwen3-235B-A22B:
accuracy: 86
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 79.43
- quant_algo: FP8
accuracy: 79.26
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 57.97
nvidia/Nemotron-H-8B-Base-8K:
- accuracy: 69.590
microsoft/Phi-4-mini-instruct:
- accuracy: 68.98
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
- accuracy: 83.70
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 83.36

View File

@ -200,6 +200,97 @@ class TestNemotronMini4BInstruct(CliFlowAccuracyTestHarness):
self.run(quant_algo=QuantAlgo.FP8, kv_cache_quant_algo=QuantAlgo.FP8)
# TODO: Remove the CLI tests once NIMs use PyTorch backend
class TestLlama3_3NemotronSuper49Bv1(CliFlowAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1"
EXAMPLE_FOLDER = "models/core/nemotron_nas"
@pytest.mark.skip_less_device(2)
def test_auto_dtype_tp2(self):
self.run(tasks=[MMLU(self.MODEL_NAME)], tp_size=2, dtype='auto')
@pytest.mark.skip(
reason="nemotron-nas scripts have to accommodate fp8 flags")
@pytest.mark.skip_less_device(2)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
def test_fp8_prequantized_tp2(self, mocker):
mocker.patch.object(
self.__class__, "MODEL_PATH",
f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8"
)
self.run(tasks=[MMLU(self.MODEL_NAME)],
tp_size=2,
quant_algo=QuantAlgo.FP8)
class TestNemotronNano(CliFlowAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
MODEL_PATH = f"{llm_models_root()}/Llama-3.1-Nemotron-Nano-8B-v1"
EXAMPLE_FOLDER = "models/core/llama"
def test_auto_dtype(self):
self.run(tasks=[MMLU(self.MODEL_NAME)], dtype='auto')
class TestNemotronUltra(CliFlowAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1"
EXAMPLE_FOLDER = "models/core/nemotron_nas"
@skip_pre_hopper
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
extra_summarize_args = []
if cuda_graph:
extra_summarize_args.append("--cuda_graph_mode")
self.run(tasks=[MMLU(self.MODEL_NAME)],
tp_size=tp_size,
pp_size=pp_size,
extra_convert_args=[
f"--moe_tp_size={tp_size // ep_size}",
f"--moe_ep_size={ep_size}", f"--moe_renorm_mode={0}"
],
extra_build_args=["--gemm_plugin=auto", "--moe_plugin=auto"],
extra_summarize_args=extra_summarize_args)
@skip_pre_hopper
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_fp8_prequantized(self, cuda_graph, tp_size, pp_size, ep_size,
mocker):
mocker.patch.object(
self.__class__, "MODEL_PATH",
f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1-FP8"
)
extra_summarize_args = []
if cuda_graph:
extra_summarize_args.append("--cuda_graph_mode")
self.run(tasks=[MMLU(self.MODEL_NAME)],
quant_algo=QuantAlgo.FP8,
kv_cache_quant_algo=QuantAlgo.FP8,
tp_size=tp_size,
pp_size=pp_size,
extra_convert_args=[
f"--moe_tp_size={tp_size // ep_size}",
f"--moe_ep_size={ep_size}", f"--moe_renorm_mode={0}"
],
extra_build_args=["--gemm_plugin=auto", "--moe_plugin=auto"],
extra_summarize_args=extra_summarize_args)
@skip_post_blackwell
class TestPhi2(CliFlowAccuracyTestHarness):
MODEL_NAME = "microsoft/phi-2"
@ -847,9 +938,7 @@ class TestLlama3_3_70BInstruct(CliFlowAccuracyTestHarness):
@pytest.mark.skip_device_not_contain(["B200"])
def test_nvfp4_prequantized_tp4(self, mocker):
mocker.patch.object(
self.__class__,
"MODEL_PATH",
model_path=
self.__class__, "MODEL_PATH",
f"{llm_models_root()}/modelopt-hf-model-hub/Llama-3.3-70B-Instruct-fp4"
)
self.run(tasks=[MMLU(self.MODEL_NAME)],

View File

@ -2,12 +2,12 @@
# I need to do this by creating a new class that mimics the LLM class. Instead of implementing the
# actual methods it will send OAI requests to the disaggregated serving endpoint.
# Please take a look at the existing test_llm_api_pytorch.py file for reference.
import concurrent
import contextlib
import os
import shutil
import subprocess
import tempfile
import time
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional
@ -16,11 +16,12 @@ import pytest
import requests
import yaml
from tensorrt_llm._torch import LLM
from tensorrt_llm.executor.result import GenerationResultBase
from tensorrt_llm.llmapi import CompletionOutput, RequestOutput, SamplingParams
from tensorrt_llm.llmapi.llm_args import LlmArgs
from ..conftest import llm_models_root
from ..trt_test_alternative import popen
from .accuracy_core import GSM8K, MMLU, LlmapiAccuracyTestHarness
@ -40,76 +41,85 @@ class Result(GenerationResultBase):
return self
class OpenAIServerClient:
DuckLLM = namedtuple('DuckLLM', ['args', 'generate_async'])
def __init__(self,
disaggregated_server_config: Dict[str, Any],
ctx_server_config: Dict[str, Any],
gen_server_config: Dict[str, Any],
model_name: str,
tensor_parallel_size: int = 1):
self.thread_pool = ThreadPoolExecutor(max_workers=16)
self.temp_dir = tempfile.mkdtemp()
self.futures = []
self.disaggregated_serving_config_path = os.path.join(
self.temp_dir, "disaggregated_serving_config.yaml")
with open(self.disaggregated_serving_config_path, "w") as f:
yaml.dump(disaggregated_server_config, f)
ctx_server_config_path = os.path.join(self.temp_dir,
"ctx_server_config.yaml")
with open(ctx_server_config_path, "w") as f:
yaml.dump(ctx_server_config, f)
gen_server_config_path = os.path.join(self.temp_dir,
"gen_server_config.yaml")
with open(gen_server_config_path, "w") as f:
yaml.dump(gen_server_config, f)
with LLM(model_name, tensor_parallel_size=tensor_parallel_size) as llm:
self.args = llm.args
class MyThreadPoolExecutor(ThreadPoolExecutor):
cuda_device_idx = 0
cuda_devices = []
for i in range(tensor_parallel_size):
cuda_devices.append(f"{cuda_device_idx}")
cuda_device_idx += 1
def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.futures: list[concurrent.futures.Future[RequestOutput]] = []
trtllm_serve_path = "trtllm-serve"
# Common arguments for both servers
common_args = [
trtllm_serve_path, model_name, "--host", "localhost", "--backend",
"pytorch"
]
if tensor_parallel_size > 1:
common_args.append(f"--tp_size={tensor_parallel_size}")
env_ctx = os.environ.copy()
env_ctx["TRTLLM_USE_UCX_KVCACHE"] = "1"
env_ctx["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_devices)
# Start the context server
self._ctx_server = subprocess.Popen(common_args + [
"--port", "8001", "--extra_llm_api_options", ctx_server_config_path
],
env=env_ctx)
# Start the generation server
env_gen = os.environ.copy()
env_gen["TRTLLM_USE_UCX_KVCACHE"] = "1"
cuda_devices = []
for i in range(tensor_parallel_size):
cuda_devices.append(f"{cuda_device_idx}")
cuda_device_idx += 1
env_gen["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_devices)
self._gen_server = subprocess.Popen(common_args + [
"--port", "8002", "--extra_llm_api_options", gen_server_config_path
],
env=env_gen)
def __exit__(self, exc_type, exc_val, exc_tb):
if exc_type is None:
for future in self.futures:
future.result()
return super().__exit__(exc_type, exc_val, exc_tb)
# Start the disaggregated server
self._disaggregated_server = subprocess.Popen([
trtllm_serve_path, "disaggregated", "-c",
self.disaggregated_serving_config_path, "--server_start_timeout",
"3600"
])
self.model_name = model_name
for future in self.futures:
future.cancel()
self.shutdown(wait=False, cancel_futures=True)
return False
@contextlib.contextmanager
def launch_disaggregated_llm(disaggregated_server_config: Dict[str, Any],
ctx_server_config: Dict[str, Any],
gen_server_config: Dict[str, Any],
model_name: str,
tensor_parallel_size: int = 1):
temp_dir = tempfile.TemporaryDirectory()
disaggregated_serving_config_path = os.path.join(
temp_dir.name, "disaggregated_serving_config.yaml")
with open(disaggregated_serving_config_path, "w") as f:
yaml.dump(disaggregated_server_config, f)
ctx_server_config_path = os.path.join(temp_dir.name,
"ctx_server_config.yaml")
with open(ctx_server_config_path, "w") as f:
yaml.dump(ctx_server_config, f)
gen_server_config_path = os.path.join(temp_dir.name,
"gen_server_config.yaml")
with open(gen_server_config_path, "w") as f:
yaml.dump(gen_server_config, f)
args = LlmArgs.from_kwargs(model=model_name,
tensor_parallel_size=tensor_parallel_size)
trtllm_serve_path = "trtllm-serve"
# Common arguments for both servers
common_args = [
trtllm_serve_path, model_name, "--host", "localhost", "--backend",
"pytorch"
]
if tensor_parallel_size > 1:
common_args.append(f"--tp_size={tensor_parallel_size}")
env_ctx = os.environ.copy()
env_ctx["TRTLLM_USE_UCX_KVCACHE"] = "1"
env_ctx["CUDA_VISIBLE_DEVICES"] = ",".join(
map(str, range(tensor_parallel_size)))
env_gen = os.environ.copy()
env_gen["TRTLLM_USE_UCX_KVCACHE"] = "1"
env_gen["CUDA_VISIBLE_DEVICES"] = ",".join(
map(str, range(tensor_parallel_size, 2 * tensor_parallel_size)))
with (MyThreadPoolExecutor(max_workers=16) as thread_pool, temp_dir,
popen(common_args + [
"--port", "8001", "--extra_llm_api_options",
ctx_server_config_path
],
env=env_ctx) as ctx_server,
popen(common_args + [
"--port", "8002", "--extra_llm_api_options",
gen_server_config_path
],
env=env_gen) as gen_server,
popen([
trtllm_serve_path, "disaggregated", "-c",
disaggregated_serving_config_path, "--server_start_timeout",
"3600"
]) as disaggregated_server):
while True:
time.sleep(1)
try:
@ -120,54 +130,47 @@ class OpenAIServerClient:
except requests.exceptions.ConnectionError:
continue
self.client = openai.OpenAI(api_key="1234567890",
base_url=f"http://localhost:8000/v1")
client = openai.OpenAI(api_key="1234567890",
base_url=f"http://localhost:8000/v1")
def send_request(self, prompt: str, sampling_params: SamplingParams):
response = self.client.completions.create(
model=self.model_name,
prompt=prompt,
stream=False,
**({
"max_tokens": sampling_params.max_tokens,
"temperature": sampling_params.temperature,
"top_p": sampling_params.top_p,
"stop": sampling_params.stop,
"seed": sampling_params.seed
} if sampling_params else {}))
result = Result(
id=0,
sampling_params=sampling_params,
outputs=[CompletionOutput(text=response.choices[0].text, index=0)])
requested_output = RequestOutput._from_generation_result(result,
prompt=prompt)
setattr(requested_output, "result", result.result)
return requested_output
def send_request(prompt: str, sampling_params: SamplingParams):
response = client.completions.create(
model=model_name,
prompt=prompt,
stream=False,
**({
"max_tokens": sampling_params.max_tokens,
"temperature": sampling_params.temperature,
"top_p": sampling_params.top_p,
"stop": sampling_params.stop,
"seed": sampling_params.seed
} if sampling_params else {}))
result = Result(id=0,
sampling_params=sampling_params,
outputs=[
CompletionOutput(text=response.choices[0].text,
index=0)
])
requested_output = RequestOutput._from_generation_result(
result, prompt=prompt)
setattr(requested_output, "result", result.result)
return requested_output
def generate_async(self,
prompt: str,
sampling_params: Optional[SamplingParams] = None):
future = self.thread_pool.submit(self.send_request, prompt,
sampling_params)
self.futures.append(future)
return future
def generate_async(prompt: str,
sampling_params: Optional[SamplingParams] = None):
future = thread_pool.submit(send_request, prompt, sampling_params)
thread_pool.futures.append(future)
return future
def __enter__(self):
return self
yield DuckLLM(args, generate_async)
def __exit__(self, exc_type, exc_value, traceback):
shutil.rmtree(self.temp_dir)
self._ctx_server.terminate()
self._gen_server.terminate()
self._disaggregated_server.terminate()
ctx_server.terminate()
gen_server.terminate()
disaggregated_server.terminate()
self._ctx_server.wait()
self._gen_server.wait()
self._disaggregated_server.wait()
for future in self.futures:
future.result()
self.thread_pool.shutdown(wait=True)
ctx_server.wait()
gen_server.wait()
disaggregated_server.wait()
class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
@ -201,12 +204,13 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
"urls": ["localhost:8002"]
}
}
with OpenAIServerClient(disaggregated_server_config, ctx_server_config,
gen_server_config, self.MODEL_PATH) as client:
with launch_disaggregated_llm(disaggregated_server_config,
ctx_server_config, gen_server_config,
self.MODEL_PATH) as llm:
task = MMLU(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)
class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
@ -215,6 +219,7 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
@pytest.mark.parametrize("overlap_scheduler", [False, True])
def test_auto_dtype(self, overlap_scheduler):
pytest.skip("https://nvbugs/5297821")
ctx_server_config = {
"pytorch_backend_config": {
"disable_overlap_scheduler": True
@ -238,12 +243,12 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
"urls": ["localhost:8002"]
}
}
with OpenAIServerClient(disaggregated_server_config,
ctx_server_config,
gen_server_config,
self.MODEL_PATH,
tensor_parallel_size=4) as client:
with launch_disaggregated_llm(disaggregated_server_config,
ctx_server_config,
gen_server_config,
self.MODEL_PATH,
tensor_parallel_size=4) as llm:
task = MMLU(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)

View File

@ -188,10 +188,11 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
@pytest.mark.skip(reason="https://nvbugspro.nvidia.com/bug/5292517")
@skip_pre_hopper
def test_fp8_llm_decoder(self):
def test_fp8_llm_sampler(self):
model_path = f"{llm_models_root()}/llama-3.1-model/Llama-3.1-8B-Instruct-FP8"
pytorch_config = PyTorchConfig(enable_trtllm_decoder=True)
pytorch_config = PyTorchConfig(enable_trtllm_sampler=True)
llm = LLM(model_path, pytorch_backend_config=pytorch_config)
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
@ -207,6 +208,79 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
extra_acc_spec="temperature=0.8,top_p=0.95")
class TestLlama3_2_1B(LlmapiAccuracyTestHarness):
MODEL_NAME = "meta-llama/Llama-3.2-1B"
MODEL_PATH = f"{llm_models_root()}/llama-3.2-models/Llama-3.2-1B"
EXAMPLE_FOLDER = "models/core/llama"
def test_auto_dtype(self):
with LLM(self.MODEL_PATH) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_smooth_quant(self):
quant_config = QuantConfig(
QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_smooth_quant_ootb(self):
quant_config = QuantConfig(QuantAlgo.W8A8_SQ_PER_CHANNEL)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_int4_awq(self):
quant_config = QuantConfig(QuantAlgo.W4A16_AWQ)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_int4_awq_int8_kv_cache(self):
quant_config = QuantConfig(QuantAlgo.W4A16_AWQ)
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.INT8)
with LLM(self.MODEL_PATH,
quant_config=quant_config,
kv_cache_config=kv_cache_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_pre_ada
def test_fp8(self):
quant_config = QuantConfig(QuantAlgo.FP8)
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.FP8)
with LLM(self.MODEL_PATH,
quant_config=quant_config,
kv_cache_config=kv_cache_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_pre_ada
@pytest.mark.skip_less_device(2)
def test_fp8_pp2(self):
quant_config = QuantConfig(QuantAlgo.FP8)
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.FP8)
with LLM(self.MODEL_PATH,
pipeline_parallel_size=2,
quant_config=quant_config,
kv_cache_config=kv_cache_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_pre_ada
@skip_post_blackwell
def test_fp8_rowwise(self):
quant_config = QuantConfig(QuantAlgo.FP8_PER_CHANNEL_PER_TOKEN)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
class TestLlama3_3_70BInstruct(LlmapiAccuracyTestHarness):
MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"
@ -924,7 +998,7 @@ class TestNemotronNas(LlmapiAccuracyTestHarness):
@pytest.mark.skip_less_device_memory(80000)
class TestNemotronSuper(LlmapiAccuracyTestHarness):
class TestLlama3_3NemotronSuper49Bv1(LlmapiAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1"
@ -939,6 +1013,20 @@ class TestNemotronSuper(LlmapiAccuracyTestHarness):
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
@pytest.mark.skip_less_device(2)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
def test_fp8_prequantized_tp2(self):
model_path = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8"
with LLM(model_path, tensor_parallel_size=2) as llm:
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
class TestNemotronNano(LlmapiAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
@ -946,10 +1034,61 @@ class TestNemotronNano(LlmapiAccuracyTestHarness):
def test_auto_dtype(self):
with LLM(self.MODEL_PATH) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
class TestNemotronUltra(LlmapiAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1"
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
with LLM(self.MODEL_PATH,
tensor_parallel_size=tp_size,
pipeline_parallel_size=pp_size,
moe_expert_parallel_size=ep_size,
use_cuda_graph=cuda_graph) as llm:
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_fp8_prequantized(self, cuda_graph, tp_size, pp_size, ep_size):
model_path = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1-FP8"
with LLM(model_path,
tensor_parallel_size=tp_size,
pipeline_parallel_size=pp_size,
moe_expert_parallel_size=ep_size,
use_cuda_graph=cuda_graph) as llm:
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
class TestNemotronH(LlmapiAccuracyTestHarness):
@ -1185,3 +1324,24 @@ class TestQwen3_235B_A22B(LlmapiAccuracyTestHarness):
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
class TestPhi4MiniInstruct(LlmapiAccuracyTestHarness):
MODEL_NAME = "microsoft/Phi-4-mini-instruct"
MODEL_PATH = f"{llm_models_root()}/Phi-4-mini-instruct"
@pytest.mark.skip(
reason=
"Temporarily skipping test_auto_dtype while resolving Phi-4's architecture issue."
)
def test_auto_dtype(self):
with LLM(self.MODEL_PATH) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))

View File

@ -24,23 +24,23 @@ from packaging import version
from .trt_test_alternative import check_call, check_output, exists, is_windows
def venv_check_call(venv, cmd, running_log=None, env=None):
def venv_check_call(venv, cmd, env=None, **kwargs):
def _war_check_call(*args, **kwargs):
kwargs["cwd"] = venv.get_working_directory()
return check_call(*args, **kwargs)
venv.run_cmd(cmd, caller=_war_check_call, running_log=running_log, env=env)
venv.run_cmd(cmd, caller=_war_check_call, env=env, **kwargs)
def venv_check_output(venv, cmd):
def venv_check_output(venv, cmd, env=None, **kwargs):
def _war_check_output(*args, **kwargs):
kwargs["cwd"] = venv.get_working_directory()
output = check_output(*args, **kwargs)
return output
return venv.run_cmd(cmd, caller=_war_check_output)
return venv.run_cmd(cmd, caller=_war_check_output, env=env, **kwargs)
def venv_mpi_check_call(venv, mpi_cmd, python_cmd):

View File

@ -22,6 +22,7 @@ import subprocess as sp
import tempfile
import time
import urllib.request
import warnings
from functools import wraps
from pathlib import Path
from typing import Iterable, Sequence
@ -2196,8 +2197,10 @@ def skip_by_host_memory(request):
IS_UNDER_CI_ENV = 'JENKINS_HOME' in os.environ
gpu_warning_threshold = 1024 * 1024 * 1024
def collect_status():
def collect_status(item: pytest.Item):
if not IS_UNDER_CI_ENV:
return
@ -2210,6 +2213,22 @@ def collect_status():
for idx in range(pynvml.nvmlDeviceGetCount())
}
deadline = time.perf_counter() + 60 # 1 min
observed_used = 0
global gpu_warning_threshold
while time.perf_counter() < deadline:
observed_used = max(
pynvml.nvmlDeviceGetMemoryInfo(device).used
for device in handles.values())
if observed_used <= gpu_warning_threshold:
break
time.sleep(1)
else:
gpu_warning_threshold = max(observed_used, gpu_warning_threshold)
warnings.warn(
f"Test {item.name} does not free up GPU memory correctly!")
gpu_memory = {}
for idx, device in handles.items():
total_used = pynvml.nvmlDeviceGetMemoryInfo(device).used // 1024 // 1024
@ -2218,13 +2237,12 @@ def collect_status():
process = {}
for entry in detail:
host_memory_in_mbs = -1
try:
host_memory_in_mbs = psutil.Process(
entry.pid).memory_full_info().uss // 1024 // 1024
p = psutil.Process(entry.pid)
host_memory_in_mbs = p.memory_full_info().uss // 1024 // 1024
process[entry.pid] = (entry.usedGpuMemory // 1024 // 1024,
host_memory_in_mbs)
except:
host_memory_in_mbs, p.cmdline())
except Exception:
pass
gpu_memory[idx] = {
@ -2239,7 +2257,7 @@ def collect_status():
@pytest.hookimpl(wrapper=True)
def pytest_runtest_protocol(item, nextitem):
ret = yield
collect_status()
collect_status(item)
return ret

View File

@ -18,14 +18,7 @@ import subprocess
import pytest
from defs.conftest import skip_no_hopper
def kill_disaggregated_processes():
"""Kill any existing disaggregated processes."""
try:
subprocess.run(['pkill', '-9', '-f', 'trtllm-serve'], check=False)
except Exception:
pass
from defs.trt_test_alternative import check_call, popen
def cleanup_output_files():
@ -120,93 +113,92 @@ def run_disaggregated_test(example_dir,
env=None,
cwd=None):
"""Run disaggregated test with given configuration."""
kill_disaggregated_processes()
cleanup_output_files()
num_ranks, config_file = get_test_config(test_desc, example_dir,
os.path.dirname(__file__))
# Start workers
workers_cmd = [
'mpirun', '--allow-run-as-root', '--oversubscribe', '-n',
str(num_ranks), 'trtllm-serve', 'disaggregated_mpi_worker', '-c',
config_file
]
with open('output_workers.log', 'w') as f:
workers_proc = subprocess.Popen(workers_cmd,
stdout=f,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
server_start_timeout = 900
# Start server
server_cmd = [
'trtllm-serve', 'disaggregated', '--server_start_timeout',
str(server_start_timeout), '-c', config_file
]
with open('output_disagg.log', 'w') as f:
server_proc = subprocess.Popen(server_cmd,
stdout=f,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
client_dir = f"{example_dir}/clients"
for _ in range(num_iters):
client_cmd = [
'python3', f'{client_dir}/disagg_client.py', '-c',
f'{example_dir}/disagg_config.yaml', '-p',
f'{client_dir}/prompts.json', '--ignore-eos',
'--server-start-timeout',
str(server_start_timeout)
]
subprocess.run(client_cmd, check=True, env=env)
# Streaming client run
streaming_client_cmd = client_cmd + [
'--streaming', '-o', 'output_streaming.json'
]
subprocess.run(streaming_client_cmd, check=True, env=env)
# Run the chat completion endpoint test only for TinyLlama
if test_desc == "overlap":
chat_client_cmd = client_cmd + [
'-e', 'chat', '-o', 'output_chat.json'
with ( # Start workers
open('output_workers.log', 'w') as output_workers,
popen(workers_cmd,
stdout=output_workers,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd),
# Start server
open('output_disagg.log', 'w') as output_disagg,
popen(server_cmd,
stdout=output_disagg,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)):
client_dir = f"{example_dir}/clients"
for _ in range(num_iters):
client_cmd = [
'python3', f'{client_dir}/disagg_client.py', '-c',
f'{example_dir}/disagg_config.yaml', '-p',
f'{client_dir}/prompts.json', '--ignore-eos',
'--server-start-timeout',
str(server_start_timeout)
]
subprocess.run(chat_client_cmd, check=True, env=env)
check_call(client_cmd, env=env)
streaming_chat_client_cmd = chat_client_cmd + [
'--streaming', '-o', 'output_streaming_chat.json'
# Streaming client run
streaming_client_cmd = client_cmd + [
'--streaming', '-o', 'output_streaming.json'
]
subprocess.run(streaming_chat_client_cmd, check=True, env=env)
check_call(streaming_client_cmd, env=env)
# Verify outputs
not_expected_strings = ["Berlin Berlin"]
# Run the chat completion endpoint test only for TinyLlama
if test_desc == "overlap":
chat_client_cmd = client_cmd + [
'-e', 'chat', '-o', 'output_chat.json'
]
check_call(chat_client_cmd, env=env)
output_files = ['output.json', 'output_streaming.json']
if test_desc == "overlap":
# Disable streaming chat completion for overlap test
# due to bug
output_files.extend(['output_chat.json'])
streaming_chat_client_cmd = chat_client_cmd + [
'--streaming', '-o', 'output_streaming_chat.json'
]
check_call(streaming_chat_client_cmd, env=env)
if test_desc.startswith("gen_only"):
continue
# Verify outputs
not_expected_strings = ["Berlin Berlin"]
for output_file in output_files:
with open(output_file, 'r') as f:
content = f.read()
if "deepseek_v3_lite" in test_desc or output_file == "output_chat.json":
expected_strings = ["Berlin", "Asyncio is a"]
else:
expected_strings = [
"The capital of Germany is Berlin",
"Asyncio is a Python library"
]
for expected_string in expected_strings:
assert expected_string in content, f"Expected string '{expected_string}' not found in {output_file}"
for not_expected_string in not_expected_strings:
assert not_expected_string not in content, f"Unexpected string '{not_expected_string}' found in {output_file}"
output_files = ['output.json', 'output_streaming.json']
if test_desc == "overlap":
# Disable streaming chat completion for overlap test
# due to bug
output_files.extend(['output_chat.json'])
if test_desc.startswith("gen_only"):
continue
for output_file in output_files:
with open(output_file, 'r') as f:
content = f.read()
if "deepseek_v3_lite" in test_desc or output_file == "output_chat.json":
expected_strings = ["Berlin", "Asyncio is a"]
else:
expected_strings = [
"The capital of Germany is Berlin",
"Asyncio is a Python library"
]
for expected_string in expected_strings:
assert expected_string in content, f"Expected string '{expected_string}' not found in {output_file}"
for not_expected_string in not_expected_strings:
assert not_expected_string not in content, f"Unexpected string '{not_expected_string}' found in {output_file}"
# Print outputs
print("------------------")
@ -221,8 +213,6 @@ def run_disaggregated_test(example_dir,
with open('output_disagg.log', 'r') as f:
print(f.read())
kill_disaggregated_processes()
@pytest.mark.parametrize("llama_model_root", ['TinyLlama-1.1B-Chat-v1.0'],
indirect=True)

View File

@ -9,6 +9,7 @@ from typing import List, Optional, Tuple
import aiohttp
import pytest
import yaml
from defs.trt_test_alternative import popen
from transformers import AutoTokenizer
from tensorrt_llm import logger
@ -53,11 +54,11 @@ def run_disaggregated_workers(
config_file
]
logger.info(f"Running workers with command: {' '.join(workers_cmd)}")
workers_proc = subprocess.Popen(workers_cmd,
stdout=stdout,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
workers_proc = popen(workers_cmd,
stdout=stdout,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
return workers_proc, ctx_servers, gen_servers
@ -500,19 +501,18 @@ def load_default_prompts(disaggregated_example_root: str):
@contextlib.contextmanager
def background_workers(llm_venv, config_file: str, num_ranks: int = None):
cwd = llm_venv.get_working_directory()
log_file = open(os.path.join(cwd, 'output_workers.log'), 'w')
workers_proc, ctx_servers, gen_servers = run_disaggregated_workers(
config_file=config_file,
stdout=log_file,
env=llm_venv._new_env,
cwd=cwd,
num_ranks=num_ranks)
try:
yield ctx_servers, gen_servers
finally:
workers_proc.terminate()
workers_proc.wait()
log_file.close()
with open(os.path.join(cwd, 'output_workers.log'), 'w') as log_file:
workers_proc, ctx_servers, gen_servers = run_disaggregated_workers(
config_file=config_file,
stdout=log_file,
env=llm_venv._new_env,
cwd=cwd,
num_ranks=num_ranks)
with workers_proc as proc:
yield ctx_servers, gen_servers
proc.terminate()
proc.wait()
@pytest.mark.parametrize("llama_model_root", ['TinyLlama-1.1B-Chat-v1.0'],

View File

@ -741,7 +741,7 @@ def test_trtllm_bench_pytorch_backend_sanity(llm_root, llm_venv,
dir="./",
delete=True,
delete_on_close=True) as running_log:
check_call(benchmark_cmd, shell=True, running_log=running_log)
check_call(benchmark_cmd, shell=True, stdout=running_log)
if model_id in mapping and not use_extra_config:
# extra config defines max kv cache tokens number to be 40000 which makes
# the checking process not unified.
@ -775,7 +775,7 @@ def test_trtllm_bench_mgmn(llm_root, llm_venv):
delete_on_close=True) as running_log:
check_call(benchmark_cmd,
shell=True,
running_log=running_log,
stdout=running_log,
env=llm_venv._new_env)
_check_mem_usage(running_log, [30, 0, 0, 0])
@ -928,7 +928,7 @@ def test_trtllm_bench_iteration_log(llm_root, llm_venv, model_name,
dir="./",
delete=True,
delete_on_close=True) as running_log:
check_call(benchmark_cmd, shell=True, running_log=running_log)
check_call(benchmark_cmd, shell=True, stdout=running_log)
_check_mem_usage(running_log, [19.4, 0, 0, 0])
else:
check_call(benchmark_cmd, shell=True)
@ -1454,7 +1454,7 @@ def test_ptp_quickstart(llm_root, llm_venv):
delete=True,
delete_on_close=True) as running_log:
venv_check_call(llm_venv, [str(example_root / "quickstart.py")],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [4.60, 0, 0, 0])
@ -1476,6 +1476,9 @@ def test_ptp_quickstart(llm_root, llm_venv):
pytest.param('Llama3.1-70B-FP8',
'llama-3.1-model/Llama-3.1-70B-Instruct-FP8',
marks=skip_pre_hopper),
pytest.param('Nemotron-Super-49B-v1-NVFP4',
'nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf',
marks=skip_pre_hopper),
pytest.param('Nemotron-Super-49B-v1-FP8',
'nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8',
marks=skip_pre_hopper),
@ -1517,7 +1520,7 @@ def test_ptp_quickstart_advanced(llm_root, llm_venv, model_name, model_path):
]
if "Qwen3" in model_name:
cmds.append(f"--kv_cache_fraction=0.6")
llm_venv.run_cmd(cmds, running_log=running_log)
llm_venv.run_cmd(cmds, stdout=running_log)
if model_name in mapping:
_check_mem_usage(running_log, [mapping[model_name], 0, 0, 0])
@ -1545,7 +1548,7 @@ def test_ptq_quickstart_advanced_mtp(llm_root, llm_venv, model_name,
"--model_dir",
f"{llm_models_root()}/{model_path}",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [54.50, 0, 0, 0])
@ -1601,7 +1604,7 @@ def test_ptp_quickstart_advanced_eagle3(llm_root, llm_venv, model_name,
"--disable_kv_cache_reuse",
"--disable_overlap_scheduler",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [25.2, 0, 0, 0])
@ -1635,7 +1638,7 @@ def test_ptp_quickstart_advanced_deepseek_r1_8gpus(llm_root, llm_venv,
"--max_seq_len=3000",
"--disable_kv_cache_reuse",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [106.3, 0, 0, 0], 8)
@ -1675,7 +1678,7 @@ def test_relaxed_acceptance_quickstart_advanced_deepseek_r1_8gpus(
"--relaxed_topk=10",
"--relaxed_delta=0.5",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [85.6, 0, 0, 0], 8)
# TODO: relaxed acceptance is incompatible with attention dp
# "--enable_attention_dp"
@ -1725,7 +1728,7 @@ def test_ptp_quickstart_advanced_8gpus(llm_root, llm_venv, model_name,
f"{llm_models_root()}/{model_path}",
"--tp_size=8",
],
running_log=running_log)
stdout=running_log)
if model_name in mapping:
_check_mem_usage(running_log, [mapping[model_name], 0, 0, 0], 8)
@ -1768,7 +1771,7 @@ def test_ptp_quickstart_advanced_mixed_precision(llm_root, llm_venv):
"--model_dir",
f"{llm_models_root()}/{model_path}",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [12.0, 0, 0, 0])
@ -1959,7 +1962,7 @@ def test_ptp_quickstart_multimodal(llm_root, llm_venv, model_name, model_path,
"--media",
*functionality_inputs[modality]["media"],
],
running_log=running_log)
stdout=running_log)
if model_name in mapping:
peak, fraction = mapping[model_name]

View File

@ -894,7 +894,8 @@ def prepare_gpt_2b_lora_engine(type, tensorrt_llm_gpt_example_root,
return engine_dir
def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root):
# Build GPT
if type == "python_backend":
engine_dir = os.path.join(tensorrt_llm_gpt_example_root, "engine_dir",
@ -904,8 +905,7 @@ def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
"gpt_175b_ifb")
convert_cmd = [
"python3",
f"{tensorrt_llm_gpt_example_root}/../generate_checkpoint_config.py",
"python3", f"{tensorrt_llm_example_root}/generate_checkpoint_config.py",
f"--output_path={engine_dir}/ckpt_config.json",
"--architecture=GPTForCausalLM", "--dtype=float16",
"--num_hidden_layers=96", "--num_attention_heads=96",
@ -948,7 +948,8 @@ def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
return engine_dir
def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root):
def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root):
# Build GPT
if type == "python_backend":
engine_dir = os.path.join(tensorrt_llm_gpt_example_root, "engine_dir",
@ -958,8 +959,7 @@ def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root):
"gpt_multi_node_ifb")
convert_cmd = [
"python3",
f"{tensorrt_llm_gpt_example_root}/../generate_checkpoint_config.py",
"python3", f"{tensorrt_llm_example_root}/generate_checkpoint_config.py",
f"--output_path={engine_dir}/ckpt_config.json",
"--architecture=GPTForCausalLM", "--dtype=float16",
"--num_hidden_layers=96", "--num_attention_heads=96",
@ -1111,7 +1111,8 @@ def prepare_llama_v2_13b_engine(tensorrt_llm_llama_example_root,
return engine_dir
def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
def prepare_llama_v3_8b_engine(tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
workers=8,
data_type="bfloat16"):
@ -1133,7 +1134,7 @@ def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
elif data_type == "fp8":
convert_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={llama_v3_8b_model_root}",
"--dtype=float16",
"--qformat=fp8",
@ -1186,6 +1187,7 @@ def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
def prepare_llama_v3_70b_engine(type,
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type="bfloat16"):
@ -1211,7 +1213,7 @@ def prepare_llama_v3_70b_engine(type,
elif data_type == "fp8":
convert_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={llama_v3_70b_model_root}",
"--dtype=float16",
"--qformat=fp8",
@ -1707,7 +1709,8 @@ def prepare_tiny_llama_1b_engine(type, tensorrt_llm_llama_example_root,
return engine_dir, xgrammar_tokenizer_info_path
def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_mixtral_example_root,
def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_example_root,
tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root,
llm_backend_root):
engine_dir = os.path.join(tensorrt_llm_mixtral_example_root, "engine_dir",
@ -1718,7 +1721,7 @@ def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_mixtral_example_root,
# Quantize model
quantize_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={mixtral_8x7b_v0_1_model_root}",
"--dtype=float16",
"--qformat=fp8",

View File

@ -0,0 +1,394 @@
#!/usr/bin/env python
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
import argparse
import queue
import sys
import time
from functools import partial
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException
#
# Simple streaming client for TRT-LLM inflight batching backend
#
# In order for this code to work properly, config.pbtxt must contain these values:
#
# model_transaction_policy {
# decoupled: True
# }
#
# parameters: {
# key: "gpt_model_type"
# value: {
# string_value: "inflight_batching"
# }
# }
#
# In order for gpt_model_type 'inflight_batching' to work, you must copy engine from
#
# tensorrt_llm/cpp/tests/resources/models/rt_engine/gpt2/fp16-inflight-batching-plugin/1-gpu/
#
class UserData:
def __init__(self):
self._completed_requests = queue.Queue()
def prepare_inputs(input_ids_data, input_lengths_data, request_output_len_data,
beam_width_data, temperature_data, streaming_data, end_id):
inputs = [
grpcclient.InferInput('input_ids', [1, 12], "INT32"),
grpcclient.InferInput('input_lengths', [1, 1], "INT32"),
grpcclient.InferInput('request_output_len', [1, 1], "UINT32"),
grpcclient.InferInput('beam_width', [1, 1], "UINT32"),
grpcclient.InferInput('temperature', [1, 1], "FP32"),
grpcclient.InferInput('streaming', [1, 1], "BOOL"),
grpcclient.InferInput('end_id', [1, 1], "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids_data)
inputs[1].set_data_from_numpy(input_lengths_data)
inputs[2].set_data_from_numpy(request_output_len_data)
inputs[3].set_data_from_numpy(beam_width_data)
inputs[4].set_data_from_numpy(temperature_data)
inputs[5].set_data_from_numpy(streaming_data)
inputs[6].set_data_from_numpy(end_id)
return inputs
def prepare_stop_signals():
inputs = [
grpcclient.InferInput('input_ids', [1, 1], "INT32"),
grpcclient.InferInput('input_lengths', [1, 1], "INT32"),
grpcclient.InferInput('request_output_len', [1, 1], "UINT32"),
grpcclient.InferInput('stop', [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(np.empty([1, 1], dtype=np.int32))
inputs[1].set_data_from_numpy(np.zeros([1, 1], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[0]], dtype=np.uint32))
inputs[3].set_data_from_numpy(np.array([[True]], dtype='bool'))
return inputs
# Define the callback function. Note the last two parameters should be
# result and error. InferenceServerClient will provide the results of an
# inference as grpcclient.InferResult in result. For successful
# inference, error will be None, otherwise it will be an object of
# tritonclientutils.InferenceServerException holding the error details
def callback(user_data, result, error):
if error:
user_data._completed_requests.put(error)
else:
user_data._completed_requests.put(result)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-v",
"--verbose",
action="store_true",
required=False,
default=False,
help="Enable verbose output",
)
parser.add_argument(
"-u",
"--url",
type=str,
required=False,
default="localhost:8001",
help="Inference server URL. Default is localhost:8001.",
)
parser.add_argument(
"-s",
"--ssl",
action="store_true",
required=False,
default=False,
help="Enable SSL encrypted channel to the server",
)
parser.add_argument(
"-t",
"--stream-timeout",
type=float,
required=False,
default=None,
help="Stream timeout in seconds. Default is None.",
)
parser.add_argument(
"-r",
"--root-certificates",
type=str,
required=False,
default=None,
help="File holding PEM-encoded root certificates. Default is None.",
)
parser.add_argument(
"-p",
"--private-key",
type=str,
required=False,
default=None,
help="File holding PEM-encoded private key. Default is None.",
)
parser.add_argument(
"-x",
"--certificate-chain",
type=str,
required=False,
default=None,
help="File holding PEM-encoded certificate chain. Default is None.",
)
parser.add_argument(
"-C",
"--grpc-compression-algorithm",
type=str,
required=False,
default=None,
help=
"The compression algorithm to be used when sending request to server. Default is None.",
)
parser.add_argument(
"-S",
"--streaming",
action="store_true",
required=False,
default=False,
help="Enable streaming mode. Default is False.",
)
parser.add_argument(
"-c",
"--check-output",
action="store_true",
required=False,
default=False,
help="Enable check of output ids for CI",
)
parser.add_argument(
"-b",
"--beam-width",
required=False,
type=int,
default=1,
help="Beam width value",
)
parser.add_argument(
"--temperature",
type=float,
required=False,
default=1.0,
help="temperature value",
)
parser.add_argument(
"--request-output-len",
type=int,
required=False,
default=16,
help="temperature value",
)
parser.add_argument(
'--stop-after-ms',
type=int,
required=False,
default=0,
help='Stop the generation early after the given number of milliseconds')
FLAGS = parser.parse_args()
print('=========')
input_ids = [[
28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257
]]
input_ids_data = np.array(input_ids, dtype=np.int32)
input_lengths = [[len(ii)] for ii in input_ids]
input_lengths_data = np.array(input_lengths, dtype=np.int32)
request_output_len = [[FLAGS.request_output_len]]
request_output_len_data = np.array(request_output_len, dtype=np.uint32)
beam_width = [[FLAGS.beam_width]]
beam_width_data = np.array(beam_width, dtype=np.uint32)
temperature = [[FLAGS.temperature]]
temperature_data = np.array(temperature, dtype=np.float32)
streaming = [[FLAGS.streaming]]
streaming_data = np.array(streaming, dtype=bool)
end_id = np.array([[6303]], dtype=np.uint32)
inputs = prepare_inputs(input_ids_data, input_lengths_data,
request_output_len_data, beam_width_data,
temperature_data, streaming_data, end_id)
if FLAGS.stop_after_ms > 0:
stop_inputs = prepare_stop_signals()
else:
stop_inputs = None
request_id = "12345"
import random
request_id = str(random.randint(3, 9000))
expected_output_ids = [
input_ids[0] + [
21221, 290, 257, 4255, 379, 262, 1957, 7072, 11, 4689, 347, 2852,
2564, 494, 13, 679
]
]
if FLAGS.streaming:
actual_output_ids = [input_ids[0]]
else:
actual_output_ids = []
user_data = UserData()
with grpcclient.InferenceServerClient(
url=FLAGS.url,
verbose=FLAGS.verbose,
ssl=FLAGS.ssl,
root_certificates=FLAGS.root_certificates,
private_key=FLAGS.private_key,
certificate_chain=FLAGS.certificate_chain,
) as triton_client:
try:
if FLAGS.streaming:
# Establish stream
triton_client.start_stream(
callback=partial(callback, user_data),
stream_timeout=FLAGS.stream_timeout,
)
# Send request
triton_client.async_stream_infer(
'tensorrt_llm',
inputs,
request_id=request_id,
)
if stop_inputs is not None:
time.sleep(FLAGS.stop_after_ms / 1000.0)
triton_client.async_stream_infer(
'tensorrt_llm',
stop_inputs,
request_id=request_id,
parameters={'Streaming': FLAGS.streaming})
# Wait for server to close the stream
triton_client.stop_stream()
# Parse the responses
while True:
try:
result = user_data._completed_requests.get(block=False)
except Exception:
break
if type(result) == InferenceServerException:
print("Received an error from server:")
print(result)
else:
output_ids = result.as_numpy('output_ids')
if output_ids is not None:
if (FLAGS.streaming):
# Only one beam is supported
tokens = list(output_ids[0][0])
actual_output_ids[
0] = actual_output_ids[0] + tokens
else:
for beam_output_ids in output_ids[0]:
tokens = list(beam_output_ids)
actual_output_ids.append(tokens)
else:
print("Got cancellation response from server")
else:
# Send request
triton_client.async_infer(
'tensorrt_llm',
inputs,
request_id=request_id,
callback=partial(callback, user_data),
parameters={'Streaming': FLAGS.streaming})
if stop_inputs is not None:
time.sleep(FLAGS.stop_after_ms / 1000.0)
triton_client.async_infer(
'tensorrt_llm',
stop_inputs,
request_id=request_id,
callback=partial(callback, user_data),
parameters={'Streaming': FLAGS.streaming})
processed_count = 0
expected_responses = 1 + (1 if stop_inputs is not None else 0)
while processed_count < expected_responses:
try:
result = user_data._completed_requests.get()
print("Got completed request", flush=True)
except Exception:
break
if type(result) == InferenceServerException:
print("Received an error from server:")
print(result)
else:
output_ids = result.as_numpy('output_ids')
if output_ids is not None:
for beam_output_ids in output_ids[0]:
tokens = list(beam_output_ids)
actual_output_ids.append(tokens)
else:
print("Got response for cancellation request")
processed_count = processed_count + 1
except Exception as e:
print("channel creation failed: " + str(e))
sys.exit()
passed = True
print("output_ids = ", actual_output_ids)
if (FLAGS.check_output):
passed = (actual_output_ids == expected_output_ids)
print("expected_output_ids = ", expected_output_ids)
print("\n=====")
print("PASS!" if passed else "FAIL!")
print("=====")
sys.exit(not passed)
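Because this new client only works against the config.pbtxt settings spelled out in its header comment, a hypothetical pre-flight helper (not part of this change) could verify them through the same gRPC client before streaming:

import tritonclient.grpc as grpcclient


def check_inflight_batching(url="localhost:8001", model="tensorrt_llm"):
    # Confirm the two settings the streaming client relies on: a decoupled
    # transaction policy and gpt_model_type set to "inflight_batching".
    with grpcclient.InferenceServerClient(url=url) as client:
        config = client.get_model_config(model).config
        decoupled = config.model_transaction_policy.decoupled
        model_type = config.parameters["gpt_model_type"].string_value
        return decoupled and model_type == "inflight_batching"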

View File

@ -2145,6 +2145,7 @@ def test_llama_v3_speculative_decoding_bls(
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
llama_v3_70b_model_root,
tensorrt_llm_example_root,
llm_backend_inflight_batcher_llm_root,
llm_backend_dataset_root,
llm_backend_venv,
@ -2161,16 +2162,19 @@ def test_llama_v3_speculative_decoding_bls(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build engine
DRAFT_ENGINE_DIR = prepare_llama_v3_8b_engine(
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
data_type=DATA_TYPE)
CONTROL_ENGINE_DIR = prepare_llama_v3_70b_engine(
"control_ifb",
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type=DATA_TYPE)
TARGET_ENGINE_DIR = prepare_llama_v3_70b_engine(
"target_ifb",
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type=DATA_TYPE)
@ -2310,6 +2314,7 @@ def test_gpt_175b_dummyWeights_ifb(
EXCLUDE_INPUT_IN_OUTPUT,
inflight_batcher_llm_client_root,
tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root,
gpt_tokenizer_model_root,
llm_backend_venv,
):
@ -2321,7 +2326,8 @@ def test_gpt_175b_dummyWeights_ifb(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build Engine
ENGINE_PATH = prepare_gpt_175b_engine("ifb", tensorrt_llm_gpt_example_root)
ENGINE_PATH = prepare_gpt_175b_engine("ifb", tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root)
# Prepare model repo
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
prepare_ib_model_repo(llm_backend_repo_root, new_model_repo)

View File

@ -86,7 +86,8 @@ def test_valgrind_llama_v2_13b(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build engine
ENGINE_PATH = prepare_llama_v2_13b_engine(tensorrt_llm_llama_example_root,
ENGINE_PATH = prepare_llama_v2_13b_engine(tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v2_tokenizer_model_root)
# Prepare model repo

View File

@ -10,6 +10,7 @@ from .common import *
@pytest.mark.skip_less_device_memory(80000)
def test_gpt175b_dummyWeights_multi_node_engine_config(
tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root,
gpt_tokenizer_model_root,
):
ACCUMULATE_TOKEN = "False"
@ -36,7 +37,8 @@ def test_gpt175b_dummyWeights_multi_node_engine_config(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build Engine
ENGINE_PATH = prepare_gpt_multi_node_engine("ifb",
tensorrt_llm_gpt_example_root)
tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root)
# Prepare model repo
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
prepare_ib_model_repo(llm_backend_repo_root, new_model_repo)

View File

@ -42,7 +42,7 @@ def get_rcca_path():
@pytest.mark.parametrize("KV_CACHE_FREE_GPU_MEM_FRACTION", [""])
@pytest.mark.parametrize("ENABLE_TRT_OVERLAP", ["False"],
ids=["disableTrtOverlap"])
@pytest.mark.parametrize("BATCHING_STRATEGY", ["V1"])
@pytest.mark.parametrize("BATCHING_STRATEGY", ["inflight_fused_batching"])
@pytest.mark.parametrize("DECOUPLED_MODE", ["False"],
ids=["disableDecoupleMode"])
@pytest.mark.parametrize("TRITON_MAX_BATCH_SIZE", ["128"])
@ -618,6 +618,7 @@ def test_rcca_bug_4714193(
TOP_K,
TOP_P,
TEMPERATURE,
tensorrt_llm_example_root,
tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root,
llm_backend_root,
@ -631,8 +632,8 @@ def test_rcca_bug_4714193(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build engine
ENGINE_PATH = prepare_rcca_nvbug_4714193_engine(
tensorrt_llm_mixtral_example_root, mixtral_8x7b_v0_1_model_root,
llm_backend_root)
tensorrt_llm_example_root, tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root, llm_backend_root)
# Prepare model repo
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")

View File

@ -6,7 +6,8 @@ import platform
import signal
import subprocess
import sys
import tempfile
import time
import warnings
import psutil
@ -68,7 +69,9 @@ if is_linux():
return pids
def cleanup_process_tree(p: subprocess.Popen, has_session=False):
def cleanup_process_tree(p: subprocess.Popen,
has_session=False,
verbose_message=False):
target_pids = set()
if has_session:
# Session ID is the pid of the leader process
@ -82,8 +85,30 @@ if is_linux():
except psutil.Error:
pass
print("Found leftover pids:", target_pids)
for pid in target_pids:
persist_pids = []
if target_pids:
# Grace period
time.sleep(5)
lines = []
for pid in sorted(target_pids):
try:
sp = psutil.Process(pid)
if verbose_message:
cmdline = sp.cmdline()
lines.append(f"{pid}: {cmdline}")
persist_pids.append(pid)
except psutil.Error:
pass
if persist_pids:
msg = f"Found leftover subprocesses: {persist_pids} launched by {p.args}"
if verbose_message:
detail = '\n'.join(lines)
msg = f"{msg}\n{detail}"
warnings.warn(msg)
for pid in persist_pids:
try:
os.kill(pid, signal.SIGKILL)
except (ProcessLookupError, PermissionError):
@ -148,6 +173,29 @@ elif is_windows():
p.kill()
@contextlib.contextmanager
def popen(*popenargs,
start_new_session=True,
suppress_output_info=False,
**kwargs):
if not suppress_output_info:
print(f"Start subprocess with popen({popenargs}, {kwargs})")
with Popen(*popenargs, start_new_session=start_new_session, **kwargs) as p:
try:
yield p
if start_new_session:
cleanup_process_tree(p, True, True)
except Exception as e:
cleanup_process_tree(p, start_new_session)
if isinstance(e, subprocess.TimeoutExpired):
print("Process timed out.")
stdout, stderr = p.communicate()
e.output = stdout
e.stderr = stderr
raise
def call(*popenargs,
timeout=None,
start_new_session=True,
@ -155,31 +203,11 @@ def call(*popenargs,
**kwargs):
if not suppress_output_info:
print(f"Start subprocess with call({popenargs}, {kwargs})")
running_log = None
if "running_log" in kwargs:
if isinstance(kwargs["running_log"], tempfile._TemporaryFileWrapper):
running_log = kwargs["running_log"]
kwargs.pop("running_log", 'Not Found')
with Popen(*popenargs,
with popen(*popenargs,
start_new_session=start_new_session,
stdout=running_log,
suppress_output_info=True,
**kwargs) as p:
try:
retcode = p.wait(timeout=timeout)
if retcode and start_new_session:
cleanup_process_tree(p, True)
return retcode
except Exception as e:
if isinstance(e, subprocess.TimeoutExpired):
print("Process timed out.")
stdout, stderr = p.communicate()
if stdout:
print("STDOUT:", stdout.decode('utf-8', errors='replace'))
if stderr:
print("STDERR:", stderr.decode('utf-8', errors='replace'))
cleanup_process_tree(p, start_new_session)
raise
return p.wait(timeout=timeout)
def check_call(*popenargs, **kwargs):
@ -212,9 +240,9 @@ def check_output(*popenargs, timeout=None, start_new_session=True, **kwargs):
cleanup_process_tree(process, start_new_session)
raise
retcode = process.poll()
if start_new_session:
cleanup_process_tree(process, True, True)
if retcode:
if start_new_session:
cleanup_process_tree(process, True)
raise subprocess.CalledProcessError(retcode,
process.args,
output=stdout,
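After this refactor, call()/check_call() simply forward to the new popen() context manager, which attaches any captured output to a TimeoutExpired and always sweeps leftover children on exit. A simplified, self-contained sketch of that control flow (placeholder command; the real helper uses cleanup_process_tree rather than a bare kill):

import contextlib
import subprocess


@contextlib.contextmanager
def popen_sketch(*popenargs, **kwargs):
    with subprocess.Popen(*popenargs, **kwargs) as p:
        try:
            yield p
        except subprocess.TimeoutExpired as e:
            # Stand-in for cleanup_process_tree(): stop the process, then
            # attach whatever it printed to the exception being re-raised.
            p.kill()
            e.output, e.stderr = p.communicate()
            raise
        finally:
            # Make sure nothing outlives the context on any exit path.
            if p.poll() is None:
                p.kill()
            p.wait()


with popen_sketch(["sleep", "1"]) as p:
    retcode = p.wait(timeout=10)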

View File

@ -375,6 +375,14 @@ accuracy/test_cli_flow.py::TestLlama3_2_1B::test_fp8_rowwise
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_weight_streaming[1.0]
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache_beam_search
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_smooth_quant
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_smooth_quant_ootb
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_int4_awq
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_int4_awq_int8_kv_cache
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_pp2
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_rowwise
accuracy/test_cli_flow.py::TestMistral7B::test_beam_search
accuracy/test_cli_flow.py::TestMistral7B::test_fp8_tp4pp2
accuracy/test_cli_flow.py::TestMistral7B::test_smooth_quant_tp4pp1
@ -425,7 +433,7 @@ accuracy/test_llm_api.py::TestMixtral8x7B::test_tp2
accuracy/test_llm_api.py::TestMixtral8x7B::test_smooth_quant_tp2pp2
accuracy/test_llm_api.py::TestMixtral8x7BInstruct::test_awq_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_decoder
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_3_70BInstruct::test_fp8_tp4
accuracy/test_llm_api_pytorch.py::TestLlama3_3_70BInstruct::test_nvfp4_tp4
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
@ -445,8 +453,16 @@ accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp_
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestMinitron4BBaseInstruct::test_fp8_prequantized
accuracy/test_llm_api_pytorch.py::TestNemotronNas::test_auto_dtype_tp8
accuracy/test_llm_api_pytorch.py::TestNemotronSuper::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2
accuracy/test_llm_api_pytorch.py::TestNemotronNano::test_auto_dtype
accuracy/test_cli_flow.py::TestNemotronNano::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
accuracy/test_cli_flow.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
accuracy/test_cli_flow.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronH::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestQwen2_7BInstruct::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_8gpus[latency]

View File

@ -24,6 +24,7 @@ test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-FP8-llama-3.1-model/Llama-
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-8B]
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-70B]
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-FP8-llama-3.1-model/Llama-3.1-70B-Instruct-FP8]
test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-NVFP4-nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf]
test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-FP8-nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8]
test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-NVFP4-nvfp4-quantized/Mixtral-8x7B-Instruct-v0.1]
test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-FP8-Mixtral-8x7B-Instruct-v0.1-fp8]

View File

@ -101,6 +101,7 @@ accuracy/test_cli_flow.py::TestLlama3_1_8B::test_fp8_rowwise_tp4[disable_gemm_al
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_autoq
accuracy/test_cli_flow.py::TestLlama3_1_8BInstruct::test_medusa_fp8_prequantized
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_nvfp4_prequantized_tp4
accuracy/test_cli_flow.py::TestMistral7B::test_fp8_tp4pp2
@ -120,14 +121,21 @@ accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype[tp8-c
accuracy/test_llm_api_pytorch.py::TestMixtral8x7B::test_fp8_tp2
accuracy/test_llm_api_pytorch.py::TestMixtral8x7B::test_nvfp4_tp2
accuracy/test_llm_api_pytorch.py::TestNemotronNas::test_auto_dtype_tp8
accuracy/test_llm_api_pytorch.py::TestNemotronSuper::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestNemotronNano::test_auto_dtype
accuracy/test_cli_flow.py::TestNemotronNano::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8-cuda_graph=False]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=False]
accuracy/test_cli_flow.py::TestNemotronUltra::test_auto_dtype[tp8-cuda_graph=False]
accuracy/test_cli_flow.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestQwen3_8B::test_fp8_block_scales[latency]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8_block_scales[latency]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[latency_moe_cutlass]
accuracy/test_llm_api_pytorch.py::TestPhi4MiniInstruct::test_auto_dtype
# Pivot to Pytorch test cases.
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-BF16-llama-3.1-model/Meta-Llama-3.1-8B]

View File

@ -58,9 +58,6 @@ trt_llm_release_perf_sanity_test:
# E2E gptManagerBenchmark IFB
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-cppmanager-exe-static_batching-plugin_ifb-float16-bs:8+64-input_output_len:128,128+512,32]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-cppmanager-exe-plugin_ifb-bfloat16-gwp:0.0-input_output_len:128,128+512,32]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:512,32]
- perf/test_perf.py::test_perf[qwen2_7b_instruct-bench-float16-input_output_len:128,128]

View File

@ -49,8 +49,14 @@ trt_llm_release_perf_test:
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-bfloat16-maxbs:64-input_output_len:20000,2000-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-bfloat16-maxbs:64-input_output_len:20000,2000-quant:fp8-con:250]
# pyt backend
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-input_output_len:2000,2000]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-maxnt:5000-input_output_len:5000,500-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:500,2000-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:1000,1000-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-maxnt:20000-input_output_len:20000,2000-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:5000,500-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:500,2000-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:1000,1000-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:20000,2000-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:512,32]

View File

@ -19,5 +19,6 @@ l0_dgx_h200:
- accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_fp8_blockscale[latency] # 1h
- accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype[True]
- accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype[False]
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-enable_graph-tp8-trtllm-scout]
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-disable_adp-enable_graph-tp8-trtllm-scout]
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep4-enable_adp-enable_graph-tp8-trtllm-scout]
- unittest/llmapi/test_llm_pytorch.py::test_nemotron_nas_lora

View File

@ -25,6 +25,7 @@ l0_rtx_pro_6000:
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-FP8-llama-3.1-model/Llama-3.1-8B-Instruct-FP8]
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-70B]
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-FP8-llama-3.1-model/Llama-3.1-70B-Instruct-FP8]
- test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-NVFP4-nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf]
- test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-FP8-nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8]
- test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-NVFP4-nvfp4-quantized/Mixtral-8x7B-Instruct-v0.1]
- test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-FP8-Mixtral-8x7B-Instruct-v0.1-fp8]

View File

@ -83,7 +83,6 @@ full:B200_PCIe/examples/test_llama.py::test_llm_llama_v2_lora_1gpu[chinese-llama
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-mini-128k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-8k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/functional SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/quantization SKIP (Disable for Blackwell)
full:B200_PCIe/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
@ -174,7 +173,6 @@ full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-128k
full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3-mini-128k-instruct-fp8-float16] SKIP (Disable for Blackwell)
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3.5-mini-instruct-fp8-float16] SKIP (Disable for Blackwell)
full:B200/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
full:B200/unittest/trt/functional SKIP (Disable for Blackwell)
full:B200/unittest/trt/quantization SKIP (Disable for Blackwell)
full:B200/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
@ -330,11 +328,6 @@ full:B200/test_e2e.py::test_ptp_quickstart_advanced[Nemotron4_4B-BF16-nemotron/M
full:B200/test_e2e.py::test_ptp_scaffolding[DeepSeek-R1-Distill-Qwen-7B-DeepSeek-R1/DeepSeek-R1-Distill-Qwen-7B] SKIP (https://nvbugs/5136994)
full:B200/test_e2e.py::test_trtllm_bench_pytorch_backend_sanity[meta-llama/Llama-3.1-8B-llama-3.1-8b-hf-nvfp4-False-False] SKIP (https://nvbugs/5136994)
examples/test_multimodal.py::test_llm_multimodal_general[kosmos-2-pp:1-tp:1-float16-bs:8-cpp_e2e:True-nb:1] SKIP (https://nvbugs/5141288)
examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2_vl_7b_instruct-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_qwen_single_gpu_summary[qwen2_vl_7b_instruct-enable_paged_kv_cache-enable_remove_input_padding-disable_weight_only-disable_fmha] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_qwen_single_gpu_summary[qwen2_vl_7b_instruct-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha_fp32_acc] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_qwen_awq_single_gpu_summary[qwen2_vl_7b_instruct-nb:4] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_hf_qwen_quantization_1gpu[qwen2_vl_7b_instruct-fp8-bfloat16] SKIP (https://nvbugs/5141290)
unittest/_torch/auto_deploy/integration/test_lm_eval.py SKIP (https://nvbugs/5144854)
examples/test_qwen.py::test_llm_qwen1_5_moe_plugin_single_gpu_lora[qwen1.5_moe_a2.7b_chat-Upcycled-Qwen1.5-MoE2.7B-LoRA] SKIP (https://nvbugs/5155141)
@ -368,7 +361,6 @@ full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:w4
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:int8_sq_per_tensor] SKIP (https://nvbugspro.nvidia.com/bug/5161074)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:int8_sq_per_token_channel] SKIP (https://nvbugspro.nvidia.com/bug/5161074)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_cpp_session-recurrentgemma-2b-use_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5174573)
examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (https://nvbugs/5180961)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-no_paged_cache-disable_quant-float16-disable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-no_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-use_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
@ -401,6 +393,9 @@ perf/test_perf.py::test_perf[t5-bench-float16-input_output_len:128,20-gpus:2] SK
perf/test_perf.py::test_perf[t5-bench-float16-maxbs:1-input_output_len:128,20-gpus:2] SKIP
perf/test_perf.py::test_perf[gpt_20b-bench-float16-maxbs:8-input_output_len:128,128-reqs:80-gpus:8] SKIP
perf/test_perf.py::test_perf[gpt_20b-bench-float16-maxbs:8-input_output_len:512,32-reqs:80-gpus:8] SKIP
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:512-input_output_len:128,128-ep:8-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:1-input_output_len:1000,2000-reqs:10-ep:4-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:384-maxnt:1536-input_output_len:1000,2000-reqs:49152-con:3072-ep:8-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[deepseek_v3_lite_fp8-bench-pytorch-float8-input_output_len:128,128] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[mixtral_8x7b_v0.1_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-tp:2-gpus:2] SKIP #https://docs.google.com/spreadsheets/d/1EvwCcJ5o2zmhVxFFxAAz-49UzswMlfN2y5K37Fkyw7A/edit?gid=907483661#gid=907483661
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[llama_v3.3_nemotron_49b-bench-pytorch-bfloat16-input_output_len:128,128-tp:2-gpus:2] SKIP #https://docs.google.com/spreadsheets/d/1EvwCcJ5o2zmhVxFFxAAz-49UzswMlfN2y5K37Fkyw7A/edit?gid=907483661#gid=907483661
@ -413,6 +408,7 @@ accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache SKIP (https://n
test_e2e.py::test_ptp_quickstart_multimodal[NVILA-8B-FP16-vila/NVILA-8B-image] SKIP (https://nvbugs/5233423)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[ep4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[tp4-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False] SKIP (https://nvbugs/5294983)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[tp4-mtp_nextn=2-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[ep4-mtp_nextn=2-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False] SKIP (https://nvbugs/5234002)
@ -426,13 +422,13 @@ examples/test_bert.py::test_llm_bert_general[compare_hf-enable_remove_input_padd
examples/test_bert.py::test_llm_bert_general[compare_hf-enable_remove_input_padding-use_attention_plugin-enable_context_fmha-tp:2-pp:1-float16-RobertaForQuestionAnswering-bert/roberta-base-squad2] SKIP (https://nvbugs/5234058)
disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0] SKIP (https://nvbugs/5247271)
disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_tp1_attention_dp_overlap_one_mtp[DeepSeek-V3-Lite-fp8] SKIP (https://nvbugspro.nvidia.com/bug/5273945)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-disable_adp-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep4-enable_adp-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_tp4[enable_gemm_allreduce_plugin] SKIP (https://nvbugs/5247786)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen1.5_7b_chat-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2_7b_instruct-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2.5_7b_chat-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_mixtral.py::test_llm_mixtral_pp_reduce_scatter_4gpus[Mixtral-8x7B-v0.1] SKIP (https://nvbugs/5247837)
examples/test_qwen.py::test_llm_qwen_smooth_quant_single_gpu_summary[qwen2_vl_7b_instruct-enable_ptpc-nb:4] SKIP (https://nvbugs/5273694)
accuracy/test_cli_flow.py::TestMixtral8x22B::test_int8_plugin_tp8[renormalize-tensor_parallel] SKIP (https://nvbugs/5273695)
test_e2e.py::test_ptp_quickstart_advanced_8gpus[Nemotron-Ultra-253B-nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1] SKIP (https://nvbugs/5273697)
examples/test_gpt.py::test_starcoder_fp8_quantization_2gpu[starcoder] SKIP (https://nvbugs/5144931)
@ -440,7 +436,6 @@ examples/test_gpt.py::test_starcoder_fp8_quantization_2gpu[starcoderplus] SKIP (
unittest/_torch -k "not (modeling or multi_gpu or auto_deploy)" SKIP (https://nvbugs/5280806)
examples/test_whisper.py::test_llm_whisper_general[large-v3-disable_gemm_plugin-disable_attention_plugin-disable_weight_only-float16-nb:1-use_python_runtime] SKIP (https://nvbugs/5244570)
unittest/_torch/speculative/test_eagle3.py SKIP (https://nvbugs/5280806)
test_e2e.py::test_ptp_quickstart_multimodal[qwen2-vl-7b-instruct-Qwen2-VL-7B-Instruct-image] SKIP (https://nvbugs/5226211)
triton_server/test_triton_rcca.py::test_mistral_beam_search[rcca_4714407-True-10-False-True-False-0-128-disableDecoupleMode-inflight_fused_batching-disableTrtOverlap-guaranteed_no_evict-1-1-1-False-ensemble] SKIP (https://nvbugs/5240060)
triton_server/test_triton.py::test_triton_extensive[triton-extensive] SKIP
triton_server/test_triton.py::test_gpt_speculative_decoding[gpt-speculative-decoding] SKIP

View File

@ -19,15 +19,22 @@ from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
@pytest.mark.parametrize("tp_size", [1, 8], ids=["tp1", "tp8"])
@pytest.mark.parametrize("use_cuda_graph", [True, False],
ids=["enable_graph", "disable_graph"])
@pytest.mark.parametrize("enable_attention_dp", [True, False],
ids=["enable_adp", "disable_adp"])
@pytest.mark.parametrize("ep_size", [4, 1], ids=["ep4", "ep1"])
@pytest.mark.parametrize("pp_size", [1, 8], ids=["pp1", "pp8"])
def test_llama4(model_name, backend, tp_size, use_cuda_graph, ep_size, pp_size):
def test_llama4(model_name, backend, tp_size, use_cuda_graph,
enable_attention_dp, ep_size, pp_size):
if pp_size > 1 and (ep_size > 1 or tp_size > 1):
return
if pp_size == 1 and tp_size == 1:
return
if enable_attention_dp and not (tp_size == 8 and ep_size == 4
and pp_size == 1):
pytest.skip("Skip this attention DP test case to avoid too many tests")
prompts = [{
"prompt": "The president of the United States is"
}, {
@ -52,6 +59,7 @@ def test_llama4(model_name, backend, tp_size, use_cuda_graph, ep_size, pp_size):
moe_tensor_parallel_size=tp_size // ep_size,
pytorch_backend_config=pytorch_config,
pipeline_parallel_size=pp_size,
enable_attention_dp=enable_attention_dp,
)
with llm:
outputs = llm.generate(