Release 0.20 to main (#4577)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
amirkl94 2025-05-28 11:25:33 +03:00 committed by GitHub
parent b800adc65c
commit fbec0c3552
45 changed files with 1305 additions and 393 deletions

View File

@ -1,7 +1,7 @@
version: "3.9"
services:
tensorrt_llm-dev:
image: urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400
image: urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539
network_mode: host
ipc: host

View File

@ -1,2 +1,9 @@
# These vulnerabilities were inherited from the base image (pytorch:25.05-py3) and should be removed when the base image
# is updated.
# WAR against https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
h11>=0.16.0
# WAR against https://github.com/advisories/GHSA-7cx3-6m66-7c5m
tornado>=6.5.0
# WAR against https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
setuptools>=78.1.1

View File

@ -72,9 +72,14 @@ RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh
RUN pip3 uninstall -y opencv && rm -rf /usr/local/lib/python3*/dist-packages/cv2/
RUN pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir
# WAR against https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
RUN pip3 install --upgrade h11>=0.16 --no-cache-dir
# WARs against security issues inherited from pytorch:25.04
# * https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
# * https://github.com/advisories/GHSA-7cx3-6m66-7c5m
# * https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
RUN pip3 install --upgrade --no-cache-dir \
"h11>=0.16" \
"tornado>=6.5.0" \
"setuptools>=78.1.1,<80"
FROM ${TRITON_IMAGE}:${TRITON_BASE_TAG} AS triton
@ -173,5 +178,9 @@ RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh
FROM release AS tritonrelease
WORKDIR /app/tensorrt_llm
COPY ./triton_backend/ ./triton_backend/
COPY ./triton_backend/all_models ./triton_backend/all_models
COPY ./triton_backend/scripts ./triton_backend/scripts
COPY ./triton_backend/tools ./triton_backend/tools
COPY ./triton_backend/inflight_batcher_llm/scripts ./triton_backend/inflight_batcher_llm/scripts
COPY ./triton_backend/inflight_batcher_llm/client ./triton_backend/inflight_batcher_llm/client
COPY --from=tritonbuild /opt/tritonserver/backends/tensorrtllm /opt/tritonserver/backends/tensorrtllm

View File

@ -0,0 +1,75 @@
(kv-cache-management)=
# KV Cache Management: Pools, Blocks, and Events
This document provides an overview of the internal hierarchy and event system for paged KV cache management, as implemented in the TensorRT-LLM codebase.
For more information on KV cache reuse, see [KV cache reuse](kv-cache-reuse.md).
---
## Hierarchy: Pool, Block, and Page
### **Block**
- **Definition:** The smallest unit of KV cache allocation. A `KVCacheBlock` holds metadata (not the actual data) for a chunk of KV cache.
- **Purpose:** Each block represents a fixed number of tokens' worth of KV data, configurable via the `tokens_per_block` parameter.
- **Usage:** Blocks are allocated, reused, or evicted as sequences are processed.
### **Page**
- **Definition:** In this codebase, "page" is often used interchangeably with "block" (as in "paged KV cache"); strictly speaking, a page refers to a hardware-level memory page, while a block is a logical unit of the cache.
- **In Practice:** The code uses "block" as the main unit; "page" is not a distinct class or struct.
### **Pool**
- **Definition:** A pool is a contiguous memory buffer (or set of buffers) that holds the actual KV data for one or more layers.
- **Types:** There are primary pools (fast GPU memory) and secondary pools (slower, e.g., CPU or offload memory).
- **Organization:** Each pool can serve multiple layers that share the same KV head configuration. Pools are managed by `KVCacheBlockPool` and tracked in vectors in `WindowBlockManager`.
- **Block ↔ Pool:** Each block is an index into a pool; the pool provides the actual storage, while the block is the metadata handle.
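To make this split concrete, here is a minimal sketch of the block/pool relationship. The class and field names are illustrative only (they are not the actual `KVCacheBlock`/`KVCacheBlockPool` definitions); the point is that a block is a cheap metadata handle, while the pool owns the storage.

```python
from dataclasses import dataclass

import torch


@dataclass
class Block:
    """Metadata handle for one chunk of KV cache (illustrative, not the real KVCacheBlock)."""
    block_id: int            # index into the backing pool
    pool_id: int             # which pool provides the storage
    is_primary: bool = True  # primary (GPU) vs. secondary (offload) residency
    ref_count: int = 0       # 0 means the block is eligible for reuse or eviction


@dataclass
class Pool:
    """Contiguous storage for the KV data of all layers sharing one KV-head configuration."""
    num_kv_heads: int
    tokens_per_block: int
    storage: torch.Tensor    # e.g. [num_blocks, 2, num_kv_heads, tokens_per_block, head_dim]

    def block_view(self, block: Block) -> torch.Tensor:
        # The block itself holds no KV data; it only indexes into the pool.
        return self.storage[block.block_id]
```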
### **WindowBlockManager/BlockManager**
TRT-LLM supports two advanced features related to KV cache management:
1. **Variable Group-Query Attention (VGQA)**, i.e., a different `num_kv_heads` value for different layers.
2. **Variable Sliding Window Attention (VSWA)**, i.e., a different `attention_window_size` value for different layers.
To support both of these features, pool management is organized as described below.
In the simple, *most common* case, however, where a model uses
1. [MHA/MQA/non-variable GQA](gpt-attention.md#multi-head-multi-query-and-group-query-attention), i.e., the same `num_kv_heads` value for all layers, and
2. global attention/[SWA](gpt-attention.md#sliding-window-attention-cyclic-rolling-buffer-kv-cache), i.e., the same `attention_window_size` value for all layers,
only a *single* pool will be created within the structure described below.
#### KV Cache Pool Management
- **WindowBlockManager:** Manages blocks and pools for a specific attention window size. Within a `WindowBlockManager`, there can be multiple pools, each corresponding to a unique number of KV heads; this is what supports VGQA.
- **BlockManager:** Manages all `WindowBlockManager` instances, one per unique window size.
**Hierarchy Summary:**
- **Pool** (memory buffer for KV data)
- Contains many blocks.
- **Blocks** (metadata for a chunk of the pool, each block = `tokens_per_block` tokens)
- (Optionally, blocks can be swapped between primary/secondary pools.)
- **BlockManager/WindowBlockManager**: Manage pools and blocks, handle allocation, reuse, and eviction.
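As a rough sketch of how these pieces nest (reusing the illustrative `Pool`/`Block` types from the sketch above; these are not the actual TensorRT-LLM classes):

```python
from typing import Dict, List


class WindowBlockManager:
    """Pools and free blocks for a single attention window size."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.pools: Dict[int, "Pool"] = {}    # one pool per unique num_kv_heads (VGQA)
        self.free_blocks: List["Block"] = []  # blocks available for allocation or reuse


class BlockManager:
    """One WindowBlockManager per unique attention window size (VSWA)."""

    def __init__(self, window_sizes: List[int]):
        self.window_managers = {w: WindowBlockManager(w) for w in set(window_sizes)}


# In the common case (same num_kv_heads and same window size for every layer),
# this collapses to a single WindowBlockManager holding a single pool.
```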
---
## Events in `KVCacheEventManager`
The `KVCacheEventManager` is responsible for tracking and reporting significant changes in the state of the KV cache. Events are used for logging and debugging, and can also feed external monitoring.
### **Types of Events**
- **Created Event:** When pools or blocks are created/allocated.
- **Updated Event:** When a block's state changes (e.g., moved between primary/secondary, priority updated).
- **Removed Event:** When a block is removed from the cache (evicted or released).
- **Stored Event:** When blocks are stored for potential reuse (e.g., after a sequence finishes and its blocks are reusable).
### **What Triggers an Event?**
- **Allocation/Deallocation:** Creating or freeing memory pools or blocks.
- **Eviction/Reuse:** When a block is evicted, reused, or its priority changes.
- **Block Movement:** When a block is moved between memory levels (primary ↔ secondary).
- **Block Storage:** When blocks are stored for future reuse (e.g., after a sequence completes).
**In summary:**
An "event" is any significant change in the lifecycle or state of a KV cache block or pool, tracked for monitoring, debugging, or optimization purposes.
---

View File

@ -104,6 +104,7 @@ Welcome to TensorRT-LLM's Documentation!
advanced/inference-request.md
advanced/lora.md
advanced/expert-parallelism.md
advanced/kv-cache-management.md
advanced/kv-cache-reuse.md
advanced/speculative-decoding.md
advanced/disaggregated-service.md

View File

@ -4,6 +4,8 @@ In Transformer-based models, the KV (Key-Value) Cache is a mechanism used to opt
Since KV Cache requires memory to store, it is also an important resource.
In TensorRT-LLM, KV Cache is managed by the `KVCacheManager`.
For details of the TensorRT-LLM `KVCacheManager` implementation see [KV Cache Management](../advanced/kv-cache-management.md).
## KV Cache Manager Introduction
`KVCacheManager` is a type of resource manager, inheriting from `BaseResourceManager`.
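For orientation, the relationship can be pictured roughly as below; the method names are an assumption for illustration and may not match the actual `BaseResourceManager` interface.

```python
from abc import ABC, abstractmethod


class BaseResourceManager(ABC):
    """Illustrative base class; the real TensorRT-LLM interface may differ."""

    @abstractmethod
    def prepare_resources(self, scheduled_batch):
        """Reserve whatever the scheduled batch needs before execution."""

    @abstractmethod
    def free_resources(self, request):
        """Release resources held by a finished request."""


class KVCacheManager(BaseResourceManager):
    """Sketch: the KV cache is one kind of managed resource."""

    def prepare_resources(self, scheduled_batch):
        ...  # allocate KV cache blocks for the sequences in the batch

    def free_resources(self, request):
        ...  # return the request's blocks to the free pool (or keep them for reuse)
```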

View File

@ -1,24 +0,0 @@
#!/bin/bash
dataset="template_trtllm_openai_completions.json"
output_folder="output_loadgen"
port=8000
host="localhost"
max_count=256
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
streaming="False"
input_tokens=128
output_tokens=128
concurrency=32
infserver_loadgen ${dataset} \
--output_dir "${output_folder}" \
--set dataset.input_tokens:int="${input_tokens}" \
--set dataset.output_tokens:int="${output_tokens}" \
--set dataset.max_count:int="${max_count}" \
--set dataset.model_name:str="${model_name}" \
--set dataset.max_concurrent_requests:int="${concurrency}" \
--set inference_server.host:str="${host}" \
--set inference_server.port:int="${port}" \
--set post_processors[0].model_name:str="${model_name}" \
--set timing_strategy.desired_rps:float="-1" \
--set inference_server.inference_server_config.stream:bool="${streaming}"

View File

@ -1,24 +0,0 @@
{
"dataset": {
"type": "fixed_isl_osl"
},
"inference_server": {
"type": "trtllm_openai_completions",
"host": "test",
"port": null,
"inference_server_config": {
"model_name": "test"
}
},
"timing_strategy": {
"type": "fixed",
"desired_rps": -1
},
"post_processors": [
{
"type": "infbench_summary",
"model_name": "test"
}
],
"timeout": null
}

View File

@ -128,7 +128,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
--output_dir ./tmp/llama/7B/trt_engines/weight_only/1-gpu/ \
--gemm_plugin auto
# Build LLaMA 7B using 2-way auto parallelism.
# Build LLaMA 7B using 2-way auto parallelism (deprecated).
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
--output_dir ./tllm_checkpoint_1gpu_fp16 \
--dtype float16

View File

@ -30,6 +30,9 @@ from utils import (DEFAULT_HF_MODEL_DIRS, add_common_args, get_beam_width_array,
import tensorrt_llm
import tensorrt_llm.profiler as profiler
from tensorrt_llm._utils import mpi_broadcast, str_dtype_to_torch
from tensorrt_llm.builder import EngineConfig
from tensorrt_llm.functional import RopeEmbeddingUtils, RotaryScalingType
from tensorrt_llm.layers import MropeParams
from tensorrt_llm.logger import logger
from tensorrt_llm.models.qwen.utils import make_context
from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner
@ -41,6 +44,42 @@ if PYTHON_BINDINGS:
from prompt_lookup.run_dtm_pld import run_dtm_pld
def ensemble_mrope_params(batch_input_ids, max_position_embeddings,
rotary_embedding_dim, theta):
mrope_params = MropeParams()
batch_size = len(batch_input_ids)
_, rotary_cos_sin = RopeEmbeddingUtils.create_sinusoidal_positions_for_attention_plugin(
num_pos=max_position_embeddings,
dim=rotary_embedding_dim,
theta=1000000.0,
scale_type=RotaryScalingType.mrope,
)
rotary_cos_sin = torch.tensor(rotary_cos_sin).to(batch_input_ids[0].device)
rotary_cos_sin = rotary_cos_sin.reshape(max_position_embeddings,
int(rotary_embedding_dim / 2), 2)
cos_ori = rotary_cos_sin[:, :, 0]
sin_ori = rotary_cos_sin[:, :, 1]
mrope_position_ids_padding = torch.zeros(
(batch_size, max_position_embeddings), dtype=torch.int32)
for i in range(batch_size):
seq_len = batch_input_ids[i].shape[-1]
mrope_position_ids_padding[i, :seq_len] = torch.arange(
seq_len, device=batch_input_ids[i].device)
cos = cos_ori[mrope_position_ids_padding].unsqueeze(-1)
sin = sin_ori[mrope_position_ids_padding].unsqueeze(-1)
mrope_params.mrope_rotary_cos_sin = torch.concatenate(
(cos, sin), axis=-1).reshape(batch_size, -1)
mrope_params.mrope_position_deltas = torch.zeros(
[batch_size, 1], device=batch_input_ids[0].device)
return mrope_params
def main(args):
is_integration_test = os.getenv('INTEGRATION_TEST', '0') == '1'
if is_integration_test:
@ -262,7 +301,19 @@ def main(args):
eval_task=eval_task,
add_special_tokens=add_special_tokens,
min_input_length=min_input_length)
batch_size = len(batch_input_ids)
# Generate mrope params for qwen model
engine_config = EngineConfig.from_json_file(
f"{args.engine_dir}/config.json")
pretrain_config = engine_config.pretrained_config
mrope_params = None
if 'qwen' in model_name.lower():
mrope_params = ensemble_mrope_params(
batch_input_ids,
max_position_embeddings=pretrain_config.max_position_embeddings,
rotary_embedding_dim=pretrain_config.rotary_embedding_dim,
theta=pretrain_config.rotary_base,
)
if batch_size == 0:
return [], [], [], {}
input_lengths = [x.size(0) for x in batch_input_ids]
@ -309,7 +360,8 @@ def main(args):
return_dict=True,
random_seed=random_seed,
medusa_choices=args.medusa_choices,
eagle_choices=args.eagle_choices)
eagle_choices=args.eagle_choices,
mrope_params=mrope_params)
torch.cuda.synchronize()
# Extract a list of tensors of shape beam_width x output_ids.

View File

@ -28,10 +28,10 @@ UPLOAD_PATH = env.uploadPath ? env.uploadPath : "sw-tensorrt-generic/llm-artifac
// Container configuration
// available tags can be found in: https://urm.nvidia.com/artifactory/sw-tensorrt-docker/tensorrt-llm/
// [base_image_name]-[arch]-[os](-[python_version])-[trt_version]-[torch_install_type]-[stage]-[date]-[mr_id]
LLM_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_SBSA_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-aarch64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py310-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py312-trt10.10.0.31-skip-tritondevel-202505191345-4400"
LLM_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
LLM_SBSA_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-aarch64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py310-trt10.10.0.31-skip-tritondevel-202505211401-4539"
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py312-trt10.10.0.31-skip-tritondevel-202505211401-4539"
// TODO: Move common variables to an unified location
BUILD_CORES_REQUEST = "8"

View File

@ -1,7 +1,7 @@
import java.lang.InterruptedException
DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
def createKubernetesPodConfig(image)
{

View File

@ -1,6 +1,7 @@
import math
import os
import threading
from itertools import accumulate
from typing import List, Optional, Tuple, Union
import torch
@ -116,6 +117,24 @@ def get_output_info(input: torch.Tensor, dim: int) -> List[int]:
return {'output_shape': output_shape, 'numel_base': numel_base}
def filter_valid_input(
input_list: List[torch.Tensor]
) -> Tuple[List[torch.Tensor], List[bool]]:
func_valid = lambda x: x is not None
valid_list = list(map(func_valid, input_list))
input_list = list(filter(func_valid, input_list))
return input_list, valid_list
def restore_full_output(output_list: List[torch.Tensor],
valid_list: List[bool]) -> List[torch.Tensor]:
index_list = list(accumulate(map(int, valid_list)))
output_list = list(
map(lambda valid, index: output_list[index - 1]
if valid else None, valid_list, index_list))
return output_list
def allgather(
input: Union[torch.Tensor, List[torch.Tensor]],
mapping: Mapping,
@ -155,8 +174,10 @@ def allgather(
if isinstance(input, torch.Tensor):
assert input.shape[dim] == sizes[mapping.tp_rank]
else:
assert all(
[val.shape[dim] == sizes[mapping.tp_rank] for val in input])
assert all([
val.shape[dim] == sizes[mapping.tp_rank] for val in input
if val is not None
])
# 'sizes' is not needed if all inputs in the same TP group have the same shape
for split_size in sizes[1:]:
if split_size != sizes[0]:
@ -170,6 +191,7 @@ def allgather(
output_info = get_output_info(input, dim)
input = input.contiguous().view(-1, output_info['numel_base'])
else:
input, valid = filter_valid_input(input)
torch_op = torch.ops.trtllm.allgather_list
output_info = [get_output_info(val, dim) for val in input]
input = [
@ -202,6 +224,7 @@ def allgather(
convert_output(val, val_info)
for val, val_info in zip(output, output_info)
]
output = restore_full_output(output, valid)
return output
@ -220,7 +243,10 @@ def reducescatter(
if isinstance(input, torch.Tensor):
assert input.shape[dim] == sum_split_size
else:
assert all([val.shape[dim] == sum_split_size for val in input])
assert all([
val.shape[dim] == sum_split_size for val in input
if val is not None
])
# 'sizes' is not needed if all outputs in the same TP group have the same shape
for split_size in sizes[1:]:
if split_size != sizes[0]:
@ -245,6 +271,7 @@ def reducescatter(
output_info = get_output_info(input, dim)
input = convert_input(input, output_info)
else:
input, valid = filter_valid_input(input)
torch_op = torch.ops.trtllm.reducescatter_list
output_info = [get_output_info(val, dim) for val in input]
input = [
@ -265,6 +292,7 @@ def reducescatter(
val.view(val_info['output_shape'])
for val, val_info in zip(output, output_info)
]
output = restore_full_output(output, valid)
return output
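# Illustration of the new filter_valid_input / restore_full_output helpers above:
# they drop None entries before the collective op and re-insert them afterwards.
# A standalone sketch of that round-trip with plain lists (the values are made up;
# only the helper semantics matter, and the simplified definitions are a sketch,
# not the exact code from the diff):
from itertools import accumulate

def filter_valid_input(input_list):
    # Remember which positions held a real value, and keep only those values.
    valid_list = [x is not None for x in input_list]
    return [x for x in input_list if x is not None], valid_list

def restore_full_output(output_list, valid_list):
    # Re-insert None placeholders at the positions that were filtered out.
    index_list = list(accumulate(int(v) for v in valid_list))
    return [output_list[i - 1] if v else None
            for v, i in zip(valid_list, index_list)]

filtered, valid = filter_valid_input(["a", None, "b"])
assert filtered == ["a", "b"]
assert restore_full_output([x.upper() for x in filtered], valid) == ["A", None, "B"]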

View File

@ -1124,19 +1124,13 @@ class FusedMoE(nn.Module):
if self.use_dp and self.parallel_size > 1 and not disable_fp4_allgather(
) and not self.enable_alltoall:
if x_sf is None:
x, token_selected_slots, token_final_scales = allgather(
[x, token_selected_slots, token_final_scales],
self.mapping,
dim=0,
sizes=None if use_dp_padding else all_rank_num_tokens)
else:
# Fp4 gemm has extra scaling factor
x, x_sf, token_selected_slots, token_final_scales = allgather(
[x, x_sf, token_selected_slots, token_final_scales],
self.mapping,
dim=0,
sizes=None if use_dp_padding else all_rank_num_tokens)
x, x_sf, token_selected_slots, token_final_scales = allgather(
[x, x_sf, token_selected_slots, token_final_scales],
self.mapping,
dim=0,
sizes=None if use_dp_padding else all_rank_num_tokens)
# Fp4 gemm has extra scaling factor
if x_sf is not None:
x_sf = reswizzle_sf(x_sf, x_row, x_col,
self.scaling_vector_size)

View File

@ -149,6 +149,9 @@ def infer_builder_flags(network):
def auto_parallel(network: Network, config: AutoParallelConfig):
logger.warning(
"auto_parallel is deprecated, "
"please use explicit parallelism like tp_size/pp_size instead.")
debug_mode = config.debug_mode
memory_budget = config.get_cluster_info(
).memory_budget_per_device * 1024 * 1024 * 1024

View File

@ -1359,11 +1359,19 @@ class BaseLlmArgs(BaseModel):
class TrtLlmArgs(BaseLlmArgs):
auto_parallel: bool = Field(default=False,
description="Enable auto parallel mode.")
auto_parallel: bool = Field(
default=False,
description="Enable auto parallel mode.",
deprecated=
"Use tensor_parallel_size/pipeline_parallel_size/xxx_parallel_size instead.",
)
auto_parallel_world_size: Optional[int] = Field(
default=None, description="The world size for auto parallel mode.")
default=None,
description="The world size for auto parallel mode.",
deprecated=
"Use tensor_parallel_size/pipeline_parallel_size/xxx_parallel_size instead.",
)
enable_tqdm: bool = Field(default=False,
description="Enable tqdm for progress bar.")

View File

@ -434,6 +434,9 @@ class CliFlowAccuracyTestHarness:
f"--dtype={self.dtype}",
]
if "nemotron_nas" in self.EXAMPLE_FOLDER:
convert_cmd.append("--trust_remote_code")
if self.MODEL_FORMAT == "NEMO":
convert_cmd.append(f"--nemo_ckpt_path={self.MODEL_PATH}")
else:

View File

@ -137,6 +137,8 @@ meta-llama/Llama-3.2-1B:
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 27.029
- quant_algo: FP8
accuracy: 27.029
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
accuracy: 27.257
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
@ -310,5 +312,3 @@ Qwen3/Qwen3-8B:
accuracy: 30
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 34.003
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 27.810

View File

@ -16,3 +16,12 @@ deepseek-ai/DeepSeek-R1:
accuracy: 70.45
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 44.95
- quant_algo: FP8
accuracy: 49.49
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 40.40
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
- accuracy: 58.08
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 57.07

View File

@ -72,5 +72,14 @@ Qwen3/Qwen3-235B-A22B:
accuracy: 85.78
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 92.57
- quant_algo: FP8
accuracy: 92.42
nvidia/Nemotron-H-8B-Base-8K:
- accuracy: 46.20
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 37.15
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
- accuracy: 94.43
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 94.16

View File

@ -28,6 +28,26 @@ meta-llama/Llama-3.1-8B-Instruct:
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 67.87
meta-llama/Llama-3.2-1B:
- quant_algo: W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN
accuracy: 32.72
- quant_algo: W8A8_SQ_PER_CHANNEL
accuracy: 32.07
- quant_algo: W4A16_AWQ
accuracy: 30.56
- quant_algo: W4A16_AWQ
kv_cache_quant_algo: INT8
accuracy: 31.29
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 31.02
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
accuracy: 33.97
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
extra_acc_spec: meta_recipe
accuracy: 33.87
- extra_acc_spec: max_attention_window_size=960
accuracy: 32.82
meta-llama/Llama-3.3-70B-Instruct:
- accuracy: 81.31
- quant_algo: NVFP4
@ -128,9 +148,16 @@ Qwen3/Qwen3-235B-A22B:
accuracy: 86
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
- accuracy: 79.43
- quant_algo: FP8
accuracy: 79.26
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
- accuracy: 57.97
nvidia/Nemotron-H-8B-Base-8K:
- accuracy: 69.590
microsoft/Phi-4-mini-instruct:
- accuracy: 68.98
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
- accuracy: 83.70
- quant_algo: FP8
kv_cache_quant_algo: FP8
accuracy: 83.36

View File

@ -200,6 +200,97 @@ class TestNemotronMini4BInstruct(CliFlowAccuracyTestHarness):
self.run(quant_algo=QuantAlgo.FP8, kv_cache_quant_algo=QuantAlgo.FP8)
# TODO: Remove the CLI tests once NIMs use PyTorch backend
class TestLlama3_3NemotronSuper49Bv1(CliFlowAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1"
EXAMPLE_FOLDER = "models/core/nemotron_nas"
@pytest.mark.skip_less_device(2)
def test_auto_dtype_tp2(self):
self.run(tasks=[MMLU(self.MODEL_NAME)], tp_size=2, dtype='auto')
@pytest.mark.skip(
reason="nemotron-nas scripts have to accommodate fp8 flags")
@pytest.mark.skip_less_device(2)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
def test_fp8_prequantized_tp2(self, mocker):
mocker.patch.object(
self.__class__, "MODEL_PATH",
f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8"
)
self.run(tasks=[MMLU(self.MODEL_NAME)],
tp_size=2,
quant_algo=QuantAlgo.FP8)
class TestNemotronNano(CliFlowAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
MODEL_PATH = f"{llm_models_root()}/Llama-3.1-Nemotron-Nano-8B-v1"
EXAMPLE_FOLDER = "models/core/llama"
def test_auto_dtype(self):
self.run(tasks=[MMLU(self.MODEL_NAME)], dtype='auto')
class TestNemotronUltra(CliFlowAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1"
EXAMPLE_FOLDER = "models/core/nemotron_nas"
@skip_pre_hopper
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
extra_summarize_args = []
if cuda_graph:
extra_summarize_args.append("--cuda_graph_mode")
self.run(tasks=[MMLU(self.MODEL_NAME)],
tp_size=tp_size,
pp_size=pp_size,
extra_convert_args=[
f"--moe_tp_size={tp_size // ep_size}",
f"--moe_ep_size={ep_size}", f"--moe_renorm_mode={0}"
],
extra_build_args=["--gemm_plugin=auto", "--moe_plugin=auto"],
extra_summarize_args=extra_summarize_args)
@skip_pre_hopper
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_fp8_prequantized(self, cuda_graph, tp_size, pp_size, ep_size,
mocker):
mocker.patch.object(
self.__class__, "MODEL_PATH",
f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1-FP8"
)
extra_summarize_args = []
if cuda_graph:
extra_summarize_args.append("--cuda_graph_mode")
self.run(tasks=[MMLU(self.MODEL_NAME)],
quant_algo=QuantAlgo.FP8,
kv_cache_quant_algo=QuantAlgo.FP8,
tp_size=tp_size,
pp_size=pp_size,
extra_convert_args=[
f"--moe_tp_size={tp_size // ep_size}",
f"--moe_ep_size={ep_size}", f"--moe_renorm_mode={0}"
],
extra_build_args=["--gemm_plugin=auto", "--moe_plugin=auto"],
extra_summarize_args=extra_summarize_args)
@skip_post_blackwell
class TestPhi2(CliFlowAccuracyTestHarness):
MODEL_NAME = "microsoft/phi-2"
@ -847,9 +938,7 @@ class TestLlama3_3_70BInstruct(CliFlowAccuracyTestHarness):
@pytest.mark.skip_device_not_contain(["B200"])
def test_nvfp4_prequantized_tp4(self, mocker):
mocker.patch.object(
self.__class__,
"MODEL_PATH",
model_path=
self.__class__, "MODEL_PATH",
f"{llm_models_root()}/modelopt-hf-model-hub/Llama-3.3-70B-Instruct-fp4"
)
self.run(tasks=[MMLU(self.MODEL_NAME)],

View File

@ -2,12 +2,12 @@
# I need to do this by creating a new class that mimics the LLM class. Instead of implementing the
# actual methods it will send OAI requests to the disaggregated serving endpoint.
# Please take a look at the existing test_llm_api_pytorch.py file for reference.
import concurrent
import contextlib
import os
import shutil
import subprocess
import tempfile
import time
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional
@ -16,11 +16,12 @@ import pytest
import requests
import yaml
from tensorrt_llm._torch import LLM
from tensorrt_llm.executor.result import GenerationResultBase
from tensorrt_llm.llmapi import CompletionOutput, RequestOutput, SamplingParams
from tensorrt_llm.llmapi.llm_args import LlmArgs
from ..conftest import llm_models_root
from ..trt_test_alternative import popen
from .accuracy_core import GSM8K, MMLU, LlmapiAccuracyTestHarness
@ -40,76 +41,85 @@ class Result(GenerationResultBase):
return self
class OpenAIServerClient:
DuckLLM = namedtuple('DuckLLM', ['args', 'generate_async'])
def __init__(self,
disaggregated_server_config: Dict[str, Any],
ctx_server_config: Dict[str, Any],
gen_server_config: Dict[str, Any],
model_name: str,
tensor_parallel_size: int = 1):
self.thread_pool = ThreadPoolExecutor(max_workers=16)
self.temp_dir = tempfile.mkdtemp()
self.futures = []
self.disaggregated_serving_config_path = os.path.join(
self.temp_dir, "disaggregated_serving_config.yaml")
with open(self.disaggregated_serving_config_path, "w") as f:
yaml.dump(disaggregated_server_config, f)
ctx_server_config_path = os.path.join(self.temp_dir,
"ctx_server_config.yaml")
with open(ctx_server_config_path, "w") as f:
yaml.dump(ctx_server_config, f)
gen_server_config_path = os.path.join(self.temp_dir,
"gen_server_config.yaml")
with open(gen_server_config_path, "w") as f:
yaml.dump(gen_server_config, f)
with LLM(model_name, tensor_parallel_size=tensor_parallel_size) as llm:
self.args = llm.args
class MyThreadPoolExecutor(ThreadPoolExecutor):
cuda_device_idx = 0
cuda_devices = []
for i in range(tensor_parallel_size):
cuda_devices.append(f"{cuda_device_idx}")
cuda_device_idx += 1
def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.futures: list[concurrent.futures.Future[RequestOutput]] = []
trtllm_serve_path = "trtllm-serve"
# Common arguments for both servers
common_args = [
trtllm_serve_path, model_name, "--host", "localhost", "--backend",
"pytorch"
]
if tensor_parallel_size > 1:
common_args.append(f"--tp_size={tensor_parallel_size}")
env_ctx = os.environ.copy()
env_ctx["TRTLLM_USE_UCX_KVCACHE"] = "1"
env_ctx["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_devices)
# Start the context server
self._ctx_server = subprocess.Popen(common_args + [
"--port", "8001", "--extra_llm_api_options", ctx_server_config_path
],
env=env_ctx)
# Start the generation server
env_gen = os.environ.copy()
env_gen["TRTLLM_USE_UCX_KVCACHE"] = "1"
cuda_devices = []
for i in range(tensor_parallel_size):
cuda_devices.append(f"{cuda_device_idx}")
cuda_device_idx += 1
env_gen["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_devices)
self._gen_server = subprocess.Popen(common_args + [
"--port", "8002", "--extra_llm_api_options", gen_server_config_path
],
env=env_gen)
def __exit__(self, exc_type, exc_val, exc_tb):
if exc_type is None:
for future in self.futures:
future.result()
return super().__exit__(exc_type, exc_val, exc_tb)
# Start the disaggregated server
self._disaggregated_server = subprocess.Popen([
trtllm_serve_path, "disaggregated", "-c",
self.disaggregated_serving_config_path, "--server_start_timeout",
"3600"
])
self.model_name = model_name
for future in self.futures:
future.cancel()
self.shutdown(wait=False, cancel_futures=True)
return False
@contextlib.contextmanager
def launch_disaggregated_llm(disaggregated_server_config: Dict[str, Any],
ctx_server_config: Dict[str, Any],
gen_server_config: Dict[str, Any],
model_name: str,
tensor_parallel_size: int = 1):
temp_dir = tempfile.TemporaryDirectory()
disaggregated_serving_config_path = os.path.join(
temp_dir.name, "disaggregated_serving_config.yaml")
with open(disaggregated_serving_config_path, "w") as f:
yaml.dump(disaggregated_server_config, f)
ctx_server_config_path = os.path.join(temp_dir.name,
"ctx_server_config.yaml")
with open(ctx_server_config_path, "w") as f:
yaml.dump(ctx_server_config, f)
gen_server_config_path = os.path.join(temp_dir.name,
"gen_server_config.yaml")
with open(gen_server_config_path, "w") as f:
yaml.dump(gen_server_config, f)
args = LlmArgs.from_kwargs(model=model_name,
tensor_parallel_size=tensor_parallel_size)
trtllm_serve_path = "trtllm-serve"
# Common arguments for both servers
common_args = [
trtllm_serve_path, model_name, "--host", "localhost", "--backend",
"pytorch"
]
if tensor_parallel_size > 1:
common_args.append(f"--tp_size={tensor_parallel_size}")
env_ctx = os.environ.copy()
env_ctx["TRTLLM_USE_UCX_KVCACHE"] = "1"
env_ctx["CUDA_VISIBLE_DEVICES"] = ",".join(
map(str, range(tensor_parallel_size)))
env_gen = os.environ.copy()
env_gen["TRTLLM_USE_UCX_KVCACHE"] = "1"
env_gen["CUDA_VISIBLE_DEVICES"] = ",".join(
map(str, range(tensor_parallel_size, 2 * tensor_parallel_size)))
with (MyThreadPoolExecutor(max_workers=16) as thread_pool, temp_dir,
popen(common_args + [
"--port", "8001", "--extra_llm_api_options",
ctx_server_config_path
],
env=env_ctx) as ctx_server,
popen(common_args + [
"--port", "8002", "--extra_llm_api_options",
gen_server_config_path
],
env=env_gen) as gen_server,
popen([
trtllm_serve_path, "disaggregated", "-c",
disaggregated_serving_config_path, "--server_start_timeout",
"3600"
]) as disaggregated_server):
while True:
time.sleep(1)
try:
@ -120,54 +130,47 @@ class OpenAIServerClient:
except requests.exceptions.ConnectionError:
continue
self.client = openai.OpenAI(api_key="1234567890",
base_url=f"http://localhost:8000/v1")
client = openai.OpenAI(api_key="1234567890",
base_url=f"http://localhost:8000/v1")
def send_request(self, prompt: str, sampling_params: SamplingParams):
response = self.client.completions.create(
model=self.model_name,
prompt=prompt,
stream=False,
**({
"max_tokens": sampling_params.max_tokens,
"temperature": sampling_params.temperature,
"top_p": sampling_params.top_p,
"stop": sampling_params.stop,
"seed": sampling_params.seed
} if sampling_params else {}))
result = Result(
id=0,
sampling_params=sampling_params,
outputs=[CompletionOutput(text=response.choices[0].text, index=0)])
requested_output = RequestOutput._from_generation_result(result,
prompt=prompt)
setattr(requested_output, "result", result.result)
return requested_output
def send_request(prompt: str, sampling_params: SamplingParams):
response = client.completions.create(
model=model_name,
prompt=prompt,
stream=False,
**({
"max_tokens": sampling_params.max_tokens,
"temperature": sampling_params.temperature,
"top_p": sampling_params.top_p,
"stop": sampling_params.stop,
"seed": sampling_params.seed
} if sampling_params else {}))
result = Result(id=0,
sampling_params=sampling_params,
outputs=[
CompletionOutput(text=response.choices[0].text,
index=0)
])
requested_output = RequestOutput._from_generation_result(
result, prompt=prompt)
setattr(requested_output, "result", result.result)
return requested_output
def generate_async(self,
prompt: str,
sampling_params: Optional[SamplingParams] = None):
future = self.thread_pool.submit(self.send_request, prompt,
sampling_params)
self.futures.append(future)
return future
def generate_async(prompt: str,
sampling_params: Optional[SamplingParams] = None):
future = thread_pool.submit(send_request, prompt, sampling_params)
thread_pool.futures.append(future)
return future
def __enter__(self):
return self
yield DuckLLM(args, generate_async)
def __exit__(self, exc_type, exc_value, traceback):
shutil.rmtree(self.temp_dir)
self._ctx_server.terminate()
self._gen_server.terminate()
self._disaggregated_server.terminate()
ctx_server.terminate()
gen_server.terminate()
disaggregated_server.terminate()
self._ctx_server.wait()
self._gen_server.wait()
self._disaggregated_server.wait()
for future in self.futures:
future.result()
self.thread_pool.shutdown(wait=True)
ctx_server.wait()
gen_server.wait()
disaggregated_server.wait()
class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
@ -201,12 +204,13 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
"urls": ["localhost:8002"]
}
}
with OpenAIServerClient(disaggregated_server_config, ctx_server_config,
gen_server_config, self.MODEL_PATH) as client:
with launch_disaggregated_llm(disaggregated_server_config,
ctx_server_config, gen_server_config,
self.MODEL_PATH) as llm:
task = MMLU(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)
class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
@ -215,6 +219,7 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
@pytest.mark.parametrize("overlap_scheduler", [False, True])
def test_auto_dtype(self, overlap_scheduler):
pytest.skip("https://nvbugs/5297821")
ctx_server_config = {
"pytorch_backend_config": {
"disable_overlap_scheduler": True
@ -238,12 +243,12 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
"urls": ["localhost:8002"]
}
}
with OpenAIServerClient(disaggregated_server_config,
ctx_server_config,
gen_server_config,
self.MODEL_PATH,
tensor_parallel_size=4) as client:
with launch_disaggregated_llm(disaggregated_server_config,
ctx_server_config,
gen_server_config,
self.MODEL_PATH,
tensor_parallel_size=4) as llm:
task = MMLU(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(client)
task.evaluate(llm)

View File

@ -188,10 +188,11 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
@pytest.mark.skip(reason="https://nvbugspro.nvidia.com/bug/5292517")
@skip_pre_hopper
def test_fp8_llm_decoder(self):
def test_fp8_llm_sampler(self):
model_path = f"{llm_models_root()}/llama-3.1-model/Llama-3.1-8B-Instruct-FP8"
pytorch_config = PyTorchConfig(enable_trtllm_decoder=True)
pytorch_config = PyTorchConfig(enable_trtllm_sampler=True)
llm = LLM(model_path, pytorch_backend_config=pytorch_config)
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
@ -207,6 +208,79 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
extra_acc_spec="temperature=0.8,top_p=0.95")
class TestLlama3_2_1B(LlmapiAccuracyTestHarness):
MODEL_NAME = "meta-llama/Llama-3.2-1B"
MODEL_PATH = f"{llm_models_root()}/llama-3.2-models/Llama-3.2-1B"
EXAMPLE_FOLDER = "models/core/llama"
def test_auto_dtype(self):
with LLM(self.MODEL_PATH) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_smooth_quant(self):
quant_config = QuantConfig(
QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_smooth_quant_ootb(self):
quant_config = QuantConfig(QuantAlgo.W8A8_SQ_PER_CHANNEL)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_int4_awq(self):
quant_config = QuantConfig(QuantAlgo.W4A16_AWQ)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_post_blackwell
def test_int4_awq_int8_kv_cache(self):
quant_config = QuantConfig(QuantAlgo.W4A16_AWQ)
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.INT8)
with LLM(self.MODEL_PATH,
quant_config=quant_config,
kv_cache_config=kv_cache_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_pre_ada
def test_fp8(self):
quant_config = QuantConfig(QuantAlgo.FP8)
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.FP8)
with LLM(self.MODEL_PATH,
quant_config=quant_config,
kv_cache_config=kv_cache_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_pre_ada
@pytest.mark.skip_less_device(2)
def test_fp8_pp2(self):
quant_config = QuantConfig(QuantAlgo.FP8)
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.FP8)
with LLM(self.MODEL_PATH,
pipeline_parallel_size=2,
quant_config=quant_config,
kv_cache_config=kv_cache_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
@skip_pre_ada
@skip_post_blackwell
def test_fp8_rowwise(self):
quant_config = QuantConfig(QuantAlgo.FP8_PER_CHANNEL_PER_TOKEN)
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
class TestLlama3_3_70BInstruct(LlmapiAccuracyTestHarness):
MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"
@ -924,7 +998,7 @@ class TestNemotronNas(LlmapiAccuracyTestHarness):
@pytest.mark.skip_less_device_memory(80000)
class TestNemotronSuper(LlmapiAccuracyTestHarness):
class TestLlama3_3NemotronSuper49Bv1(LlmapiAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1"
@ -939,6 +1013,20 @@ class TestNemotronSuper(LlmapiAccuracyTestHarness):
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
@pytest.mark.skip_less_device(2)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
def test_fp8_prequantized_tp2(self):
model_path = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8"
with LLM(model_path, tensor_parallel_size=2) as llm:
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
class TestNemotronNano(LlmapiAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
@ -946,10 +1034,61 @@ class TestNemotronNano(LlmapiAccuracyTestHarness):
def test_auto_dtype(self):
with LLM(self.MODEL_PATH) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
class TestNemotronUltra(LlmapiAccuracyTestHarness):
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1"
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
with LLM(self.MODEL_PATH,
tensor_parallel_size=tp_size,
pipeline_parallel_size=pp_size,
moe_expert_parallel_size=ep_size,
use_cuda_graph=cuda_graph) as llm:
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
@pytest.mark.skip_less_device(8)
@pytest.mark.skip_device_not_contain(["H100", "B200"])
@parametrize_with_ids("cuda_graph", [False, True])
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
(8, 1, 8)],
ids=["tp8", "tp8ep4", "tp8ep8"])
def test_fp8_prequantized(self, cuda_graph, tp_size, pp_size, ep_size):
model_path = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1-FP8"
with LLM(model_path,
tensor_parallel_size=tp_size,
pipeline_parallel_size=pp_size,
moe_expert_parallel_size=ep_size,
use_cuda_graph=cuda_graph) as llm:
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))
class TestNemotronH(LlmapiAccuracyTestHarness):
@ -1185,3 +1324,24 @@ class TestQwen3_235B_A22B(LlmapiAccuracyTestHarness):
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
class TestPhi4MiniInstruct(LlmapiAccuracyTestHarness):
MODEL_NAME = "microsoft/Phi-4-mini-instruct"
MODEL_PATH = f"{llm_models_root()}/Phi-4-mini-instruct"
@pytest.mark.skip(
reason=
"Temporarily skipping test_auto_dtype while resolving Phi-4's architecture issue."
)
def test_auto_dtype(self):
with LLM(self.MODEL_PATH) as llm:
task = CnnDailymail(self.MODEL_NAME)
task.evaluate(llm)
task = MMLU(self.MODEL_NAME)
task.evaluate(llm)
task = GSM8K(self.MODEL_NAME)
task.evaluate(llm)
task = GPQADiamond(self.MODEL_NAME)
task.evaluate(llm,
extra_evaluator_kwargs=dict(apply_chat_template=True))

View File

@ -24,23 +24,23 @@ from packaging import version
from .trt_test_alternative import check_call, check_output, exists, is_windows
def venv_check_call(venv, cmd, running_log=None, env=None):
def venv_check_call(venv, cmd, env=None, **kwargs):
def _war_check_call(*args, **kwargs):
kwargs["cwd"] = venv.get_working_directory()
return check_call(*args, **kwargs)
venv.run_cmd(cmd, caller=_war_check_call, running_log=running_log, env=env)
venv.run_cmd(cmd, caller=_war_check_call, env=env, **kwargs)
def venv_check_output(venv, cmd):
def venv_check_output(venv, cmd, env=None, **kwargs):
def _war_check_output(*args, **kwargs):
kwargs["cwd"] = venv.get_working_directory()
output = check_output(*args, **kwargs)
return output
return venv.run_cmd(cmd, caller=_war_check_output)
return venv.run_cmd(cmd, caller=_war_check_output, env=env, **kwargs)
def venv_mpi_check_call(venv, mpi_cmd, python_cmd):

View File

@ -22,6 +22,7 @@ import subprocess as sp
import tempfile
import time
import urllib.request
import warnings
from functools import wraps
from pathlib import Path
from typing import Iterable, Sequence
@ -2196,8 +2197,10 @@ def skip_by_host_memory(request):
IS_UNDER_CI_ENV = 'JENKINS_HOME' in os.environ
gpu_warning_threshold = 1024 * 1024 * 1024
def collect_status():
def collect_status(item: pytest.Item):
if not IS_UNDER_CI_ENV:
return
@ -2210,6 +2213,22 @@ def collect_status():
for idx in range(pynvml.nvmlDeviceGetCount())
}
deadline = time.perf_counter() + 60 # 1 min
observed_used = 0
global gpu_warning_threshold
while time.perf_counter() < deadline:
observed_used = max(
pynvml.nvmlDeviceGetMemoryInfo(device).used
for device in handles.values())
if observed_used <= gpu_warning_threshold:
break
time.sleep(1)
else:
gpu_warning_threshold = max(observed_used, gpu_warning_threshold)
warnings.warn(
f"Test {item.name} does not free up GPU memory correctly!")
gpu_memory = {}
for idx, device in handles.items():
total_used = pynvml.nvmlDeviceGetMemoryInfo(device).used // 1024 // 1024
@ -2218,13 +2237,12 @@ def collect_status():
process = {}
for entry in detail:
host_memory_in_mbs = -1
try:
host_memory_in_mbs = psutil.Process(
entry.pid).memory_full_info().uss // 1024 // 1024
p = psutil.Process(entry.pid)
host_memory_in_mbs = p.memory_full_info().uss // 1024 // 1024
process[entry.pid] = (entry.usedGpuMemory // 1024 // 1024,
host_memory_in_mbs)
except:
host_memory_in_mbs, p.cmdline())
except Exception:
pass
gpu_memory[idx] = {
@ -2239,7 +2257,7 @@ def collect_status():
@pytest.hookimpl(wrapper=True)
def pytest_runtest_protocol(item, nextitem):
ret = yield
collect_status()
collect_status(item)
return ret

View File

@ -18,14 +18,7 @@ import subprocess
import pytest
from defs.conftest import skip_no_hopper
def kill_disaggregated_processes():
"""Kill any existing disaggregated processes."""
try:
subprocess.run(['pkill', '-9', '-f', 'trtllm-serve'], check=False)
except Exception:
pass
from defs.trt_test_alternative import check_call, popen
def cleanup_output_files():
@ -120,93 +113,92 @@ def run_disaggregated_test(example_dir,
env=None,
cwd=None):
"""Run disaggregated test with given configuration."""
kill_disaggregated_processes()
cleanup_output_files()
num_ranks, config_file = get_test_config(test_desc, example_dir,
os.path.dirname(__file__))
# Start workers
workers_cmd = [
'mpirun', '--allow-run-as-root', '--oversubscribe', '-n',
str(num_ranks), 'trtllm-serve', 'disaggregated_mpi_worker', '-c',
config_file
]
with open('output_workers.log', 'w') as f:
workers_proc = subprocess.Popen(workers_cmd,
stdout=f,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
server_start_timeout = 900
# Start server
server_cmd = [
'trtllm-serve', 'disaggregated', '--server_start_timeout',
str(server_start_timeout), '-c', config_file
]
with open('output_disagg.log', 'w') as f:
server_proc = subprocess.Popen(server_cmd,
stdout=f,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
client_dir = f"{example_dir}/clients"
for _ in range(num_iters):
client_cmd = [
'python3', f'{client_dir}/disagg_client.py', '-c',
f'{example_dir}/disagg_config.yaml', '-p',
f'{client_dir}/prompts.json', '--ignore-eos',
'--server-start-timeout',
str(server_start_timeout)
]
subprocess.run(client_cmd, check=True, env=env)
# Streaming client run
streaming_client_cmd = client_cmd + [
'--streaming', '-o', 'output_streaming.json'
]
subprocess.run(streaming_client_cmd, check=True, env=env)
# Run the chat completion endpoint test only for TinyLlama
if test_desc == "overlap":
chat_client_cmd = client_cmd + [
'-e', 'chat', '-o', 'output_chat.json'
with ( # Start workers
open('output_workers.log', 'w') as output_workers,
popen(workers_cmd,
stdout=output_workers,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd),
# Start server
open('output_disagg.log', 'w') as output_disagg,
popen(server_cmd,
stdout=output_disagg,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)):
client_dir = f"{example_dir}/clients"
for _ in range(num_iters):
client_cmd = [
'python3', f'{client_dir}/disagg_client.py', '-c',
f'{example_dir}/disagg_config.yaml', '-p',
f'{client_dir}/prompts.json', '--ignore-eos',
'--server-start-timeout',
str(server_start_timeout)
]
subprocess.run(chat_client_cmd, check=True, env=env)
check_call(client_cmd, env=env)
streaming_chat_client_cmd = chat_client_cmd + [
'--streaming', '-o', 'output_streaming_chat.json'
# Streaming client run
streaming_client_cmd = client_cmd + [
'--streaming', '-o', 'output_streaming.json'
]
subprocess.run(streaming_chat_client_cmd, check=True, env=env)
check_call(streaming_client_cmd, env=env)
# Verify outputs
not_expected_strings = ["Berlin Berlin"]
# Run the chat completion endpoint test only for TinyLlama
if test_desc == "overlap":
chat_client_cmd = client_cmd + [
'-e', 'chat', '-o', 'output_chat.json'
]
check_call(chat_client_cmd, env=env)
output_files = ['output.json', 'output_streaming.json']
if test_desc == "overlap":
# Disable streaming chat completion for overlap test
# due to bug
output_files.extend(['output_chat.json'])
streaming_chat_client_cmd = chat_client_cmd + [
'--streaming', '-o', 'output_streaming_chat.json'
]
check_call(streaming_chat_client_cmd, env=env)
if test_desc.startswith("gen_only"):
continue
# Verify outputs
not_expected_strings = ["Berlin Berlin"]
for output_file in output_files:
with open(output_file, 'r') as f:
content = f.read()
if "deepseek_v3_lite" in test_desc or output_file == "output_chat.json":
expected_strings = ["Berlin", "Asyncio is a"]
else:
expected_strings = [
"The capital of Germany is Berlin",
"Asyncio is a Python library"
]
for expected_string in expected_strings:
assert expected_string in content, f"Expected string '{expected_string}' not found in {output_file}"
for not_expected_string in not_expected_strings:
assert not_expected_string not in content, f"Unexpected string '{not_expected_string}' found in {output_file}"
output_files = ['output.json', 'output_streaming.json']
if test_desc == "overlap":
# Disable streaming chat completion for overlap test
# due to bug
output_files.extend(['output_chat.json'])
if test_desc.startswith("gen_only"):
continue
for output_file in output_files:
with open(output_file, 'r') as f:
content = f.read()
if "deepseek_v3_lite" in test_desc or output_file == "output_chat.json":
expected_strings = ["Berlin", "Asyncio is a"]
else:
expected_strings = [
"The capital of Germany is Berlin",
"Asyncio is a Python library"
]
for expected_string in expected_strings:
assert expected_string in content, f"Expected string '{expected_string}' not found in {output_file}"
for not_expected_string in not_expected_strings:
assert not_expected_string not in content, f"Unexpected string '{not_expected_string}' found in {output_file}"
# Print outputs
print("------------------")
@ -221,8 +213,6 @@ def run_disaggregated_test(example_dir,
with open('output_disagg.log', 'r') as f:
print(f.read())
kill_disaggregated_processes()
@pytest.mark.parametrize("llama_model_root", ['TinyLlama-1.1B-Chat-v1.0'],
indirect=True)

View File

@ -9,6 +9,7 @@ from typing import List, Optional, Tuple
import aiohttp
import pytest
import yaml
from defs.trt_test_alternative import popen
from transformers import AutoTokenizer
from tensorrt_llm import logger
@ -53,11 +54,11 @@ def run_disaggregated_workers(
config_file
]
logger.info(f"Running workers with command: {' '.join(workers_cmd)}")
workers_proc = subprocess.Popen(workers_cmd,
stdout=stdout,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
workers_proc = popen(workers_cmd,
stdout=stdout,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
return workers_proc, ctx_servers, gen_servers
@ -500,19 +501,18 @@ def load_default_prompts(disaggregated_example_root: str):
@contextlib.contextmanager
def background_workers(llm_venv, config_file: str, num_ranks: int = None):
cwd = llm_venv.get_working_directory()
log_file = open(os.path.join(cwd, 'output_workers.log'), 'w')
workers_proc, ctx_servers, gen_servers = run_disaggregated_workers(
config_file=config_file,
stdout=log_file,
env=llm_venv._new_env,
cwd=cwd,
num_ranks=num_ranks)
try:
yield ctx_servers, gen_servers
finally:
workers_proc.terminate()
workers_proc.wait()
log_file.close()
with open(os.path.join(cwd, 'output_workers.log'), 'w') as log_file:
workers_proc, ctx_servers, gen_servers = run_disaggregated_workers(
config_file=config_file,
stdout=log_file,
env=llm_venv._new_env,
cwd=cwd,
num_ranks=num_ranks)
with workers_proc as proc:
yield ctx_servers, gen_servers
proc.terminate()
proc.wait()
@pytest.mark.parametrize("llama_model_root", ['TinyLlama-1.1B-Chat-v1.0'],

View File

@ -741,7 +741,7 @@ def test_trtllm_bench_pytorch_backend_sanity(llm_root, llm_venv,
dir="./",
delete=True,
delete_on_close=True) as running_log:
check_call(benchmark_cmd, shell=True, running_log=running_log)
check_call(benchmark_cmd, shell=True, stdout=running_log)
if model_id in mapping and not use_extra_config:
# extra config defines max kv cache tokens number to be 40000 which makes
# the checking process not unified.
@ -775,7 +775,7 @@ def test_trtllm_bench_mgmn(llm_root, llm_venv):
delete_on_close=True) as running_log:
check_call(benchmark_cmd,
shell=True,
running_log=running_log,
stdout=running_log,
env=llm_venv._new_env)
_check_mem_usage(running_log, [30, 0, 0, 0])
@ -928,7 +928,7 @@ def test_trtllm_bench_iteration_log(llm_root, llm_venv, model_name,
dir="./",
delete=True,
delete_on_close=True) as running_log:
check_call(benchmark_cmd, shell=True, running_log=running_log)
check_call(benchmark_cmd, shell=True, stdout=running_log)
_check_mem_usage(running_log, [19.4, 0, 0, 0])
else:
check_call(benchmark_cmd, shell=True)
@ -1454,7 +1454,7 @@ def test_ptp_quickstart(llm_root, llm_venv):
delete=True,
delete_on_close=True) as running_log:
venv_check_call(llm_venv, [str(example_root / "quickstart.py")],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [4.60, 0, 0, 0])
@ -1476,6 +1476,9 @@ def test_ptp_quickstart(llm_root, llm_venv):
pytest.param('Llama3.1-70B-FP8',
'llama-3.1-model/Llama-3.1-70B-Instruct-FP8',
marks=skip_pre_hopper),
pytest.param('Nemotron-Super-49B-v1-NVFP4',
'nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf',
marks=skip_pre_hopper),
pytest.param('Nemotron-Super-49B-v1-FP8',
'nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8',
marks=skip_pre_hopper),
@ -1517,7 +1520,7 @@ def test_ptp_quickstart_advanced(llm_root, llm_venv, model_name, model_path):
]
if "Qwen3" in model_name:
cmds.append(f"--kv_cache_fraction=0.6")
llm_venv.run_cmd(cmds, running_log=running_log)
llm_venv.run_cmd(cmds, stdout=running_log)
if model_name in mapping:
_check_mem_usage(running_log, [mapping[model_name], 0, 0, 0])
@ -1545,7 +1548,7 @@ def test_ptq_quickstart_advanced_mtp(llm_root, llm_venv, model_name,
"--model_dir",
f"{llm_models_root()}/{model_path}",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [54.50, 0, 0, 0])
@ -1601,7 +1604,7 @@ def test_ptp_quickstart_advanced_eagle3(llm_root, llm_venv, model_name,
"--disable_kv_cache_reuse",
"--disable_overlap_scheduler",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [25.2, 0, 0, 0])
@ -1635,7 +1638,7 @@ def test_ptp_quickstart_advanced_deepseek_r1_8gpus(llm_root, llm_venv,
"--max_seq_len=3000",
"--disable_kv_cache_reuse",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [106.3, 0, 0, 0], 8)
@ -1675,7 +1678,7 @@ def test_relaxed_acceptance_quickstart_advanced_deepseek_r1_8gpus(
"--relaxed_topk=10",
"--relaxed_delta=0.5",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [85.6, 0, 0, 0], 8)
# TODO: relaxed acceptance is incompatible with attention dp
# "--enable_attention_dp"
@ -1725,7 +1728,7 @@ def test_ptp_quickstart_advanced_8gpus(llm_root, llm_venv, model_name,
f"{llm_models_root()}/{model_path}",
"--tp_size=8",
],
running_log=running_log)
stdout=running_log)
if model_name in mapping:
_check_mem_usage(running_log, [mapping[model_name], 0, 0, 0], 8)
@ -1768,7 +1771,7 @@ def test_ptp_quickstart_advanced_mixed_precision(llm_root, llm_venv):
"--model_dir",
f"{llm_models_root()}/{model_path}",
],
running_log=running_log)
stdout=running_log)
_check_mem_usage(running_log, [12.0, 0, 0, 0])
@ -1959,7 +1962,7 @@ def test_ptp_quickstart_multimodal(llm_root, llm_venv, model_name, model_path,
"--media",
*functionality_inputs[modality]["media"],
],
running_log=running_log)
stdout=running_log)
if model_name in mapping:
peak, fraction = mapping[model_name]

View File

@ -894,7 +894,8 @@ def prepare_gpt_2b_lora_engine(type, tensorrt_llm_gpt_example_root,
return engine_dir
def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root):
# Build GPT
if type == "python_backend":
engine_dir = os.path.join(tensorrt_llm_gpt_example_root, "engine_dir",
@ -904,8 +905,7 @@ def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
"gpt_175b_ifb")
convert_cmd = [
"python3",
f"{tensorrt_llm_gpt_example_root}/../generate_checkpoint_config.py",
"python3", f"{tensorrt_llm_example_root}/generate_checkpoint_config.py",
f"--output_path={engine_dir}/ckpt_config.json",
"--architecture=GPTForCausalLM", "--dtype=float16",
"--num_hidden_layers=96", "--num_attention_heads=96",
@ -948,7 +948,8 @@ def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
return engine_dir
def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root):
def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root):
# Build GPT
if type == "python_backend":
engine_dir = os.path.join(tensorrt_llm_gpt_example_root, "engine_dir",
@ -958,8 +959,7 @@ def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root):
"gpt_multi_node_ifb")
convert_cmd = [
"python3",
f"{tensorrt_llm_gpt_example_root}/../generate_checkpoint_config.py",
"python3", f"{tensorrt_llm_example_root}/generate_checkpoint_config.py",
f"--output_path={engine_dir}/ckpt_config.json",
"--architecture=GPTForCausalLM", "--dtype=float16",
"--num_hidden_layers=96", "--num_attention_heads=96",
@ -1111,7 +1111,8 @@ def prepare_llama_v2_13b_engine(tensorrt_llm_llama_example_root,
return engine_dir
def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
def prepare_llama_v3_8b_engine(tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
workers=8,
data_type="bfloat16"):
@ -1133,7 +1134,7 @@ def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
elif data_type == "fp8":
convert_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={llama_v3_8b_model_root}",
"--dtype=float16",
"--qformat=fp8",
@ -1186,6 +1187,7 @@ def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
def prepare_llama_v3_70b_engine(type,
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type="bfloat16"):
@ -1211,7 +1213,7 @@ def prepare_llama_v3_70b_engine(type,
elif data_type == "fp8":
convert_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={llama_v3_70b_model_root}",
"--dtype=float16",
"--qformat=fp8",
@ -1707,7 +1709,8 @@ def prepare_tiny_llama_1b_engine(type, tensorrt_llm_llama_example_root,
return engine_dir, xgrammar_tokenizer_info_path
def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_mixtral_example_root,
def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_example_root,
tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root,
llm_backend_root):
engine_dir = os.path.join(tensorrt_llm_mixtral_example_root, "engine_dir",
@ -1718,7 +1721,7 @@ def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_mixtral_example_root,
# Quantize model
quantize_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={mixtral_8x7b_v0_1_model_root}",
"--dtype=float16",
"--qformat=fp8",

View File

@ -0,0 +1,394 @@
#!/usr/bin/env python
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
import argparse
import queue
import sys
import time
from functools import partial
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException
#
# Simple streaming client for TRT-LLM inflight batching backend
#
# In order for this code to work properly, config.pbtxt must contain these values:
#
# model_transaction_policy {
# decoupled: True
# }
#
# parameters: {
# key: "gpt_model_type"
# value: {
# string_value: "inflight_batching"
# }
# }
#
# In order for gpt_model_type 'inflight_batching' to work, you must copy engine from
#
# tensorrt_llm/cpp/tests/resources/models/rt_engine/gpt2/fp16-inflight-batching-plugin/1-gpu/
#
class UserData:
def __init__(self):
self._completed_requests = queue.Queue()
def prepare_inputs(input_ids_data, input_lengths_data, request_output_len_data,
beam_width_data, temperature_data, streaming_data, end_id):
inputs = [
grpcclient.InferInput('input_ids', [1, 12], "INT32"),
grpcclient.InferInput('input_lengths', [1, 1], "INT32"),
grpcclient.InferInput('request_output_len', [1, 1], "UINT32"),
grpcclient.InferInput('beam_width', [1, 1], "UINT32"),
grpcclient.InferInput('temperature', [1, 1], "FP32"),
grpcclient.InferInput('streaming', [1, 1], "BOOL"),
grpcclient.InferInput('end_id', [1, 1], "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids_data)
inputs[1].set_data_from_numpy(input_lengths_data)
inputs[2].set_data_from_numpy(request_output_len_data)
inputs[3].set_data_from_numpy(beam_width_data)
inputs[4].set_data_from_numpy(temperature_data)
inputs[5].set_data_from_numpy(streaming_data)
inputs[6].set_data_from_numpy(end_id)
return inputs
def prepare_stop_signals():
inputs = [
grpcclient.InferInput('input_ids', [1, 1], "INT32"),
grpcclient.InferInput('input_lengths', [1, 1], "INT32"),
grpcclient.InferInput('request_output_len', [1, 1], "UINT32"),
grpcclient.InferInput('stop', [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(np.empty([1, 1], dtype=np.int32))
inputs[1].set_data_from_numpy(np.zeros([1, 1], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[0]], dtype=np.uint32))
inputs[3].set_data_from_numpy(np.array([[True]], dtype='bool'))
return inputs
# Define the callback function. Note the last two parameters should be
# result and error. InferenceServerClient will provide the results of an
# inference as grpcclient.InferResult in result. For successful
# inference, error will be None, otherwise it will be an object of
# tritonclientutils.InferenceServerException holding the error details
def callback(user_data, result, error):
if error:
user_data._completed_requests.put(error)
else:
user_data._completed_requests.put(result)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"-v",
"--verbose",
action="store_true",
required=False,
default=False,
help="Enable verbose output",
)
parser.add_argument(
"-u",
"--url",
type=str,
required=False,
default="localhost:8001",
help="Inference server URL. Default is localhost:8001.",
)
parser.add_argument(
"-s",
"--ssl",
action="store_true",
required=False,
default=False,
help="Enable SSL encrypted channel to the server",
)
parser.add_argument(
"-t",
"--stream-timeout",
type=float,
required=False,
default=None,
help="Stream timeout in seconds. Default is None.",
)
parser.add_argument(
"-r",
"--root-certificates",
type=str,
required=False,
default=None,
help="File holding PEM-encoded root certificates. Default is None.",
)
parser.add_argument(
"-p",
"--private-key",
type=str,
required=False,
default=None,
help="File holding PEM-encoded private key. Default is None.",
)
parser.add_argument(
"-x",
"--certificate-chain",
type=str,
required=False,
default=None,
help="File holding PEM-encoded certificate chain. Default is None.",
)
parser.add_argument(
"-C",
"--grpc-compression-algorithm",
type=str,
required=False,
default=None,
help=
"The compression algorithm to be used when sending request to server. Default is None.",
)
parser.add_argument(
"-S",
"--streaming",
action="store_true",
required=False,
default=False,
help="Enable streaming mode. Default is False.",
)
parser.add_argument(
"-c",
"--check-output",
action="store_true",
required=False,
default=False,
help="Enable check of output ids for CI",
)
parser.add_argument(
"-b",
"--beam-width",
required=False,
type=int,
default=1,
help="Beam width value",
)
parser.add_argument(
"--temperature",
type=float,
required=False,
default=1.0,
help="temperature value",
)
parser.add_argument(
"--request-output-len",
type=int,
required=False,
default=16,
help="temperature value",
)
parser.add_argument(
'--stop-after-ms',
type=int,
required=False,
default=0,
help='Stop the generation early after the given number of milliseconds')
FLAGS = parser.parse_args()
print('=========')
input_ids = [[
28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257
]]
input_ids_data = np.array(input_ids, dtype=np.int32)
input_lengths = [[len(ii)] for ii in input_ids]
input_lengths_data = np.array(input_lengths, dtype=np.int32)
request_output_len = [[FLAGS.request_output_len]]
request_output_len_data = np.array(request_output_len, dtype=np.uint32)
beam_width = [[FLAGS.beam_width]]
beam_width_data = np.array(beam_width, dtype=np.uint32)
temperature = [[FLAGS.temperature]]
temperature_data = np.array(temperature, dtype=np.float32)
streaming = [[FLAGS.streaming]]
streaming_data = np.array(streaming, dtype=bool)
end_id = np.array([[6303]], dtype=np.uint32)
inputs = prepare_inputs(input_ids_data, input_lengths_data,
request_output_len_data, beam_width_data,
temperature_data, streaming_data, end_id)
if FLAGS.stop_after_ms > 0:
stop_inputs = prepare_stop_signals()
else:
stop_inputs = None
request_id = "12345"
import random
request_id = str(random.randint(3, 9000))
expected_output_ids = [
input_ids[0] + [
21221, 290, 257, 4255, 379, 262, 1957, 7072, 11, 4689, 347, 2852,
2564, 494, 13, 679
]
]
if FLAGS.streaming:
actual_output_ids = [input_ids[0]]
else:
actual_output_ids = []
user_data = UserData()
with grpcclient.InferenceServerClient(
url=FLAGS.url,
verbose=FLAGS.verbose,
ssl=FLAGS.ssl,
root_certificates=FLAGS.root_certificates,
private_key=FLAGS.private_key,
certificate_chain=FLAGS.certificate_chain,
) as triton_client:
try:
if FLAGS.streaming:
# Establish stream
triton_client.start_stream(
callback=partial(callback, user_data),
stream_timeout=FLAGS.stream_timeout,
)
# Send request
triton_client.async_stream_infer(
'tensorrt_llm',
inputs,
request_id=request_id,
)
if stop_inputs is not None:
time.sleep(FLAGS.stop_after_ms / 1000.0)
triton_client.async_stream_infer(
'tensorrt_llm',
stop_inputs,
request_id=request_id,
parameters={'Streaming': FLAGS.streaming})
# Wait for server to close the stream
triton_client.stop_stream()
# Parse the responses
while True:
try:
result = user_data._completed_requests.get(block=False)
except Exception:
break
if type(result) == InferenceServerException:
print("Received an error from server:")
print(result)
else:
output_ids = result.as_numpy('output_ids')
if output_ids is not None:
if (FLAGS.streaming):
# Only one beam is supported
tokens = list(output_ids[0][0])
actual_output_ids[
0] = actual_output_ids[0] + tokens
else:
for beam_output_ids in output_ids[0]:
tokens = list(beam_output_ids)
actual_output_ids.append(tokens)
else:
print("Got cancellation response from server")
else:
# Send request
triton_client.async_infer(
'tensorrt_llm',
inputs,
request_id=request_id,
callback=partial(callback, user_data),
parameters={'Streaming': FLAGS.streaming})
if stop_inputs is not None:
time.sleep(FLAGS.stop_after_ms / 1000.0)
triton_client.async_infer(
'tensorrt_llm',
stop_inputs,
request_id=request_id,
callback=partial(callback, user_data),
parameters={'Streaming': FLAGS.streaming})
processed_count = 0
expected_responses = 1 + (1 if stop_inputs is not None else 0)
while processed_count < expected_responses:
try:
result = user_data._completed_requests.get()
print("Got completed request", flush=True)
except Exception:
break
if type(result) == InferenceServerException:
print("Received an error from server:")
print(result)
else:
output_ids = result.as_numpy('output_ids')
if output_ids is not None:
for beam_output_ids in output_ids[0]:
tokens = list(beam_output_ids)
actual_output_ids.append(tokens)
else:
print("Got response for cancellation request")
processed_count = processed_count + 1
except Exception as e:
print("channel creation failed: " + str(e))
sys.exit()
passed = True
print("output_ids = ", actual_output_ids)
if (FLAGS.check_output):
passed = (actual_output_ids == expected_output_ids)
print("expected_output_ids = ", expected_output_ids)
print("\n=====")
print("PASS!" if passed else "FAIL!")
print("=====")
sys.exit(not passed)
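Because this new client only works against the config.pbtxt settings spelled out in its header comment, a hypothetical pre-flight helper (not part of this change) could verify them through the same gRPC client before streaming:

import tritonclient.grpc as grpcclient


def check_inflight_batching(url="localhost:8001", model="tensorrt_llm"):
    # Confirm the two settings the streaming client relies on: a decoupled
    # transaction policy and gpt_model_type set to "inflight_batching".
    with grpcclient.InferenceServerClient(url=url) as client:
        config = client.get_model_config(model).config
        decoupled = config.model_transaction_policy.decoupled
        model_type = config.parameters["gpt_model_type"].string_value
        return decoupled and model_type == "inflight_batching"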

View File

@ -2145,6 +2145,7 @@ def test_llama_v3_speculative_decoding_bls(
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
llama_v3_70b_model_root,
tensorrt_llm_example_root,
llm_backend_inflight_batcher_llm_root,
llm_backend_dataset_root,
llm_backend_venv,
@ -2161,16 +2162,19 @@ def test_llama_v3_speculative_decoding_bls(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build engine
DRAFT_ENGINE_DIR = prepare_llama_v3_8b_engine(
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
data_type=DATA_TYPE)
CONTROL_ENGINE_DIR = prepare_llama_v3_70b_engine(
"control_ifb",
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type=DATA_TYPE)
TARGET_ENGINE_DIR = prepare_llama_v3_70b_engine(
"target_ifb",
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type=DATA_TYPE)
@ -2310,6 +2314,7 @@ def test_gpt_175b_dummyWeights_ifb(
EXCLUDE_INPUT_IN_OUTPUT,
inflight_batcher_llm_client_root,
tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root,
gpt_tokenizer_model_root,
llm_backend_venv,
):
@ -2321,7 +2326,8 @@ def test_gpt_175b_dummyWeights_ifb(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build Engine
ENGINE_PATH = prepare_gpt_175b_engine("ifb", tensorrt_llm_gpt_example_root)
ENGINE_PATH = prepare_gpt_175b_engine("ifb", tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root)
# Prepare model repo
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
prepare_ib_model_repo(llm_backend_repo_root, new_model_repo)

View File

@ -86,7 +86,8 @@ def test_valgrind_llama_v2_13b(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build engine
ENGINE_PATH = prepare_llama_v2_13b_engine(tensorrt_llm_llama_example_root,
ENGINE_PATH = prepare_llama_v2_13b_engine(tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v2_tokenizer_model_root)
# Prepare model repo

View File

@ -10,6 +10,7 @@ from .common import *
@pytest.mark.skip_less_device_memory(80000)
def test_gpt175b_dummyWeights_multi_node_engine_config(
tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root,
gpt_tokenizer_model_root,
):
ACCUMULATE_TOKEN = "False"
@ -36,7 +37,8 @@ def test_gpt175b_dummyWeights_multi_node_engine_config(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build Engine
ENGINE_PATH = prepare_gpt_multi_node_engine("ifb",
tensorrt_llm_gpt_example_root)
tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root)
# Prepare model repo
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
prepare_ib_model_repo(llm_backend_repo_root, new_model_repo)

View File

@ -42,7 +42,7 @@ def get_rcca_path():
@pytest.mark.parametrize("KV_CACHE_FREE_GPU_MEM_FRACTION", [""])
@pytest.mark.parametrize("ENABLE_TRT_OVERLAP", ["False"],
ids=["disableTrtOverlap"])
@pytest.mark.parametrize("BATCHING_STRATEGY", ["V1"])
@pytest.mark.parametrize("BATCHING_STRATEGY", ["inflight_fused_batching"])
@pytest.mark.parametrize("DECOUPLED_MODE", ["False"],
ids=["disableDecoupleMode"])
@pytest.mark.parametrize("TRITON_MAX_BATCH_SIZE", ["128"])
@ -618,6 +618,7 @@ def test_rcca_bug_4714193(
TOP_K,
TOP_P,
TEMPERATURE,
tensorrt_llm_example_root,
tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root,
llm_backend_root,
@ -631,8 +632,8 @@ def test_rcca_bug_4714193(
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
# Build engine
ENGINE_PATH = prepare_rcca_nvbug_4714193_engine(
tensorrt_llm_mixtral_example_root, mixtral_8x7b_v0_1_model_root,
llm_backend_root)
tensorrt_llm_example_root, tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root, llm_backend_root)
# Prepare model repo
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")

View File

@ -6,7 +6,8 @@ import platform
import signal
import subprocess
import sys
import tempfile
import time
import warnings
import psutil
@ -68,7 +69,9 @@ if is_linux():
return pids
def cleanup_process_tree(p: subprocess.Popen, has_session=False):
def cleanup_process_tree(p: subprocess.Popen,
has_session=False,
verbose_message=False):
target_pids = set()
if has_session:
# Session ID is the pid of the leader process
@ -82,8 +85,30 @@ if is_linux():
except psutil.Error:
pass
print("Found leftover pids:", target_pids)
for pid in target_pids:
persist_pids = []
if target_pids:
# Grace period
time.sleep(5)
lines = []
for pid in sorted(target_pids):
try:
sp = psutil.Process(pid)
if verbose_message:
cmdline = sp.cmdline()
lines.append(f"{pid}: {cmdline}")
persist_pids.append(pid)
except psutil.Error:
pass
if persist_pids:
msg = f"Found leftover subprocesses: {persist_pids} launched by {p.args}"
if verbose_message:
detail = '\n'.join(lines)
msg = f"{msg}\n{detail}"
warnings.warn(msg)
for pid in persist_pids:
try:
os.kill(pid, signal.SIGKILL)
except (ProcessLookupError, PermissionError):
@ -148,6 +173,29 @@ elif is_windows():
p.kill()
@contextlib.contextmanager
def popen(*popenargs,
start_new_session=True,
suppress_output_info=False,
**kwargs):
if not suppress_output_info:
print(f"Start subprocess with popen({popenargs}, {kwargs})")
with Popen(*popenargs, start_new_session=start_new_session, **kwargs) as p:
try:
yield p
if start_new_session:
cleanup_process_tree(p, True, True)
except Exception as e:
cleanup_process_tree(p, start_new_session)
if isinstance(e, subprocess.TimeoutExpired):
print("Process timed out.")
stdout, stderr = p.communicate()
e.output = stdout
e.stderr = stderr
raise
def call(*popenargs,
timeout=None,
start_new_session=True,
@ -155,31 +203,11 @@ def call(*popenargs,
**kwargs):
if not suppress_output_info:
print(f"Start subprocess with call({popenargs}, {kwargs})")
running_log = None
if "running_log" in kwargs:
if isinstance(kwargs["running_log"], tempfile._TemporaryFileWrapper):
running_log = kwargs["running_log"]
kwargs.pop("running_log", 'Not Found')
with Popen(*popenargs,
with popen(*popenargs,
start_new_session=start_new_session,
stdout=running_log,
suppress_output_info=True,
**kwargs) as p:
try:
retcode = p.wait(timeout=timeout)
if retcode and start_new_session:
cleanup_process_tree(p, True)
return retcode
except Exception as e:
if isinstance(e, subprocess.TimeoutExpired):
print("Process timed out.")
stdout, stderr = p.communicate()
if stdout:
print("STDOUT:", stdout.decode('utf-8', errors='replace'))
if stderr:
print("STDERR:", stderr.decode('utf-8', errors='replace'))
cleanup_process_tree(p, start_new_session)
raise
return p.wait(timeout=timeout)
def check_call(*popenargs, **kwargs):
@ -212,9 +240,9 @@ def check_output(*popenargs, timeout=None, start_new_session=True, **kwargs):
cleanup_process_tree(process, start_new_session)
raise
retcode = process.poll()
if start_new_session:
cleanup_process_tree(process, True, True)
if retcode:
if start_new_session:
cleanup_process_tree(process, True)
raise subprocess.CalledProcessError(retcode,
process.args,
output=stdout,
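After this refactor, call()/check_call() simply forward to the new popen() context manager, which attaches any captured output to a TimeoutExpired and always sweeps leftover children on exit. A simplified, self-contained sketch of that control flow (placeholder command; the real helper uses cleanup_process_tree rather than a bare kill):

import contextlib
import subprocess


@contextlib.contextmanager
def popen_sketch(*popenargs, **kwargs):
    with subprocess.Popen(*popenargs, **kwargs) as p:
        try:
            yield p
        except subprocess.TimeoutExpired as e:
            # Stand-in for cleanup_process_tree(): stop the process, then
            # attach whatever it printed to the exception being re-raised.
            p.kill()
            e.output, e.stderr = p.communicate()
            raise
        finally:
            # Make sure nothing outlives the context on any exit path.
            if p.poll() is None:
                p.kill()
            p.wait()


with popen_sketch(["sleep", "1"]) as p:
    retcode = p.wait(timeout=10)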

View File

@ -375,6 +375,14 @@ accuracy/test_cli_flow.py::TestLlama3_2_1B::test_fp8_rowwise
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_weight_streaming[1.0]
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache_beam_search
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_smooth_quant
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_smooth_quant_ootb
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_int4_awq
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_int4_awq_int8_kv_cache
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_pp2
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_rowwise
accuracy/test_cli_flow.py::TestMistral7B::test_beam_search
accuracy/test_cli_flow.py::TestMistral7B::test_fp8_tp4pp2
accuracy/test_cli_flow.py::TestMistral7B::test_smooth_quant_tp4pp1
@ -425,7 +433,7 @@ accuracy/test_llm_api.py::TestMixtral8x7B::test_tp2
accuracy/test_llm_api.py::TestMixtral8x7B::test_smooth_quant_tp2pp2
accuracy/test_llm_api.py::TestMixtral8x7BInstruct::test_awq_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_decoder
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
accuracy/test_llm_api_pytorch.py::TestLlama3_3_70BInstruct::test_fp8_tp4
accuracy/test_llm_api_pytorch.py::TestLlama3_3_70BInstruct::test_nvfp4_tp4
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
@ -445,8 +453,16 @@ accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp_
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestMinitron4BBaseInstruct::test_fp8_prequantized
accuracy/test_llm_api_pytorch.py::TestNemotronNas::test_auto_dtype_tp8
accuracy/test_llm_api_pytorch.py::TestNemotronSuper::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2
accuracy/test_llm_api_pytorch.py::TestNemotronNano::test_auto_dtype
accuracy/test_cli_flow.py::TestNemotronNano::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
accuracy/test_cli_flow.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
accuracy/test_cli_flow.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
accuracy/test_llm_api_pytorch.py::TestNemotronH::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestQwen2_7BInstruct::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_8gpus[latency]

View File

@ -24,6 +24,7 @@ test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-FP8-llama-3.1-model/Llama-
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-8B]
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-70B]
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-FP8-llama-3.1-model/Llama-3.1-70B-Instruct-FP8]
test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-NVFP4-nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf]
test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-FP8-nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8]
test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-NVFP4-nvfp4-quantized/Mixtral-8x7B-Instruct-v0.1]
test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-FP8-Mixtral-8x7B-Instruct-v0.1-fp8]

View File

@ -101,6 +101,7 @@ accuracy/test_cli_flow.py::TestLlama3_1_8B::test_fp8_rowwise_tp4[disable_gemm_al
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_autoq
accuracy/test_cli_flow.py::TestLlama3_1_8BInstruct::test_medusa_fp8_prequantized
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_nvfp4_prequantized_tp4
accuracy/test_cli_flow.py::TestMistral7B::test_fp8_tp4pp2
@ -120,14 +121,21 @@ accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype[tp8-c
accuracy/test_llm_api_pytorch.py::TestMixtral8x7B::test_fp8_tp2
accuracy/test_llm_api_pytorch.py::TestMixtral8x7B::test_nvfp4_tp2
accuracy/test_llm_api_pytorch.py::TestNemotronNas::test_auto_dtype_tp8
accuracy/test_llm_api_pytorch.py::TestNemotronSuper::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
accuracy/test_llm_api_pytorch.py::TestNemotronNano::test_auto_dtype
accuracy/test_cli_flow.py::TestNemotronNano::test_auto_dtype
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8-cuda_graph=False]
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=False]
accuracy/test_cli_flow.py::TestNemotronUltra::test_auto_dtype[tp8-cuda_graph=False]
accuracy/test_cli_flow.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestQwen3_8B::test_fp8_block_scales[latency]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8_block_scales[latency]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[latency_moe_cutlass]
accuracy/test_llm_api_pytorch.py::TestPhi4MiniInstruct::test_auto_dtype
# Pivot to Pytorch test cases.
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-BF16-llama-3.1-model/Meta-Llama-3.1-8B]

View File

@ -58,9 +58,6 @@ trt_llm_release_perf_sanity_test:
# E2E gptManagerBenchmark IFB
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-cppmanager-exe-static_batching-plugin_ifb-float16-bs:8+64-input_output_len:128,128+512,32]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-cppmanager-exe-plugin_ifb-bfloat16-gwp:0.0-input_output_len:128,128+512,32]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:512,32]
- perf/test_perf.py::test_perf[qwen2_7b_instruct-bench-float16-input_output_len:128,128]

View File

@ -49,8 +49,14 @@ trt_llm_release_perf_test:
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-bfloat16-maxbs:64-input_output_len:20000,2000-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-bfloat16-maxbs:64-input_output_len:20000,2000-quant:fp8-con:250]
# pyt backend
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-input_output_len:2000,2000]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-maxnt:5000-input_output_len:5000,500-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:500,2000-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:1000,1000-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-maxnt:20000-input_output_len:20000,2000-reqs:8-con:1]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:5000,500-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:500,2000-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:1000,1000-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:20000,2000-reqs:500-con:250]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:512,32]

View File

@ -19,5 +19,6 @@ l0_dgx_h200:
- accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_fp8_blockscale[latency] # 1h
- accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype[True]
- accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype[False]
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-enable_graph-tp8-trtllm-scout]
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-disable_adp-enable_graph-tp8-trtllm-scout]
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep4-enable_adp-enable_graph-tp8-trtllm-scout]
- unittest/llmapi/test_llm_pytorch.py::test_nemotron_nas_lora

View File

@ -25,6 +25,7 @@ l0_rtx_pro_6000:
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-FP8-llama-3.1-model/Llama-3.1-8B-Instruct-FP8]
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-70B]
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-FP8-llama-3.1-model/Llama-3.1-70B-Instruct-FP8]
- test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-NVFP4-nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf]
- test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-FP8-nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8]
- test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-NVFP4-nvfp4-quantized/Mixtral-8x7B-Instruct-v0.1]
- test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-FP8-Mixtral-8x7B-Instruct-v0.1-fp8]

View File

@ -83,7 +83,6 @@ full:B200_PCIe/examples/test_llama.py::test_llm_llama_v2_lora_1gpu[chinese-llama
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-mini-128k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-8k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/functional SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/quantization SKIP (Disable for Blackwell)
full:B200_PCIe/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
@ -174,7 +173,6 @@ full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-128k
full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3-mini-128k-instruct-fp8-float16] SKIP (Disable for Blackwell)
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3.5-mini-instruct-fp8-float16] SKIP (Disable for Blackwell)
full:B200/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
full:B200/unittest/trt/functional SKIP (Disable for Blackwell)
full:B200/unittest/trt/quantization SKIP (Disable for Blackwell)
full:B200/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
@ -330,11 +328,6 @@ full:B200/test_e2e.py::test_ptp_quickstart_advanced[Nemotron4_4B-BF16-nemotron/M
full:B200/test_e2e.py::test_ptp_scaffolding[DeepSeek-R1-Distill-Qwen-7B-DeepSeek-R1/DeepSeek-R1-Distill-Qwen-7B] SKIP (https://nvbugs/5136994)
full:B200/test_e2e.py::test_trtllm_bench_pytorch_backend_sanity[meta-llama/Llama-3.1-8B-llama-3.1-8b-hf-nvfp4-False-False] SKIP (https://nvbugs/5136994)
examples/test_multimodal.py::test_llm_multimodal_general[kosmos-2-pp:1-tp:1-float16-bs:8-cpp_e2e:True-nb:1] SKIP (https://nvbugs/5141288)
examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2_vl_7b_instruct-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_qwen_single_gpu_summary[qwen2_vl_7b_instruct-enable_paged_kv_cache-enable_remove_input_padding-disable_weight_only-disable_fmha] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_qwen_single_gpu_summary[qwen2_vl_7b_instruct-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha_fp32_acc] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_qwen_awq_single_gpu_summary[qwen2_vl_7b_instruct-nb:4] SKIP (https://nvbugs/5141290)
examples/test_qwen.py::test_llm_hf_qwen_quantization_1gpu[qwen2_vl_7b_instruct-fp8-bfloat16] SKIP (https://nvbugs/5141290)
unittest/_torch/auto_deploy/integration/test_lm_eval.py SKIP (https://nvbugs/5144854)
examples/test_qwen.py::test_llm_qwen1_5_moe_plugin_single_gpu_lora[qwen1.5_moe_a2.7b_chat-Upcycled-Qwen1.5-MoE2.7B-LoRA] SKIP (https://nvbugs/5155141)
@ -368,7 +361,6 @@ full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:w4
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:int8_sq_per_tensor] SKIP (https://nvbugspro.nvidia.com/bug/5161074)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:int8_sq_per_token_channel] SKIP (https://nvbugspro.nvidia.com/bug/5161074)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_cpp_session-recurrentgemma-2b-use_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5174573)
examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (https://nvbugs/5180961)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-no_paged_cache-disable_quant-float16-disable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-no_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-use_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
@ -401,6 +393,9 @@ perf/test_perf.py::test_perf[t5-bench-float16-input_output_len:128,20-gpus:2] SK
perf/test_perf.py::test_perf[t5-bench-float16-maxbs:1-input_output_len:128,20-gpus:2] SKIP
perf/test_perf.py::test_perf[gpt_20b-bench-float16-maxbs:8-input_output_len:128,128-reqs:80-gpus:8] SKIP
perf/test_perf.py::test_perf[gpt_20b-bench-float16-maxbs:8-input_output_len:512,32-reqs:80-gpus:8] SKIP
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:512-input_output_len:128,128-ep:8-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:1-input_output_len:1000,2000-reqs:10-ep:4-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:384-maxnt:1536-input_output_len:1000,2000-reqs:49152-con:3072-ep:8-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[deepseek_v3_lite_fp8-bench-pytorch-float8-input_output_len:128,128] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[mixtral_8x7b_v0.1_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-tp:2-gpus:2] SKIP #https://docs.google.com/spreadsheets/d/1EvwCcJ5o2zmhVxFFxAAz-49UzswMlfN2y5K37Fkyw7A/edit?gid=907483661#gid=907483661
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[llama_v3.3_nemotron_49b-bench-pytorch-bfloat16-input_output_len:128,128-tp:2-gpus:2] SKIP #https://docs.google.com/spreadsheets/d/1EvwCcJ5o2zmhVxFFxAAz-49UzswMlfN2y5K37Fkyw7A/edit?gid=907483661#gid=907483661
@ -413,6 +408,7 @@ accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache SKIP (https://n
test_e2e.py::test_ptp_quickstart_multimodal[NVILA-8B-FP16-vila/NVILA-8B-image] SKIP (https://nvbugs/5233423)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[ep4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[tp4-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False] SKIP (https://nvbugs/5294983)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[tp4-mtp_nextn=2-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[ep4-mtp_nextn=2-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False] SKIP (https://nvbugs/5234002)
@ -426,13 +422,13 @@ examples/test_bert.py::test_llm_bert_general[compare_hf-enable_remove_input_padd
examples/test_bert.py::test_llm_bert_general[compare_hf-enable_remove_input_padding-use_attention_plugin-enable_context_fmha-tp:2-pp:1-float16-RobertaForQuestionAnswering-bert/roberta-base-squad2] SKIP (https://nvbugs/5234058)
disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0] SKIP (https://nvbugs/5247271)
disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_tp1_attention_dp_overlap_one_mtp[DeepSeek-V3-Lite-fp8] SKIP (https://nvbugspro.nvidia.com/bug/5273945)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-disable_adp-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep4-enable_adp-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_tp4[enable_gemm_allreduce_plugin] SKIP (https://nvbugs/5247786)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen1.5_7b_chat-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2_7b_instruct-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2.5_7b_chat-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_mixtral.py::test_llm_mixtral_pp_reduce_scatter_4gpus[Mixtral-8x7B-v0.1] SKIP (https://nvbugs/5247837)
examples/test_qwen.py::test_llm_qwen_smooth_quant_single_gpu_summary[qwen2_vl_7b_instruct-enable_ptpc-nb:4] SKIP (https://nvbugs/5273694)
accuracy/test_cli_flow.py::TestMixtral8x22B::test_int8_plugin_tp8[renormalize-tensor_parallel] SKIP (https://nvbugs/5273695)
test_e2e.py::test_ptp_quickstart_advanced_8gpus[Nemotron-Ultra-253B-nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1] SKIP (https://nvbugs/5273697)
examples/test_gpt.py::test_starcoder_fp8_quantization_2gpu[starcoder] SKIP (https://nvbugs/5144931)
@ -440,7 +436,6 @@ examples/test_gpt.py::test_starcoder_fp8_quantization_2gpu[starcoderplus] SKIP (
unittest/_torch -k "not (modeling or multi_gpu or auto_deploy)" SKIP (https://nvbugs/5280806)
examples/test_whisper.py::test_llm_whisper_general[large-v3-disable_gemm_plugin-disable_attention_plugin-disable_weight_only-float16-nb:1-use_python_runtime] SKIP (https://nvbugs/5244570)
unittest/_torch/speculative/test_eagle3.py SKIP (https://nvbugs/5280806)
test_e2e.py::test_ptp_quickstart_multimodal[qwen2-vl-7b-instruct-Qwen2-VL-7B-Instruct-image] SKIP (https://nvbugs/5226211)
triton_server/test_triton_rcca.py::test_mistral_beam_search[rcca_4714407-True-10-False-True-False-0-128-disableDecoupleMode-inflight_fused_batching-disableTrtOverlap-guaranteed_no_evict-1-1-1-False-ensemble] SKIP (https://nvbugs/5240060)
triton_server/test_triton.py::test_triton_extensive[triton-extensive] SKIP
triton_server/test_triton.py::test_gpt_speculative_decoding[gpt-speculative-decoding] SKIP

View File

@ -19,15 +19,22 @@ from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
@pytest.mark.parametrize("tp_size", [1, 8], ids=["tp1", "tp8"])
@pytest.mark.parametrize("use_cuda_graph", [True, False],
ids=["enable_graph", "disable_graph"])
@pytest.mark.parametrize("enable_attention_dp", [True, False],
ids=["enable_adp", "disable_adp"])
@pytest.mark.parametrize("ep_size", [4, 1], ids=["ep4", "ep1"])
@pytest.mark.parametrize("pp_size", [1, 8], ids=["pp1", "pp8"])
def test_llama4(model_name, backend, tp_size, use_cuda_graph, ep_size, pp_size):
def test_llama4(model_name, backend, tp_size, use_cuda_graph,
enable_attention_dp, ep_size, pp_size):
if pp_size > 1 and (ep_size > 1 or tp_size > 1):
return
if pp_size == 1 and tp_size == 1:
return
if enable_attention_dp and not (tp_size == 8 and ep_size == 4
and pp_size == 1):
pytest.skip("Skip this attention DP test case to avoid too many tests")
prompts = [{
"prompt": "The president of the United States is"
}, {
@ -52,6 +59,7 @@ def test_llama4(model_name, backend, tp_size, use_cuda_graph, ep_size, pp_size):
moe_tensor_parallel_size=tp_size // ep_size,
pytorch_backend_config=pytorch_config,
pipeline_parallel_size=pp_size,
enable_attention_dp=enable_attention_dp,
)
with llm:
outputs = llm.generate(