Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-13 22:18:36 +08:00
Release 0.20 to main (#4577)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
parent b800adc65c
commit fbec0c3552
@ -1,7 +1,7 @@
version: "3.9"
services:
  tensorrt_llm-dev:
    image: urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400
    image: urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539
    network_mode: host
    ipc: host
@ -1,2 +1,9 @@
# These vulnerabilities were inherited from the base image (pytorch:25.05-py3) and should be removed when the base image
# is updated.

# WAR against https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
h11>=0.16.0
# WAR against https://github.com/advisories/GHSA-7cx3-6m66-7c5m
tornado>=6.5.0
# WAR against https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
setuptools>=78.1.1
@ -72,9 +72,14 @@ RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh
RUN pip3 uninstall -y opencv && rm -rf /usr/local/lib/python3*/dist-packages/cv2/
RUN pip3 install opencv-python-headless --force-reinstall --no-deps --no-cache-dir

# WAR against https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
RUN pip3 install --upgrade h11>=0.16 --no-cache-dir

# WARs against security issues inherited from pytorch:25.04
# * https://github.com/advisories/GHSA-vqfr-h8mv-ghfj
# * https://github.com/advisories/GHSA-7cx3-6m66-7c5m
# * https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
RUN pip3 install --upgrade --no-cache-dir \
    "h11>=0.16" \
    "tornado>=6.5.0" \
    "setuptools>=78.1.1,<80"

FROM ${TRITON_IMAGE}:${TRITON_BASE_TAG} AS triton

@ -173,5 +178,9 @@ RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh
FROM release AS tritonrelease

WORKDIR /app/tensorrt_llm
COPY ./triton_backend/ ./triton_backend/
COPY ./triton_backend/all_models ./triton_backend/all_models
COPY ./triton_backend/scripts ./triton_backend/scripts
COPY ./triton_backend/tools ./triton_backend/tools
COPY ./triton_backend/inflight_batcher_llm/scripts ./triton_backend/inflight_batcher_llm/scripts
COPY ./triton_backend/inflight_batcher_llm/client ./triton_backend/inflight_batcher_llm/client
COPY --from=tritonbuild /opt/tritonserver/backends/tensorrtllm /opt/tritonserver/backends/tensorrtllm
docs/source/advanced/kv-cache-management.md (new file, 75 lines)
@ -0,0 +1,75 @@
(kv-cache-management)=

# KV Cache Management: Pools, Blocks, and Events

This document provides an overview of the internal hierarchy and event system for paged KV cache management, as implemented in the TensorRT-LLM codebase.

For more information on KV cache reuse, see [KV cache reuse](kv-cache-reuse.md).

---

## Hierarchy: Pool, Block, and Page

### **Block**
- **Definition:** The smallest unit of KV cache allocation. A `KVCacheBlock` holds metadata (not the actual data) for a chunk of KV cache.
- **Purpose:** Each block represents a fixed number of tokens' worth of KV data (configurable via the `tokens_per_block` parameter); see the short example below.
- **Usage:** Blocks are allocated, reused, or evicted as sequences are processed.

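As a rough, illustrative example (the helper below and the value `tokens_per_block=32` are invented for this sketch, not taken from the codebase), the number of blocks a sequence occupies is simply its token count divided by `tokens_per_block`, rounded up:

```python
import math

def blocks_needed(num_tokens: int, tokens_per_block: int = 32) -> int:
    """Number of KV cache blocks needed to hold `num_tokens` tokens."""
    return math.ceil(num_tokens / tokens_per_block)

# A 100-token sequence with 32 tokens per block occupies 4 blocks;
# the last block is only partially filled.
assert blocks_needed(100, 32) == 4
```
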
### **Page**
- **Definition:** In this codebase, "page" is often used interchangeably with "block" (as in "paged KV cache"). Strictly speaking, a page could refer to a hardware-level memory page, while a block is a logical unit of the cache.
- **In Practice:** The code uses "block" as the main unit; "page" is not a distinct class or struct.

### **Pool**
- **Definition:** A pool is a contiguous memory buffer (or set of buffers) that holds the actual KV data for one or more layers.
- **Types:** There are primary pools (fast GPU memory) and secondary pools (slower memory, e.g., CPU or offload memory).
- **Organization:** Each pool can serve multiple layers that share the same KV head configuration. Pools are managed by `KVCacheBlockPool` and tracked in vectors in `WindowBlockManager`.
- **Block ↔ Pool:** Each block is an index into a pool; the pool provides the actual storage, while the block is the metadata handle. A minimal sketch of this indexing follows below.

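To make the block-to-pool relationship concrete, here is a minimal, hypothetical sketch; the tensor layout and names are illustrative only and do not reflect the actual TensorRT-LLM memory layout. The pool owns one large buffer, and a block's index simply selects a slice of it:

```python
import torch

tokens_per_block = 32
num_blocks, num_kv_heads, head_dim = 1024, 8, 128

# One pool: a single buffer holding K and V for every block it owns.
# Illustrative layout: [num_blocks, 2 (K/V), tokens_per_block, num_kv_heads, head_dim].
pool = torch.empty(num_blocks, 2, tokens_per_block, num_kv_heads, head_dim,
                   dtype=torch.float16)

# A "block" is only metadata: an index into the pool plus bookkeeping state.
block_index = 42
block_storage = pool[block_index]  # the actual KV storage backing this block
```
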
### **WindowBlockManager/BlockManager**

TRT-LLM supports two advanced features related to KV cache management:
1. **Variable Group-Query Attention (VGQA)** - i.e., a different `num_kv_heads` value for different layers.
2. **Variable Sliding Window Attention (VSWA)** - i.e., a different `attention_window_size` value for different layers.

To support both of these features, pool management works as described below.

In the simple, *most common* case, which applies to most models, where
1. [MHA/MQA/Non-variable GQA](gpt-attention.md#multi-head-multi-query-and-group-query-attention) is used, i.e., the same `num_kv_heads` value for all layers, and
2. Global attention/[SWA](gpt-attention.md#sliding-window-attention-cyclic-rolling-buffer-kv-cache) is used, i.e., the same `attention_window_size` value for all layers,

only a *single* pool is created within the structure described below.

#### KV Cache Pool Management

- **WindowBlockManager:** Manages blocks and pools for a specific attention window size. Within a `WindowBlockManager`, there can be multiple pools, each corresponding to a unique number of KV heads, i.e., to support VGQA.
- **BlockManager:** Manages all `WindowBlockManager` instances, one per unique window size.

**Hierarchy Summary:**
- **Pool** (memory buffer for KV data)
  - Contains many blocks.
- **Blocks** (metadata for a chunk of the pool; each block = `tokens_per_block` tokens)
  - (Optionally, blocks can be swapped between primary/secondary pools.)
- **BlockManager/WindowBlockManager**: Manage pools and blocks, and handle allocation, reuse, and eviction. A schematic sketch of this hierarchy follows below.

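The following sketch models the hierarchy only; the class and field names are invented for illustration and do not match the real implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """Metadata handle: which pool the data lives in and where."""
    pool_index: int          # which pool holds this block's storage
    block_index: int         # offset of the block inside that pool
    is_primary: bool = True  # primary (GPU) vs. secondary (offload) copy

@dataclass
class Pool:
    """Owns the storage for all layers that share one KV-head configuration."""
    num_kv_heads: int
    blocks: list[Block] = field(default_factory=list)

@dataclass
class WindowBlockManager:
    """One instance per unique attention window size; may own several pools (VGQA)."""
    window_size: int
    pools: list[Pool] = field(default_factory=list)

@dataclass
class BlockManager:
    """Top level: one WindowBlockManager per unique window size (VSWA)."""
    window_managers: dict[int, WindowBlockManager] = field(default_factory=dict)

# In the most common case (same num_kv_heads and same attention window for all
# layers), this collapses to a single WindowBlockManager holding a single Pool.
```
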
---

## Events in `KVCacheEventManager`

The `KVCacheEventManager` is responsible for tracking and reporting significant changes in the state of the KV cache. Events are used for logging, debugging, or possibly for external monitoring.

### **Types of Events**
- **Created Event:** When pools or blocks are created/allocated.
- **Updated Event:** When a block's state changes (e.g., moved between primary/secondary, priority updated).
- **Removed Event:** When a block is removed from the cache (evicted or released).
- **Stored Event:** When blocks are stored for potential reuse (e.g., after a sequence finishes and its blocks are reusable).

### **What Triggers an Event?**
- **Allocation/Deallocation:** Creating or freeing memory pools or blocks.
- **Eviction/Reuse:** When a block is evicted, reused, or its priority changes.
- **Block Movement:** When a block is moved between memory levels (primary ↔ secondary).
- **Block Storage:** When blocks are stored for future reuse (e.g., after a sequence completes).

**In summary:**
An "event" is any significant change in the lifecycle or state of a KV cache block or pool, tracked for monitoring, debugging, or optimization purposes.

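As a hedged illustration of what consuming such events could look like (the enum values and event record below are invented for this sketch and are not the actual TensorRT-LLM event API):

```python
from dataclasses import dataclass
from enum import Enum, auto
import time

class KVCacheEventType(Enum):
    CREATED = auto()   # pools/blocks allocated
    UPDATED = auto()   # block moved between memory levels or priority changed
    REMOVED = auto()   # block evicted or released
    STORED = auto()    # blocks stored for potential reuse

@dataclass
class KVCacheEvent:
    event_type: KVCacheEventType
    block_id: int
    timestamp: float

def log_event(event: KVCacheEvent) -> None:
    """A trivial consumer: forward events to a log for offline analysis."""
    print(f"[{event.timestamp:.3f}] block {event.block_id}: {event.event_type.name}")

log_event(KVCacheEvent(KVCacheEventType.STORED, block_id=7, timestamp=time.time()))
```
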
---
@ -104,6 +104,7 @@ Welcome to TensorRT-LLM's Documentation!
advanced/inference-request.md
advanced/lora.md
advanced/expert-parallelism.md
advanced/kv-cache-management.md
advanced/kv-cache-reuse.md
advanced/speculative-decoding.md
advanced/disaggregated-service.md
@ -4,6 +4,8 @@ In Transformer-based models, the KV (Key-Value) Cache is a mechanism used to opt
Since KV Cache requires memory to store, it is also an important resource.
In TensorRT-LLM, KV Cache is managed by the `KVCacheManager`.

For details of the TensorRT-LLM `KVCacheManager` implementation see [KV Cache Management](../advanced/kv-cache-management.md).

## KV Cache Manager Introduction

`KVCacheManager` is a type of resource manager, inheriting from `BaseResourceManager`.
@ -1,24 +0,0 @@
|
||||
#!/bin/bash
|
||||
dataset="template_trtllm_openai_completions.json"
|
||||
output_folder="output_loadgen"
|
||||
port=8000
|
||||
host="localhost"
|
||||
max_count=256
|
||||
model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
|
||||
streaming="False"
|
||||
input_tokens=128
|
||||
output_tokens=128
|
||||
concurrency=32
|
||||
|
||||
infserver_loadgen ${dataset} \
|
||||
--output_dir "${output_folder}" \
|
||||
--set dataset.input_tokens:int="${input_tokens}" \
|
||||
--set dataset.output_tokens:int="${output_tokens}" \
|
||||
--set dataset.max_count:int="${max_count}" \
|
||||
--set dataset.model_name:str="${model_name}" \
|
||||
--set dataset.max_concurrent_requests:int="${concurrency}" \
|
||||
--set inference_server.host:str="${host}" \
|
||||
--set inference_server.port:int="${port}" \
|
||||
--set post_processors[0].model_name:str="${model_name}" \
|
||||
--set timing_strategy.desired_rps:float="-1" \
|
||||
--set inference_server.inference_server_config.stream:bool="${streaming}"
|
||||
@ -1,24 +0,0 @@
|
||||
{
|
||||
"dataset": {
|
||||
"type": "fixed_isl_osl"
|
||||
},
|
||||
"inference_server": {
|
||||
"type": "trtllm_openai_completions",
|
||||
"host": "test",
|
||||
"port": null,
|
||||
"inference_server_config": {
|
||||
"model_name": "test"
|
||||
}
|
||||
},
|
||||
"timing_strategy": {
|
||||
"type": "fixed",
|
||||
"desired_rps": -1
|
||||
},
|
||||
"post_processors": [
|
||||
{
|
||||
"type": "infbench_summary",
|
||||
"model_name": "test"
|
||||
}
|
||||
],
|
||||
"timeout": null
|
||||
}
|
||||
@ -128,7 +128,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
|
||||
--output_dir ./tmp/llama/7B/trt_engines/weight_only/1-gpu/ \
|
||||
--gemm_plugin auto
|
||||
|
||||
# Build LLaMA 7B using 2-way auto parallelism.
|
||||
# Build LLaMA 7B using 2-way auto parallelism (deprecated).
|
||||
python convert_checkpoint.py --model_dir ./tmp/llama/7B/ \
|
||||
--output_dir ./tllm_checkpoint_1gpu_fp16 \
|
||||
--dtype float16
|
||||
|
||||
@ -30,6 +30,9 @@ from utils import (DEFAULT_HF_MODEL_DIRS, add_common_args, get_beam_width_array,
|
||||
import tensorrt_llm
|
||||
import tensorrt_llm.profiler as profiler
|
||||
from tensorrt_llm._utils import mpi_broadcast, str_dtype_to_torch
|
||||
from tensorrt_llm.builder import EngineConfig
|
||||
from tensorrt_llm.functional import RopeEmbeddingUtils, RotaryScalingType
|
||||
from tensorrt_llm.layers import MropeParams
|
||||
from tensorrt_llm.logger import logger
|
||||
from tensorrt_llm.models.qwen.utils import make_context
|
||||
from tensorrt_llm.runtime import PYTHON_BINDINGS, ModelRunner
|
||||
@ -41,6 +44,42 @@ if PYTHON_BINDINGS:
|
||||
from prompt_lookup.run_dtm_pld import run_dtm_pld
|
||||
|
||||
|
||||
def ensemble_mrope_params(batch_input_ids, max_position_embeddings,
|
||||
rotary_embedding_dim, theta):
|
||||
mrope_params = MropeParams()
|
||||
batch_size = len(batch_input_ids)
|
||||
|
||||
_, rotary_cos_sin = RopeEmbeddingUtils.create_sinusoidal_positions_for_attention_plugin(
|
||||
num_pos=max_position_embeddings,
|
||||
dim=rotary_embedding_dim,
|
||||
theta=1000000.0,
|
||||
scale_type=RotaryScalingType.mrope,
|
||||
)
|
||||
rotary_cos_sin = torch.tensor(rotary_cos_sin).to(batch_input_ids[0].device)
|
||||
rotary_cos_sin = rotary_cos_sin.reshape(max_position_embeddings,
|
||||
int(rotary_embedding_dim / 2), 2)
|
||||
|
||||
cos_ori = rotary_cos_sin[:, :, 0]
|
||||
sin_ori = rotary_cos_sin[:, :, 1]
|
||||
|
||||
mrope_position_ids_padding = torch.zeros(
|
||||
(batch_size, max_position_embeddings), dtype=torch.int32)
|
||||
for i in range(batch_size):
|
||||
seq_len = batch_input_ids[i].shape[-1]
|
||||
mrope_position_ids_padding[i, :seq_len] = torch.arange(
|
||||
seq_len, device=batch_input_ids[i].device)
|
||||
|
||||
cos = cos_ori[mrope_position_ids_padding].unsqueeze(-1)
|
||||
sin = sin_ori[mrope_position_ids_padding].unsqueeze(-1)
|
||||
|
||||
mrope_params.mrope_rotary_cos_sin = torch.concatenate(
|
||||
(cos, sin), axis=-1).reshape(batch_size, -1)
|
||||
mrope_params.mrope_position_deltas = torch.zeros(
|
||||
[batch_size, 1], device=batch_input_ids[0].device)
|
||||
|
||||
return mrope_params
|
||||
|
||||
|
||||
def main(args):
|
||||
is_integration_test = os.getenv('INTEGRATION_TEST', '0') == '1'
|
||||
if is_integration_test:
|
||||
@ -262,7 +301,19 @@ def main(args):
|
||||
eval_task=eval_task,
|
||||
add_special_tokens=add_special_tokens,
|
||||
min_input_length=min_input_length)
|
||||
batch_size = len(batch_input_ids)
|
||||
# Generate mrope params for qwen model
|
||||
engine_config = EngineConfig.from_json_file(
|
||||
f"{args.engine_dir}/config.json")
|
||||
pretrain_config = engine_config.pretrained_config
|
||||
mrope_params = None
|
||||
if 'qwen' in model_name.lower():
|
||||
mrope_params = ensemble_mrope_params(
|
||||
batch_input_ids,
|
||||
max_position_embeddings=pretrain_config.max_position_embeddings,
|
||||
rotary_embedding_dim=pretrain_config.rotary_embedding_dim,
|
||||
theta=pretrain_config.rotary_base,
|
||||
)
|
||||
|
||||
if batch_size == 0:
|
||||
return [], [], [], {}
|
||||
input_lengths = [x.size(0) for x in batch_input_ids]
|
||||
@ -309,7 +360,8 @@ def main(args):
|
||||
return_dict=True,
|
||||
random_seed=random_seed,
|
||||
medusa_choices=args.medusa_choices,
|
||||
eagle_choices=args.eagle_choices)
|
||||
eagle_choices=args.eagle_choices,
|
||||
mrope_params=mrope_params)
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# Extract a list of tensors of shape beam_width x output_ids.
|
||||
|
||||
@ -28,10 +28,10 @@ UPLOAD_PATH = env.uploadPath ? env.uploadPath : "sw-tensorrt-generic/llm-artifac
|
||||
// Container configuration
|
||||
// available tags can be found in: https://urm.nvidia.com/artifactory/sw-tensorrt-docker/tensorrt-llm/
|
||||
// [base_image_name]-[arch]-[os](-[python_version])-[trt_version]-[torch_install_type]-[stage]-[date]-[mr_id]
|
||||
LLM_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
|
||||
LLM_SBSA_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-aarch64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
|
||||
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py310-trt10.10.0.31-skip-tritondevel-202505191345-4400"
|
||||
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py312-trt10.10.0.31-skip-tritondevel-202505191345-4400"
|
||||
LLM_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
|
||||
LLM_SBSA_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-aarch64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
|
||||
LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py310-trt10.10.0.31-skip-tritondevel-202505211401-4539"
|
||||
LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:cuda-12.9.0-devel-rocky8-x86_64-rocky8-py312-trt10.10.0.31-skip-tritondevel-202505211401-4539"
|
||||
|
||||
// TODO: Move common variables to an unified location
|
||||
BUILD_CORES_REQUEST = "8"
|
||||
|
||||
@ -1,7 +1,7 @@
|
||||
|
||||
import java.lang.InterruptedException
|
||||
|
||||
DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505191345-4400"
|
||||
DOCKER_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.04-py3-x86_64-ubuntu24.04-trt10.10.0.31-skip-tritondevel-202505211401-4539"
|
||||
|
||||
def createKubernetesPodConfig(image)
|
||||
{
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
import math
|
||||
import os
|
||||
import threading
|
||||
from itertools import accumulate
|
||||
from typing import List, Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
@ -116,6 +117,24 @@ def get_output_info(input: torch.Tensor, dim: int) -> List[int]:
|
||||
return {'output_shape': output_shape, 'numel_base': numel_base}
|
||||
|
||||
|
||||
def filter_valid_input(
|
||||
input_list: List[torch.Tensor]
|
||||
) -> Tuple[List[torch.Tensor], List[bool]]:
|
||||
func_valid = lambda x: x is not None
|
||||
valid_list = list(map(func_valid, input_list))
|
||||
input_list = list(filter(func_valid, input_list))
|
||||
return input_list, valid_list
|
||||
|
||||
|
||||
def restore_full_output(output_list: List[torch.Tensor],
|
||||
valid_list: List[bool]) -> List[torch.Tensor]:
|
||||
index_list = list(accumulate(map(int, valid_list)))
|
||||
output_list = list(
|
||||
map(lambda valid, index: output_list[index - 1]
|
||||
if valid else None, valid_list, index_list))
|
||||
return output_list
|
||||
|
||||
|
||||
def allgather(
|
||||
input: Union[torch.Tensor, List[torch.Tensor]],
|
||||
mapping: Mapping,
|
||||
@ -155,8 +174,10 @@ def allgather(
|
||||
if isinstance(input, torch.Tensor):
|
||||
assert input.shape[dim] == sizes[mapping.tp_rank]
|
||||
else:
|
||||
assert all(
|
||||
[val.shape[dim] == sizes[mapping.tp_rank] for val in input])
|
||||
assert all([
|
||||
val.shape[dim] == sizes[mapping.tp_rank] for val in input
|
||||
if val is not None
|
||||
])
|
||||
# 'sizes' is not needed if all inputs in the same TP group have the same shape
|
||||
for split_size in sizes[1:]:
|
||||
if split_size != sizes[0]:
|
||||
@ -170,6 +191,7 @@ def allgather(
|
||||
output_info = get_output_info(input, dim)
|
||||
input = input.contiguous().view(-1, output_info['numel_base'])
|
||||
else:
|
||||
input, valid = filter_valid_input(input)
|
||||
torch_op = torch.ops.trtllm.allgather_list
|
||||
output_info = [get_output_info(val, dim) for val in input]
|
||||
input = [
|
||||
@ -202,6 +224,7 @@ def allgather(
|
||||
convert_output(val, val_info)
|
||||
for val, val_info in zip(output, output_info)
|
||||
]
|
||||
output = restore_full_output(output, valid)
|
||||
return output
|
||||
|
||||
|
||||
@ -220,7 +243,10 @@ def reducescatter(
|
||||
if isinstance(input, torch.Tensor):
|
||||
assert input.shape[dim] == sum_split_size
|
||||
else:
|
||||
assert all([val.shape[dim] == sum_split_size for val in input])
|
||||
assert all([
|
||||
val.shape[dim] == sum_split_size for val in input
|
||||
if val is not None
|
||||
])
|
||||
# 'sizes' is not needed if all outputs in the same TP group have the same shape
|
||||
for split_size in sizes[1:]:
|
||||
if split_size != sizes[0]:
|
||||
@ -245,6 +271,7 @@ def reducescatter(
|
||||
output_info = get_output_info(input, dim)
|
||||
input = convert_input(input, output_info)
|
||||
else:
|
||||
input, valid = filter_valid_input(input)
|
||||
torch_op = torch.ops.trtllm.reducescatter_list
|
||||
output_info = [get_output_info(val, dim) for val in input]
|
||||
input = [
|
||||
@ -265,6 +292,7 @@ def reducescatter(
|
||||
val.view(val_info['output_shape'])
|
||||
for val, val_info in zip(output, output_info)
|
||||
]
|
||||
output = restore_full_output(output, valid)
|
||||
return output
|
||||
|
||||
|
||||
|
||||
@ -1124,19 +1124,13 @@ class FusedMoE(nn.Module):
|
||||
|
||||
if self.use_dp and self.parallel_size > 1 and not disable_fp4_allgather(
|
||||
) and not self.enable_alltoall:
|
||||
if x_sf is None:
|
||||
x, token_selected_slots, token_final_scales = allgather(
|
||||
[x, token_selected_slots, token_final_scales],
|
||||
self.mapping,
|
||||
dim=0,
|
||||
sizes=None if use_dp_padding else all_rank_num_tokens)
|
||||
else:
|
||||
# Fp4 gemm has extra scaling factor
|
||||
x, x_sf, token_selected_slots, token_final_scales = allgather(
|
||||
[x, x_sf, token_selected_slots, token_final_scales],
|
||||
self.mapping,
|
||||
dim=0,
|
||||
sizes=None if use_dp_padding else all_rank_num_tokens)
|
||||
x, x_sf, token_selected_slots, token_final_scales = allgather(
|
||||
[x, x_sf, token_selected_slots, token_final_scales],
|
||||
self.mapping,
|
||||
dim=0,
|
||||
sizes=None if use_dp_padding else all_rank_num_tokens)
|
||||
# Fp4 gemm has extra scaling factor
|
||||
if x_sf is not None:
|
||||
x_sf = reswizzle_sf(x_sf, x_row, x_col,
|
||||
self.scaling_vector_size)
|
||||
|
||||
|
||||
@ -149,6 +149,9 @@ def infer_builder_flags(network):
|
||||
|
||||
|
||||
def auto_parallel(network: Network, config: AutoParallelConfig):
|
||||
logger.warning(
|
||||
"auto_parallel is deprecated, "
|
||||
"please use explicit parallelism like tp_size/pp_size instead.")
|
||||
debug_mode = config.debug_mode
|
||||
memory_budget = config.get_cluster_info(
|
||||
).memory_budget_per_device * 1024 * 1024 * 1024
|
||||
|
||||
@ -1359,11 +1359,19 @@ class BaseLlmArgs(BaseModel):
|
||||
|
||||
class TrtLlmArgs(BaseLlmArgs):
|
||||
|
||||
auto_parallel: bool = Field(default=False,
|
||||
description="Enable auto parallel mode.")
|
||||
auto_parallel: bool = Field(
|
||||
default=False,
|
||||
description="Enable auto parallel mode.",
|
||||
deprecated=
|
||||
"Use tensor_parallel_size/pipeline_parallel_size/xxx_parallel_size instead.",
|
||||
)
|
||||
|
||||
auto_parallel_world_size: Optional[int] = Field(
|
||||
default=None, description="The world size for auto parallel mode.")
|
||||
default=None,
|
||||
description="The world size for auto parallel mode.",
|
||||
deprecated=
|
||||
"Use tensor_parallel_size/pipeline_parallel_size/xxx_parallel_size instead.",
|
||||
)
|
||||
|
||||
enable_tqdm: bool = Field(default=False,
|
||||
description="Enable tqdm for progress bar.")
|
||||
|
||||
@ -434,6 +434,9 @@ class CliFlowAccuracyTestHarness:
|
||||
f"--dtype={self.dtype}",
|
||||
]
|
||||
|
||||
if "nemotron_nas" in self.EXAMPLE_FOLDER:
|
||||
convert_cmd.append("--trust_remote_code")
|
||||
|
||||
if self.MODEL_FORMAT == "NEMO":
|
||||
convert_cmd.append(f"--nemo_ckpt_path={self.MODEL_PATH}")
|
||||
else:
|
||||
|
||||
@ -137,6 +137,8 @@ meta-llama/Llama-3.2-1B:
|
||||
- quant_algo: FP8
|
||||
kv_cache_quant_algo: FP8
|
||||
accuracy: 27.029
|
||||
- quant_algo: FP8
|
||||
accuracy: 27.029
|
||||
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
|
||||
accuracy: 27.257
|
||||
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
|
||||
@ -310,5 +312,3 @@ Qwen3/Qwen3-8B:
|
||||
accuracy: 30
|
||||
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
|
||||
- accuracy: 34.003
|
||||
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
|
||||
- accuracy: 27.810
|
||||
|
||||
@ -16,3 +16,12 @@ deepseek-ai/DeepSeek-R1:
|
||||
accuracy: 70.45
|
||||
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
|
||||
- accuracy: 44.95
|
||||
- quant_algo: FP8
|
||||
accuracy: 49.49
|
||||
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
|
||||
- accuracy: 40.40
|
||||
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
|
||||
- accuracy: 58.08
|
||||
- quant_algo: FP8
|
||||
kv_cache_quant_algo: FP8
|
||||
accuracy: 57.07
|
||||
|
||||
@ -72,5 +72,14 @@ Qwen3/Qwen3-235B-A22B:
|
||||
accuracy: 85.78
|
||||
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
|
||||
- accuracy: 92.57
|
||||
- quant_algo: FP8
|
||||
accuracy: 92.42
|
||||
nvidia/Nemotron-H-8B-Base-8K:
|
||||
- accuracy: 46.20
|
||||
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
|
||||
- accuracy: 37.15
|
||||
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
|
||||
- accuracy: 94.43
|
||||
- quant_algo: FP8
|
||||
kv_cache_quant_algo: FP8
|
||||
accuracy: 94.16
|
||||
|
||||
@ -28,6 +28,26 @@ meta-llama/Llama-3.1-8B-Instruct:
|
||||
- quant_algo: FP8
|
||||
kv_cache_quant_algo: FP8
|
||||
accuracy: 67.87
|
||||
meta-llama/Llama-3.2-1B:
|
||||
- quant_algo: W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN
|
||||
accuracy: 32.72
|
||||
- quant_algo: W8A8_SQ_PER_CHANNEL
|
||||
accuracy: 32.07
|
||||
- quant_algo: W4A16_AWQ
|
||||
accuracy: 30.56
|
||||
- quant_algo: W4A16_AWQ
|
||||
kv_cache_quant_algo: INT8
|
||||
accuracy: 31.29
|
||||
- quant_algo: FP8
|
||||
kv_cache_quant_algo: FP8
|
||||
accuracy: 31.02
|
||||
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
|
||||
accuracy: 33.97
|
||||
- quant_algo: FP8_PER_CHANNEL_PER_TOKEN
|
||||
extra_acc_spec: meta_recipe
|
||||
accuracy: 33.87
|
||||
- extra_acc_spec: max_attention_window_size=960
|
||||
accuracy: 32.82
|
||||
meta-llama/Llama-3.3-70B-Instruct:
|
||||
- accuracy: 81.31
|
||||
- quant_algo: NVFP4
|
||||
@ -128,9 +148,16 @@ Qwen3/Qwen3-235B-A22B:
|
||||
accuracy: 86
|
||||
nvidia/Llama-3_3-Nemotron-Super-49B-v1:
|
||||
- accuracy: 79.43
|
||||
- quant_algo: FP8
|
||||
accuracy: 79.26
|
||||
nvidia/Llama-3.1-Nemotron-Nano-8B-v1:
|
||||
- accuracy: 57.97
|
||||
nvidia/Nemotron-H-8B-Base-8K:
|
||||
- accuracy: 69.590
|
||||
microsoft/Phi-4-mini-instruct:
|
||||
- accuracy: 68.98
|
||||
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1:
|
||||
- accuracy: 83.70
|
||||
- quant_algo: FP8
|
||||
kv_cache_quant_algo: FP8
|
||||
accuracy: 83.36
|
||||
|
||||
@ -200,6 +200,97 @@ class TestNemotronMini4BInstruct(CliFlowAccuracyTestHarness):
|
||||
self.run(quant_algo=QuantAlgo.FP8, kv_cache_quant_algo=QuantAlgo.FP8)
|
||||
|
||||
|
||||
# TODO: Remove the CLI tests once NIMs use PyTorch backend
|
||||
class TestLlama3_3NemotronSuper49Bv1(CliFlowAccuracyTestHarness):
|
||||
MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
|
||||
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1"
|
||||
EXAMPLE_FOLDER = "models/core/nemotron_nas"
|
||||
|
||||
@pytest.mark.skip_less_device(2)
|
||||
def test_auto_dtype_tp2(self):
|
||||
self.run(tasks=[MMLU(self.MODEL_NAME)], tp_size=2, dtype='auto')
|
||||
|
||||
@pytest.mark.skip(
|
||||
reason="nemotron-nas scripts have to accommodate fp8 flags")
|
||||
@pytest.mark.skip_less_device(2)
|
||||
@pytest.mark.skip_device_not_contain(["H100", "B200"])
|
||||
def test_fp8_prequantized_tp2(self, mocker):
|
||||
mocker.patch.object(
|
||||
self.__class__, "MODEL_PATH",
|
||||
f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8"
|
||||
)
|
||||
self.run(tasks=[MMLU(self.MODEL_NAME)],
|
||||
tp_size=2,
|
||||
quant_algo=QuantAlgo.FP8)
|
||||
|
||||
|
||||
class TestNemotronNano(CliFlowAccuracyTestHarness):
|
||||
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
|
||||
MODEL_PATH = f"{llm_models_root()}/Llama-3.1-Nemotron-Nano-8B-v1"
|
||||
EXAMPLE_FOLDER = "models/core/llama"
|
||||
|
||||
def test_auto_dtype(self):
|
||||
self.run(tasks=[MMLU(self.MODEL_NAME)], dtype='auto')
|
||||
|
||||
|
||||
class TestNemotronUltra(CliFlowAccuracyTestHarness):
|
||||
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"
|
||||
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1"
|
||||
EXAMPLE_FOLDER = "models/core/nemotron_nas"
|
||||
|
||||
@skip_pre_hopper
|
||||
@pytest.mark.skip_less_device(8)
|
||||
@pytest.mark.skip_device_not_contain(["H100", "B200"])
|
||||
@parametrize_with_ids("cuda_graph", [False, True])
|
||||
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
|
||||
(8, 1, 8)],
|
||||
ids=["tp8", "tp8ep4", "tp8ep8"])
|
||||
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
|
||||
extra_summarize_args = []
|
||||
if cuda_graph:
|
||||
extra_summarize_args.append("--cuda_graph_mode")
|
||||
|
||||
self.run(tasks=[MMLU(self.MODEL_NAME)],
|
||||
tp_size=tp_size,
|
||||
pp_size=pp_size,
|
||||
extra_convert_args=[
|
||||
f"--moe_tp_size={tp_size // ep_size}",
|
||||
f"--moe_ep_size={ep_size}", f"--moe_renorm_mode={0}"
|
||||
],
|
||||
extra_build_args=["--gemm_plugin=auto", "--moe_plugin=auto"],
|
||||
extra_summarize_args=extra_summarize_args)
|
||||
|
||||
@skip_pre_hopper
|
||||
@pytest.mark.skip_less_device(8)
|
||||
@pytest.mark.skip_device_not_contain(["H100", "B200"])
|
||||
@parametrize_with_ids("cuda_graph", [False, True])
|
||||
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
|
||||
(8, 1, 8)],
|
||||
ids=["tp8", "tp8ep4", "tp8ep8"])
|
||||
def test_fp8_prequantized(self, cuda_graph, tp_size, pp_size, ep_size,
|
||||
mocker):
|
||||
mocker.patch.object(
|
||||
self.__class__, "MODEL_PATH",
|
||||
f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1-FP8"
|
||||
)
|
||||
|
||||
extra_summarize_args = []
|
||||
if cuda_graph:
|
||||
extra_summarize_args.append("--cuda_graph_mode")
|
||||
|
||||
self.run(tasks=[MMLU(self.MODEL_NAME)],
|
||||
quant_algo=QuantAlgo.FP8,
|
||||
kv_cache_quant_algo=QuantAlgo.FP8,
|
||||
tp_size=tp_size,
|
||||
pp_size=pp_size,
|
||||
extra_convert_args=[
|
||||
f"--moe_tp_size={tp_size // ep_size}",
|
||||
f"--moe_ep_size={ep_size}", f"--moe_renorm_mode={0}"
|
||||
],
|
||||
extra_build_args=["--gemm_plugin=auto", "--moe_plugin=auto"],
|
||||
extra_summarize_args=extra_summarize_args)
|
||||
|
||||
|
||||
@skip_post_blackwell
|
||||
class TestPhi2(CliFlowAccuracyTestHarness):
|
||||
MODEL_NAME = "microsoft/phi-2"
|
||||
@ -847,9 +938,7 @@ class TestLlama3_3_70BInstruct(CliFlowAccuracyTestHarness):
|
||||
@pytest.mark.skip_device_not_contain(["B200"])
|
||||
def test_nvfp4_prequantized_tp4(self, mocker):
|
||||
mocker.patch.object(
|
||||
self.__class__,
|
||||
"MODEL_PATH",
|
||||
model_path=
|
||||
self.__class__, "MODEL_PATH",
|
||||
f"{llm_models_root()}/modelopt-hf-model-hub/Llama-3.3-70B-Instruct-fp4"
|
||||
)
|
||||
self.run(tasks=[MMLU(self.MODEL_NAME)],
|
||||
|
||||
@ -2,12 +2,12 @@
|
||||
# I need to do this by creating a new class that mimics the LLM class. Instead of implementing the
|
||||
# actual methods it will send OAI requests to the disaggregated serving endpoint.
|
||||
# Please take a look at the existing test_llm_api_pytorch.py file for reference.
|
||||
|
||||
import concurrent
|
||||
import contextlib
|
||||
import os
|
||||
import shutil
|
||||
import subprocess
|
||||
import tempfile
|
||||
import time
|
||||
from collections import namedtuple
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
@ -16,11 +16,12 @@ import pytest
|
||||
import requests
|
||||
import yaml
|
||||
|
||||
from tensorrt_llm._torch import LLM
|
||||
from tensorrt_llm.executor.result import GenerationResultBase
|
||||
from tensorrt_llm.llmapi import CompletionOutput, RequestOutput, SamplingParams
|
||||
from tensorrt_llm.llmapi.llm_args import LlmArgs
|
||||
|
||||
from ..conftest import llm_models_root
|
||||
from ..trt_test_alternative import popen
|
||||
from .accuracy_core import GSM8K, MMLU, LlmapiAccuracyTestHarness
|
||||
|
||||
|
||||
@ -40,76 +41,85 @@ class Result(GenerationResultBase):
|
||||
return self
|
||||
|
||||
|
||||
class OpenAIServerClient:
|
||||
DuckLLM = namedtuple('DuckLLM', ['args', 'generate_async'])
|
||||
|
||||
def __init__(self,
|
||||
disaggregated_server_config: Dict[str, Any],
|
||||
ctx_server_config: Dict[str, Any],
|
||||
gen_server_config: Dict[str, Any],
|
||||
model_name: str,
|
||||
tensor_parallel_size: int = 1):
|
||||
self.thread_pool = ThreadPoolExecutor(max_workers=16)
|
||||
self.temp_dir = tempfile.mkdtemp()
|
||||
self.futures = []
|
||||
self.disaggregated_serving_config_path = os.path.join(
|
||||
self.temp_dir, "disaggregated_serving_config.yaml")
|
||||
with open(self.disaggregated_serving_config_path, "w") as f:
|
||||
yaml.dump(disaggregated_server_config, f)
|
||||
ctx_server_config_path = os.path.join(self.temp_dir,
|
||||
"ctx_server_config.yaml")
|
||||
with open(ctx_server_config_path, "w") as f:
|
||||
yaml.dump(ctx_server_config, f)
|
||||
gen_server_config_path = os.path.join(self.temp_dir,
|
||||
"gen_server_config.yaml")
|
||||
with open(gen_server_config_path, "w") as f:
|
||||
yaml.dump(gen_server_config, f)
|
||||
|
||||
with LLM(model_name, tensor_parallel_size=tensor_parallel_size) as llm:
|
||||
self.args = llm.args
|
||||
class MyThreadPoolExecutor(ThreadPoolExecutor):
|
||||
|
||||
cuda_device_idx = 0
|
||||
cuda_devices = []
|
||||
for i in range(tensor_parallel_size):
|
||||
cuda_devices.append(f"{cuda_device_idx}")
|
||||
cuda_device_idx += 1
|
||||
def __init__(self, *args, **kwargs) -> None:
|
||||
super().__init__(*args, **kwargs)
|
||||
self.futures: list[concurrent.futures.Future[RequestOutput]] = []
|
||||
|
||||
trtllm_serve_path = "trtllm-serve"
|
||||
# Common arguments for both servers
|
||||
common_args = [
|
||||
trtllm_serve_path, model_name, "--host", "localhost", "--backend",
|
||||
"pytorch"
|
||||
]
|
||||
if tensor_parallel_size > 1:
|
||||
common_args.append(f"--tp_size={tensor_parallel_size}")
|
||||
env_ctx = os.environ.copy()
|
||||
env_ctx["TRTLLM_USE_UCX_KVCACHE"] = "1"
|
||||
env_ctx["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_devices)
|
||||
# Start the context server
|
||||
self._ctx_server = subprocess.Popen(common_args + [
|
||||
"--port", "8001", "--extra_llm_api_options", ctx_server_config_path
|
||||
],
|
||||
env=env_ctx)
|
||||
# Start the generation server
|
||||
env_gen = os.environ.copy()
|
||||
env_gen["TRTLLM_USE_UCX_KVCACHE"] = "1"
|
||||
cuda_devices = []
|
||||
for i in range(tensor_parallel_size):
|
||||
cuda_devices.append(f"{cuda_device_idx}")
|
||||
cuda_device_idx += 1
|
||||
env_gen["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_devices)
|
||||
self._gen_server = subprocess.Popen(common_args + [
|
||||
"--port", "8002", "--extra_llm_api_options", gen_server_config_path
|
||||
],
|
||||
env=env_gen)
|
||||
def __exit__(self, exc_type, exc_val, exc_tb):
|
||||
if exc_type is None:
|
||||
for future in self.futures:
|
||||
future.result()
|
||||
return super().__exit__(exc_type, exc_val, exc_tb)
|
||||
|
||||
# Start the disaggregated server
|
||||
self._disaggregated_server = subprocess.Popen([
|
||||
trtllm_serve_path, "disaggregated", "-c",
|
||||
self.disaggregated_serving_config_path, "--server_start_timeout",
|
||||
"3600"
|
||||
])
|
||||
self.model_name = model_name
|
||||
for future in self.futures:
|
||||
future.cancel()
|
||||
self.shutdown(wait=False, cancel_futures=True)
|
||||
return False
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def launch_disaggregated_llm(disaggregated_server_config: Dict[str, Any],
|
||||
ctx_server_config: Dict[str, Any],
|
||||
gen_server_config: Dict[str, Any],
|
||||
model_name: str,
|
||||
tensor_parallel_size: int = 1):
|
||||
temp_dir = tempfile.TemporaryDirectory()
|
||||
disaggregated_serving_config_path = os.path.join(
|
||||
temp_dir.name, "disaggregated_serving_config.yaml")
|
||||
with open(disaggregated_serving_config_path, "w") as f:
|
||||
yaml.dump(disaggregated_server_config, f)
|
||||
ctx_server_config_path = os.path.join(temp_dir.name,
|
||||
"ctx_server_config.yaml")
|
||||
with open(ctx_server_config_path, "w") as f:
|
||||
yaml.dump(ctx_server_config, f)
|
||||
gen_server_config_path = os.path.join(temp_dir.name,
|
||||
"gen_server_config.yaml")
|
||||
with open(gen_server_config_path, "w") as f:
|
||||
yaml.dump(gen_server_config, f)
|
||||
|
||||
args = LlmArgs.from_kwargs(model=model_name,
|
||||
tensor_parallel_size=tensor_parallel_size)
|
||||
|
||||
trtllm_serve_path = "trtllm-serve"
|
||||
# Common arguments for both servers
|
||||
common_args = [
|
||||
trtllm_serve_path, model_name, "--host", "localhost", "--backend",
|
||||
"pytorch"
|
||||
]
|
||||
if tensor_parallel_size > 1:
|
||||
common_args.append(f"--tp_size={tensor_parallel_size}")
|
||||
|
||||
env_ctx = os.environ.copy()
|
||||
env_ctx["TRTLLM_USE_UCX_KVCACHE"] = "1"
|
||||
env_ctx["CUDA_VISIBLE_DEVICES"] = ",".join(
|
||||
map(str, range(tensor_parallel_size)))
|
||||
|
||||
env_gen = os.environ.copy()
|
||||
env_gen["TRTLLM_USE_UCX_KVCACHE"] = "1"
|
||||
env_gen["CUDA_VISIBLE_DEVICES"] = ",".join(
|
||||
map(str, range(tensor_parallel_size, 2 * tensor_parallel_size)))
|
||||
|
||||
with (MyThreadPoolExecutor(max_workers=16) as thread_pool, temp_dir,
|
||||
popen(common_args + [
|
||||
"--port", "8001", "--extra_llm_api_options",
|
||||
ctx_server_config_path
|
||||
],
|
||||
env=env_ctx) as ctx_server,
|
||||
popen(common_args + [
|
||||
"--port", "8002", "--extra_llm_api_options",
|
||||
gen_server_config_path
|
||||
],
|
||||
env=env_gen) as gen_server,
|
||||
popen([
|
||||
trtllm_serve_path, "disaggregated", "-c",
|
||||
disaggregated_serving_config_path, "--server_start_timeout",
|
||||
"3600"
|
||||
]) as disaggregated_server):
|
||||
while True:
|
||||
time.sleep(1)
|
||||
try:
|
||||
@ -120,54 +130,47 @@ class OpenAIServerClient:
|
||||
except requests.exceptions.ConnectionError:
|
||||
continue
|
||||
|
||||
self.client = openai.OpenAI(api_key="1234567890",
|
||||
base_url=f"http://localhost:8000/v1")
|
||||
client = openai.OpenAI(api_key="1234567890",
|
||||
base_url=f"http://localhost:8000/v1")
|
||||
|
||||
def send_request(self, prompt: str, sampling_params: SamplingParams):
|
||||
response = self.client.completions.create(
|
||||
model=self.model_name,
|
||||
prompt=prompt,
|
||||
stream=False,
|
||||
**({
|
||||
"max_tokens": sampling_params.max_tokens,
|
||||
"temperature": sampling_params.temperature,
|
||||
"top_p": sampling_params.top_p,
|
||||
"stop": sampling_params.stop,
|
||||
"seed": sampling_params.seed
|
||||
} if sampling_params else {}))
|
||||
result = Result(
|
||||
id=0,
|
||||
sampling_params=sampling_params,
|
||||
outputs=[CompletionOutput(text=response.choices[0].text, index=0)])
|
||||
requested_output = RequestOutput._from_generation_result(result,
|
||||
prompt=prompt)
|
||||
setattr(requested_output, "result", result.result)
|
||||
return requested_output
|
||||
def send_request(prompt: str, sampling_params: SamplingParams):
|
||||
response = client.completions.create(
|
||||
model=model_name,
|
||||
prompt=prompt,
|
||||
stream=False,
|
||||
**({
|
||||
"max_tokens": sampling_params.max_tokens,
|
||||
"temperature": sampling_params.temperature,
|
||||
"top_p": sampling_params.top_p,
|
||||
"stop": sampling_params.stop,
|
||||
"seed": sampling_params.seed
|
||||
} if sampling_params else {}))
|
||||
result = Result(id=0,
|
||||
sampling_params=sampling_params,
|
||||
outputs=[
|
||||
CompletionOutput(text=response.choices[0].text,
|
||||
index=0)
|
||||
])
|
||||
requested_output = RequestOutput._from_generation_result(
|
||||
result, prompt=prompt)
|
||||
setattr(requested_output, "result", result.result)
|
||||
return requested_output
|
||||
|
||||
def generate_async(self,
|
||||
prompt: str,
|
||||
sampling_params: Optional[SamplingParams] = None):
|
||||
future = self.thread_pool.submit(self.send_request, prompt,
|
||||
sampling_params)
|
||||
self.futures.append(future)
|
||||
return future
|
||||
def generate_async(prompt: str,
|
||||
sampling_params: Optional[SamplingParams] = None):
|
||||
future = thread_pool.submit(send_request, prompt, sampling_params)
|
||||
thread_pool.futures.append(future)
|
||||
return future
|
||||
|
||||
def __enter__(self):
|
||||
return self
|
||||
yield DuckLLM(args, generate_async)
|
||||
|
||||
def __exit__(self, exc_type, exc_value, traceback):
|
||||
shutil.rmtree(self.temp_dir)
|
||||
self._ctx_server.terminate()
|
||||
self._gen_server.terminate()
|
||||
self._disaggregated_server.terminate()
|
||||
ctx_server.terminate()
|
||||
gen_server.terminate()
|
||||
disaggregated_server.terminate()
|
||||
|
||||
self._ctx_server.wait()
|
||||
self._gen_server.wait()
|
||||
self._disaggregated_server.wait()
|
||||
|
||||
for future in self.futures:
|
||||
future.result()
|
||||
self.thread_pool.shutdown(wait=True)
|
||||
ctx_server.wait()
|
||||
gen_server.wait()
|
||||
disaggregated_server.wait()
|
||||
|
||||
|
||||
class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
|
||||
@ -201,12 +204,13 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
|
||||
"urls": ["localhost:8002"]
|
||||
}
|
||||
}
|
||||
with OpenAIServerClient(disaggregated_server_config, ctx_server_config,
|
||||
gen_server_config, self.MODEL_PATH) as client:
|
||||
with launch_disaggregated_llm(disaggregated_server_config,
|
||||
ctx_server_config, gen_server_config,
|
||||
self.MODEL_PATH) as llm:
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(client)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(client)
|
||||
task.evaluate(llm)
|
||||
|
||||
|
||||
class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
|
||||
@ -215,6 +219,7 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
|
||||
|
||||
@pytest.mark.parametrize("overlap_scheduler", [False, True])
|
||||
def test_auto_dtype(self, overlap_scheduler):
|
||||
pytest.skip("https://nvbugs/5297821")
|
||||
ctx_server_config = {
|
||||
"pytorch_backend_config": {
|
||||
"disable_overlap_scheduler": True
|
||||
@ -238,12 +243,12 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
|
||||
"urls": ["localhost:8002"]
|
||||
}
|
||||
}
|
||||
with OpenAIServerClient(disaggregated_server_config,
|
||||
ctx_server_config,
|
||||
gen_server_config,
|
||||
self.MODEL_PATH,
|
||||
tensor_parallel_size=4) as client:
|
||||
with launch_disaggregated_llm(disaggregated_server_config,
|
||||
ctx_server_config,
|
||||
gen_server_config,
|
||||
self.MODEL_PATH,
|
||||
tensor_parallel_size=4) as llm:
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(client)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(client)
|
||||
task.evaluate(llm)
|
||||
|
||||
@ -188,10 +188,11 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@pytest.mark.skip(reason="https://nvbugspro.nvidia.com/bug/5292517")
|
||||
@skip_pre_hopper
|
||||
def test_fp8_llm_decoder(self):
|
||||
def test_fp8_llm_sampler(self):
|
||||
model_path = f"{llm_models_root()}/llama-3.1-model/Llama-3.1-8B-Instruct-FP8"
|
||||
pytorch_config = PyTorchConfig(enable_trtllm_decoder=True)
|
||||
pytorch_config = PyTorchConfig(enable_trtllm_sampler=True)
|
||||
llm = LLM(model_path, pytorch_backend_config=pytorch_config)
|
||||
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
|
||||
|
||||
@ -207,6 +208,79 @@ class TestLlama3_1_8BInstruct(LlmapiAccuracyTestHarness):
|
||||
extra_acc_spec="temperature=0.8,top_p=0.95")
|
||||
|
||||
|
||||
class TestLlama3_2_1B(LlmapiAccuracyTestHarness):
|
||||
MODEL_NAME = "meta-llama/Llama-3.2-1B"
|
||||
MODEL_PATH = f"{llm_models_root()}/llama-3.2-models/Llama-3.2-1B"
|
||||
EXAMPLE_FOLDER = "models/core/llama"
|
||||
|
||||
def test_auto_dtype(self):
|
||||
with LLM(self.MODEL_PATH) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_post_blackwell
|
||||
def test_smooth_quant(self):
|
||||
quant_config = QuantConfig(
|
||||
QuantAlgo.W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN)
|
||||
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_post_blackwell
|
||||
def test_smooth_quant_ootb(self):
|
||||
quant_config = QuantConfig(QuantAlgo.W8A8_SQ_PER_CHANNEL)
|
||||
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_post_blackwell
|
||||
def test_int4_awq(self):
|
||||
quant_config = QuantConfig(QuantAlgo.W4A16_AWQ)
|
||||
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_post_blackwell
|
||||
def test_int4_awq_int8_kv_cache(self):
|
||||
quant_config = QuantConfig(QuantAlgo.W4A16_AWQ)
|
||||
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.INT8)
|
||||
with LLM(self.MODEL_PATH,
|
||||
quant_config=quant_config,
|
||||
kv_cache_config=kv_cache_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_pre_ada
|
||||
def test_fp8(self):
|
||||
quant_config = QuantConfig(QuantAlgo.FP8)
|
||||
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.FP8)
|
||||
with LLM(self.MODEL_PATH,
|
||||
quant_config=quant_config,
|
||||
kv_cache_config=kv_cache_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_pre_ada
|
||||
@pytest.mark.skip_less_device(2)
|
||||
def test_fp8_pp2(self):
|
||||
quant_config = QuantConfig(QuantAlgo.FP8)
|
||||
kv_cache_config = KvCacheConfig(quant_algo=QuantAlgo.FP8)
|
||||
with LLM(self.MODEL_PATH,
|
||||
pipeline_parallel_size=2,
|
||||
quant_config=quant_config,
|
||||
kv_cache_config=kv_cache_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
@skip_pre_ada
|
||||
@skip_post_blackwell
|
||||
def test_fp8_rowwise(self):
|
||||
quant_config = QuantConfig(QuantAlgo.FP8_PER_CHANNEL_PER_TOKEN)
|
||||
with LLM(self.MODEL_PATH, quant_config=quant_config) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
|
||||
class TestLlama3_3_70BInstruct(LlmapiAccuracyTestHarness):
|
||||
MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"
|
||||
|
||||
@ -924,7 +998,7 @@ class TestNemotronNas(LlmapiAccuracyTestHarness):
|
||||
|
||||
|
||||
@pytest.mark.skip_less_device_memory(80000)
|
||||
class TestNemotronSuper(LlmapiAccuracyTestHarness):
|
||||
class TestLlama3_3NemotronSuper49Bv1(LlmapiAccuracyTestHarness):
|
||||
MODEL_NAME = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"
|
||||
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1"
|
||||
|
||||
@ -939,6 +1013,20 @@ class TestNemotronSuper(LlmapiAccuracyTestHarness):
|
||||
task.evaluate(llm,
|
||||
extra_evaluator_kwargs=dict(apply_chat_template=True))
|
||||
|
||||
@pytest.mark.skip_less_device(2)
|
||||
@pytest.mark.skip_device_not_contain(["H100", "B200"])
|
||||
def test_fp8_prequantized_tp2(self):
|
||||
model_path = f"{llm_models_root()}/nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8"
|
||||
with LLM(model_path, tensor_parallel_size=2) as llm:
|
||||
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GPQADiamond(self.MODEL_NAME)
|
||||
task.evaluate(llm,
|
||||
extra_evaluator_kwargs=dict(apply_chat_template=True))
|
||||
|
||||
|
||||
class TestNemotronNano(LlmapiAccuracyTestHarness):
|
||||
MODEL_NAME = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
|
||||
@ -946,10 +1034,61 @@ class TestNemotronNano(LlmapiAccuracyTestHarness):
|
||||
|
||||
def test_auto_dtype(self):
|
||||
with LLM(self.MODEL_PATH) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GPQADiamond(self.MODEL_NAME)
|
||||
task.evaluate(llm,
|
||||
extra_evaluator_kwargs=dict(apply_chat_template=True))
|
||||
|
||||
|
||||
class TestNemotronUltra(LlmapiAccuracyTestHarness):
|
||||
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"
|
||||
MODEL_PATH = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1"
|
||||
|
||||
@pytest.mark.skip_less_device(8)
|
||||
@pytest.mark.skip_device_not_contain(["H100", "B200"])
|
||||
@parametrize_with_ids("cuda_graph", [False, True])
|
||||
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
|
||||
(8, 1, 8)],
|
||||
ids=["tp8", "tp8ep4", "tp8ep8"])
|
||||
def test_auto_dtype(self, cuda_graph, tp_size, pp_size, ep_size):
|
||||
with LLM(self.MODEL_PATH,
|
||||
tensor_parallel_size=tp_size,
|
||||
pipeline_parallel_size=pp_size,
|
||||
moe_expert_parallel_size=ep_size,
|
||||
use_cuda_graph=cuda_graph) as llm:
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GPQADiamond(self.MODEL_NAME)
|
||||
task.evaluate(llm,
|
||||
extra_evaluator_kwargs=dict(apply_chat_template=True))
|
||||
|
||||
@pytest.mark.skip_less_device(8)
|
||||
@pytest.mark.skip_device_not_contain(["H100", "B200"])
|
||||
@parametrize_with_ids("cuda_graph", [False, True])
|
||||
@pytest.mark.parametrize("tp_size,pp_size,ep_size", [(8, 1, 1), (8, 1, 4),
|
||||
(8, 1, 8)],
|
||||
ids=["tp8", "tp8ep4", "tp8ep8"])
|
||||
def test_fp8_prequantized(self, cuda_graph, tp_size, pp_size, ep_size):
|
||||
model_path = f"{llm_models_root()}/nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1-FP8"
|
||||
with LLM(model_path,
|
||||
tensor_parallel_size=tp_size,
|
||||
pipeline_parallel_size=pp_size,
|
||||
moe_expert_parallel_size=ep_size,
|
||||
use_cuda_graph=cuda_graph) as llm:
|
||||
assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
|
||||
assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GPQADiamond(self.MODEL_NAME)
|
||||
task.evaluate(llm,
|
||||
extra_evaluator_kwargs=dict(apply_chat_template=True))
|
||||
|
||||
|
||||
class TestNemotronH(LlmapiAccuracyTestHarness):
|
||||
@ -1185,3 +1324,24 @@ class TestQwen3_235B_A22B(LlmapiAccuracyTestHarness):
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
|
||||
|
||||
class TestPhi4MiniInstruct(LlmapiAccuracyTestHarness):
|
||||
MODEL_NAME = "microsoft/Phi-4-mini-instruct"
|
||||
MODEL_PATH = f"{llm_models_root()}/Phi-4-mini-instruct"
|
||||
|
||||
@pytest.mark.skip(
|
||||
reason=
|
||||
"Temporarily skipping test_auto_dtype while resolving Phi-4's architecture issue."
|
||||
)
|
||||
def test_auto_dtype(self):
|
||||
with LLM(self.MODEL_PATH) as llm:
|
||||
task = CnnDailymail(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = MMLU(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GSM8K(self.MODEL_NAME)
|
||||
task.evaluate(llm)
|
||||
task = GPQADiamond(self.MODEL_NAME)
|
||||
task.evaluate(llm,
|
||||
extra_evaluator_kwargs=dict(apply_chat_template=True))
|
||||
|
||||
@ -24,23 +24,23 @@ from packaging import version
|
||||
from .trt_test_alternative import check_call, check_output, exists, is_windows
|
||||
|
||||
|
||||
def venv_check_call(venv, cmd, running_log=None, env=None):
|
||||
def venv_check_call(venv, cmd, env=None, **kwargs):
|
||||
|
||||
def _war_check_call(*args, **kwargs):
|
||||
kwargs["cwd"] = venv.get_working_directory()
|
||||
return check_call(*args, **kwargs)
|
||||
|
||||
venv.run_cmd(cmd, caller=_war_check_call, running_log=running_log, env=env)
|
||||
venv.run_cmd(cmd, caller=_war_check_call, env=env, **kwargs)
|
||||
|
||||
|
||||
def venv_check_output(venv, cmd):
|
||||
def venv_check_output(venv, cmd, env=None, **kwargs):
|
||||
|
||||
def _war_check_output(*args, **kwargs):
|
||||
kwargs["cwd"] = venv.get_working_directory()
|
||||
output = check_output(*args, **kwargs)
|
||||
return output
|
||||
|
||||
return venv.run_cmd(cmd, caller=_war_check_output)
|
||||
return venv.run_cmd(cmd, caller=_war_check_output, env=env, **kwargs)
|
||||
|
||||
|
||||
def venv_mpi_check_call(venv, mpi_cmd, python_cmd):
|
||||
|
||||
@ -22,6 +22,7 @@ import subprocess as sp
|
||||
import tempfile
|
||||
import time
|
||||
import urllib.request
|
||||
import warnings
|
||||
from functools import wraps
|
||||
from pathlib import Path
|
||||
from typing import Iterable, Sequence
|
||||
@ -2196,8 +2197,10 @@ def skip_by_host_memory(request):
|
||||
|
||||
IS_UNDER_CI_ENV = 'JENKINS_HOME' in os.environ
|
||||
|
||||
gpu_warning_threshold = 1024 * 1024 * 1024
|
||||
|
||||
def collect_status():
|
||||
|
||||
def collect_status(item: pytest.Item):
|
||||
if not IS_UNDER_CI_ENV:
|
||||
return
|
||||
|
||||
@ -2210,6 +2213,22 @@ def collect_status():
|
||||
for idx in range(pynvml.nvmlDeviceGetCount())
|
||||
}
|
||||
|
||||
deadline = time.perf_counter() + 60 # 1 min
|
||||
observed_used = 0
|
||||
global gpu_warning_threshold
|
||||
|
||||
while time.perf_counter() < deadline:
|
||||
observed_used = max(
|
||||
pynvml.nvmlDeviceGetMemoryInfo(device).used
|
||||
for device in handles.values())
|
||||
if observed_used <= gpu_warning_threshold:
|
||||
break
|
||||
time.sleep(1)
|
||||
else:
|
||||
gpu_warning_threshold = max(observed_used, gpu_warning_threshold)
|
||||
warnings.warn(
|
||||
f"Test {item.name} does not free up GPU memory correctly!")
|
||||
|
||||
gpu_memory = {}
|
||||
for idx, device in handles.items():
|
||||
total_used = pynvml.nvmlDeviceGetMemoryInfo(device).used // 1024 // 1024
|
||||
@ -2218,13 +2237,12 @@ def collect_status():
|
||||
process = {}
|
||||
|
||||
for entry in detail:
|
||||
host_memory_in_mbs = -1
|
||||
try:
|
||||
host_memory_in_mbs = psutil.Process(
|
||||
entry.pid).memory_full_info().uss // 1024 // 1024
|
||||
p = psutil.Process(entry.pid)
|
||||
host_memory_in_mbs = p.memory_full_info().uss // 1024 // 1024
|
||||
process[entry.pid] = (entry.usedGpuMemory // 1024 // 1024,
|
||||
host_memory_in_mbs)
|
||||
except:
|
||||
host_memory_in_mbs, p.cmdline())
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
gpu_memory[idx] = {
|
||||
@ -2239,7 +2257,7 @@ def collect_status():
|
||||
@pytest.hookimpl(wrapper=True)
|
||||
def pytest_runtest_protocol(item, nextitem):
|
||||
ret = yield
|
||||
collect_status()
|
||||
collect_status(item)
|
||||
return ret
|
||||
|
||||
|
||||
|
||||
@ -18,14 +18,7 @@ import subprocess

import pytest
from defs.conftest import skip_no_hopper


def kill_disaggregated_processes():
"""Kill any existing disaggregated processes."""
try:
subprocess.run(['pkill', '-9', '-f', 'trtllm-serve'], check=False)
except Exception:
pass
from defs.trt_test_alternative import check_call, popen


def cleanup_output_files():
@ -120,93 +113,92 @@ def run_disaggregated_test(example_dir,
env=None,
cwd=None):
"""Run disaggregated test with given configuration."""
kill_disaggregated_processes()
cleanup_output_files()

num_ranks, config_file = get_test_config(test_desc, example_dir,
os.path.dirname(__file__))

# Start workers
workers_cmd = [
'mpirun', '--allow-run-as-root', '--oversubscribe', '-n',
str(num_ranks), 'trtllm-serve', 'disaggregated_mpi_worker', '-c',
config_file
]
with open('output_workers.log', 'w') as f:
workers_proc = subprocess.Popen(workers_cmd,
stdout=f,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)

server_start_timeout = 900
# Start server
server_cmd = [
'trtllm-serve', 'disaggregated', '--server_start_timeout',
str(server_start_timeout), '-c', config_file
]
with open('output_disagg.log', 'w') as f:
server_proc = subprocess.Popen(server_cmd,
stdout=f,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)

client_dir = f"{example_dir}/clients"
for _ in range(num_iters):
client_cmd = [
'python3', f'{client_dir}/disagg_client.py', '-c',
f'{example_dir}/disagg_config.yaml', '-p',
f'{client_dir}/prompts.json', '--ignore-eos',
'--server-start-timeout',
str(server_start_timeout)
]
subprocess.run(client_cmd, check=True, env=env)

# Streaming client run
streaming_client_cmd = client_cmd + [
'--streaming', '-o', 'output_streaming.json'
]
subprocess.run(streaming_client_cmd, check=True, env=env)

# Run the chat completion endpoint test only for TinyLlama
if test_desc == "overlap":
chat_client_cmd = client_cmd + [
'-e', 'chat', '-o', 'output_chat.json'
with (  # Start workers
open('output_workers.log', 'w') as output_workers,
popen(workers_cmd,
stdout=output_workers,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd),
# Start server
open('output_disagg.log', 'w') as output_disagg,
popen(server_cmd,
stdout=output_disagg,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)):
client_dir = f"{example_dir}/clients"
for _ in range(num_iters):
client_cmd = [
'python3', f'{client_dir}/disagg_client.py', '-c',
f'{example_dir}/disagg_config.yaml', '-p',
f'{client_dir}/prompts.json', '--ignore-eos',
'--server-start-timeout',
str(server_start_timeout)
]
subprocess.run(chat_client_cmd, check=True, env=env)
check_call(client_cmd, env=env)

streaming_chat_client_cmd = chat_client_cmd + [
'--streaming', '-o', 'output_streaming_chat.json'
# Streaming client run
streaming_client_cmd = client_cmd + [
'--streaming', '-o', 'output_streaming.json'
]
subprocess.run(streaming_chat_client_cmd, check=True, env=env)
check_call(streaming_client_cmd, env=env)

# Verify outputs
not_expected_strings = ["Berlin Berlin"]
# Run the chat completion endpoint test only for TinyLlama
if test_desc == "overlap":
chat_client_cmd = client_cmd + [
'-e', 'chat', '-o', 'output_chat.json'
]
check_call(chat_client_cmd, env=env)

output_files = ['output.json', 'output_streaming.json']
if test_desc == "overlap":
# Disable streaming chat completion for overlap test
# due to bug
output_files.extend(['output_chat.json'])
streaming_chat_client_cmd = chat_client_cmd + [
'--streaming', '-o', 'output_streaming_chat.json'
]
check_call(streaming_chat_client_cmd, env=env)

if test_desc.startswith("gen_only"):
continue
# Verify outputs
not_expected_strings = ["Berlin Berlin"]

for output_file in output_files:
with open(output_file, 'r') as f:
content = f.read()
if "deepseek_v3_lite" in test_desc or output_file == "output_chat.json":
expected_strings = ["Berlin", "Asyncio is a"]
else:
expected_strings = [
"The capital of Germany is Berlin",
"Asyncio is a Python library"
]
for expected_string in expected_strings:
assert expected_string in content, f"Expected string '{expected_string}' not found in {output_file}"
for not_expected_string in not_expected_strings:
assert not_expected_string not in content, f"Unexpected string '{not_expected_string}' found in {output_file}"
output_files = ['output.json', 'output_streaming.json']
if test_desc == "overlap":
# Disable streaming chat completion for overlap test
# due to bug
output_files.extend(['output_chat.json'])

if test_desc.startswith("gen_only"):
continue

for output_file in output_files:
with open(output_file, 'r') as f:
content = f.read()
if "deepseek_v3_lite" in test_desc or output_file == "output_chat.json":
expected_strings = ["Berlin", "Asyncio is a"]
else:
expected_strings = [
"The capital of Germany is Berlin",
"Asyncio is a Python library"
]
for expected_string in expected_strings:
assert expected_string in content, f"Expected string '{expected_string}' not found in {output_file}"
for not_expected_string in not_expected_strings:
assert not_expected_string not in content, f"Unexpected string '{not_expected_string}' found in {output_file}"

# Print outputs
print("------------------")
@ -221,8 +213,6 @@ def run_disaggregated_test(example_dir,
with open('output_disagg.log', 'r') as f:
print(f.read())

kill_disaggregated_processes()


@pytest.mark.parametrize("llama_model_root", ['TinyLlama-1.1B-Chat-v1.0'],
indirect=True)

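Note: the rewrite above drops the pkill-based cleanup and instead keeps the MPI workers and the disaggregated server inside context managers, so both are torn down even when a client step fails. A rough sketch of that structure, where popen below is a simplified stand-in for defs.trt_test_alternative.popen and the command lists are placeholders:

import subprocess
from contextlib import contextmanager

@contextmanager
def popen(cmd, **kwargs):
    # Stand-in for defs.trt_test_alternative.popen: terminate the child on exit.
    proc = subprocess.Popen(cmd, **kwargs)
    try:
        yield proc
    finally:
        proc.terminate()
        proc.wait()

def run_pair(workers_cmd, server_cmd, client_cmd, env=None):
    with (open('output_workers.log', 'w') as workers_log,
          popen(workers_cmd, stdout=workers_log,
                stderr=subprocess.STDOUT, env=env),
          open('output_disagg.log', 'w') as server_log,
          popen(server_cmd, stdout=server_log,
                stderr=subprocess.STDOUT, env=env)):
        # Any failure here unwinds both context managers, stopping both processes.
        subprocess.run(client_cmd, check=True, env=env)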
@ -9,6 +9,7 @@ from typing import List, Optional, Tuple
import aiohttp
import pytest
import yaml
from defs.trt_test_alternative import popen
from transformers import AutoTokenizer

from tensorrt_llm import logger
@ -53,11 +54,11 @@ def run_disaggregated_workers(
config_file
]
logger.info(f"Running workers with command: {' '.join(workers_cmd)}")
workers_proc = subprocess.Popen(workers_cmd,
stdout=stdout,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
workers_proc = popen(workers_cmd,
stdout=stdout,
stderr=subprocess.STDOUT,
env=env,
cwd=cwd)
return workers_proc, ctx_servers, gen_servers


@ -500,19 +501,18 @@ def load_default_prompts(disaggregated_example_root: str):
@contextlib.contextmanager
def background_workers(llm_venv, config_file: str, num_ranks: int = None):
cwd = llm_venv.get_working_directory()
log_file = open(os.path.join(cwd, 'output_workers.log'), 'w')
workers_proc, ctx_servers, gen_servers = run_disaggregated_workers(
config_file=config_file,
stdout=log_file,
env=llm_venv._new_env,
cwd=cwd,
num_ranks=num_ranks)
try:
yield ctx_servers, gen_servers
finally:
workers_proc.terminate()
workers_proc.wait()
log_file.close()

with open(os.path.join(cwd, 'output_workers.log'), 'w') as log_file:
workers_proc, ctx_servers, gen_servers = run_disaggregated_workers(
config_file=config_file,
stdout=log_file,
env=llm_venv._new_env,
cwd=cwd,
num_ranks=num_ranks)
with workers_proc as proc:
yield ctx_servers, gen_servers
proc.terminate()
proc.wait()


@pytest.mark.parametrize("llama_model_root", ['TinyLlama-1.1B-Chat-v1.0'],

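Note: background_workers now keeps both the log file handle and the worker process inside with-blocks, so they are released even if the caller's body raises. A condensed sketch of the pattern, with run_disaggregated_workers reduced to a plain popen call for illustration:

import contextlib
import os
import subprocess

from defs.trt_test_alternative import popen  # same helper as in the hunks above

@contextlib.contextmanager
def background_workers(cwd, workers_cmd, env=None):
    with open(os.path.join(cwd, 'output_workers.log'), 'w') as log_file:
        # popen() is itself a context manager; entering it starts the ranks.
        with popen(workers_cmd, stdout=log_file,
                   stderr=subprocess.STDOUT, env=env, cwd=cwd) as proc:
            yield proc
            # Normal exit path: stop the workers before the log file is closed.
            proc.terminate()
            proc.wait()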
@ -741,7 +741,7 @@ def test_trtllm_bench_pytorch_backend_sanity(llm_root, llm_venv,
|
||||
dir="./",
|
||||
delete=True,
|
||||
delete_on_close=True) as running_log:
|
||||
check_call(benchmark_cmd, shell=True, running_log=running_log)
|
||||
check_call(benchmark_cmd, shell=True, stdout=running_log)
|
||||
if model_id in mapping and not use_extra_config:
|
||||
# extra config defines max kv cache tokens number to be 40000 which makes the checking
|
||||
# the checking process not unified.
|
||||
@ -775,7 +775,7 @@ def test_trtllm_bench_mgmn(llm_root, llm_venv):
|
||||
delete_on_close=True) as running_log:
|
||||
check_call(benchmark_cmd,
|
||||
shell=True,
|
||||
running_log=running_log,
|
||||
stdout=running_log,
|
||||
env=llm_venv._new_env)
|
||||
_check_mem_usage(running_log, [30, 0, 0, 0])
|
||||
|
||||
@ -928,7 +928,7 @@ def test_trtllm_bench_iteration_log(llm_root, llm_venv, model_name,
|
||||
dir="./",
|
||||
delete=True,
|
||||
delete_on_close=True) as running_log:
|
||||
check_call(benchmark_cmd, shell=True, running_log=running_log)
|
||||
check_call(benchmark_cmd, shell=True, stdout=running_log)
|
||||
_check_mem_usage(running_log, [19.4, 0, 0, 0])
|
||||
else:
|
||||
check_call(benchmark_cmd, shell=True)
|
||||
@ -1454,7 +1454,7 @@ def test_ptp_quickstart(llm_root, llm_venv):
|
||||
delete=True,
|
||||
delete_on_close=True) as running_log:
|
||||
venv_check_call(llm_venv, [str(example_root / "quickstart.py")],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
_check_mem_usage(running_log, [4.60, 0, 0, 0])
|
||||
|
||||
|
||||
@ -1476,6 +1476,9 @@ def test_ptp_quickstart(llm_root, llm_venv):
|
||||
pytest.param('Llama3.1-70B-FP8',
|
||||
'llama-3.1-model/Llama-3.1-70B-Instruct-FP8',
|
||||
marks=skip_pre_hopper),
|
||||
pytest.param('Nemotron-Super-49B-v1-NVFP4',
|
||||
'nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf',
|
||||
marks=skip_pre_hopper),
|
||||
pytest.param('Nemotron-Super-49B-v1-FP8',
|
||||
'nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8',
|
||||
marks=skip_pre_hopper),
|
||||
@ -1517,7 +1520,7 @@ def test_ptp_quickstart_advanced(llm_root, llm_venv, model_name, model_path):
|
||||
]
|
||||
if "Qwen3" in model_name:
|
||||
cmds.append(f"--kv_cache_fraction=0.6")
|
||||
llm_venv.run_cmd(cmds, running_log=running_log)
|
||||
llm_venv.run_cmd(cmds, stdout=running_log)
|
||||
if model_name in mapping:
|
||||
_check_mem_usage(running_log, [mapping[model_name], 0, 0, 0])
|
||||
|
||||
@ -1545,7 +1548,7 @@ def test_ptq_quickstart_advanced_mtp(llm_root, llm_venv, model_name,
|
||||
"--model_dir",
|
||||
f"{llm_models_root()}/{model_path}",
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
_check_mem_usage(running_log, [54.50, 0, 0, 0])
|
||||
|
||||
|
||||
@ -1601,7 +1604,7 @@ def test_ptp_quickstart_advanced_eagle3(llm_root, llm_venv, model_name,
|
||||
"--disable_kv_cache_reuse",
|
||||
"--disable_overlap_scheduler",
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
_check_mem_usage(running_log, [25.2, 0, 0, 0])
|
||||
|
||||
|
||||
@ -1635,7 +1638,7 @@ def test_ptp_quickstart_advanced_deepseek_r1_8gpus(llm_root, llm_venv,
|
||||
"--max_seq_len=3000",
|
||||
"--disable_kv_cache_reuse",
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
_check_mem_usage(running_log, [106.3, 0, 0, 0], 8)
|
||||
|
||||
|
||||
@ -1675,7 +1678,7 @@ def test_relaxed_acceptance_quickstart_advanced_deepseek_r1_8gpus(
|
||||
"--relaxed_topk=10",
|
||||
"--relaxed_delta=0.5",
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
_check_mem_usage(running_log, [85.6, 0, 0, 0], 8)
|
||||
# TODO: relaxed acceptance is incompatible with attention dp
|
||||
# "--enable_attention_dp"
|
||||
@ -1725,7 +1728,7 @@ def test_ptp_quickstart_advanced_8gpus(llm_root, llm_venv, model_name,
|
||||
f"{llm_models_root()}/{model_path}",
|
||||
"--tp_size=8",
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
if model_name in mapping:
|
||||
_check_mem_usage(running_log, [mapping[model_name], 0, 0, 0], 8)
|
||||
|
||||
@ -1768,7 +1771,7 @@ def test_ptp_quickstart_advanced_mixed_precision(llm_root, llm_venv):
|
||||
"--model_dir",
|
||||
f"{llm_models_root()}/{model_path}",
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
_check_mem_usage(running_log, [12.0, 0, 0, 0])
|
||||
|
||||
|
||||
@ -1959,7 +1962,7 @@ def test_ptp_quickstart_multimodal(llm_root, llm_venv, model_name, model_path,
|
||||
"--media",
|
||||
*functionality_inputs[modality]["media"],
|
||||
],
|
||||
running_log=running_log)
|
||||
stdout=running_log)
|
||||
|
||||
if model_name in mapping:
|
||||
peak, fraction = mapping[model_name]
|
||||
|
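Note: the test_e2e.py hunks above swap the suite-specific running_log= keyword for a plain stdout= handle, while the call sites keep using a NamedTemporaryFile so memory-usage checks can re-read the captured log. A minimal sketch of that call pattern; the command string and mode argument are placeholders, and check_call is the helper from defs.trt_test_alternative:

import tempfile

from defs.trt_test_alternative import check_call

benchmark_cmd = "trtllm-bench --help"  # placeholder command for illustration

with tempfile.NamedTemporaryFile(mode='w+t',
                                 suffix=".log",
                                 dir="./",
                                 delete=True,
                                 delete_on_close=True) as running_log:
    # stdout= is forwarded to Popen, so the subprocess output lands in the tempfile.
    check_call(benchmark_cmd, shell=True, stdout=running_log)
    running_log.seek(0)  # rewind before parsing the captured output
    print(running_log.read())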
||||
@ -894,7 +894,8 @@ def prepare_gpt_2b_lora_engine(type, tensorrt_llm_gpt_example_root,
return engine_dir


def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root):
# Build GPT
if type == "python_backend":
engine_dir = os.path.join(tensorrt_llm_gpt_example_root, "engine_dir",
@ -904,8 +905,7 @@ def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
"gpt_175b_ifb")

convert_cmd = [
"python3",
f"{tensorrt_llm_gpt_example_root}/../generate_checkpoint_config.py",
"python3", f"{tensorrt_llm_example_root}/generate_checkpoint_config.py",
f"--output_path={engine_dir}/ckpt_config.json",
"--architecture=GPTForCausalLM", "--dtype=float16",
"--num_hidden_layers=96", "--num_attention_heads=96",
@ -948,7 +948,8 @@ def prepare_gpt_175b_engine(type, tensorrt_llm_gpt_example_root):
return engine_dir


def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root):
def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root,
tensorrt_llm_example_root):
# Build GPT
if type == "python_backend":
engine_dir = os.path.join(tensorrt_llm_gpt_example_root, "engine_dir",
@ -958,8 +959,7 @@ def prepare_gpt_multi_node_engine(type, tensorrt_llm_gpt_example_root):
"gpt_multi_node_ifb")

convert_cmd = [
"python3",
f"{tensorrt_llm_gpt_example_root}/../generate_checkpoint_config.py",
"python3", f"{tensorrt_llm_example_root}/generate_checkpoint_config.py",
f"--output_path={engine_dir}/ckpt_config.json",
"--architecture=GPTForCausalLM", "--dtype=float16",
"--num_hidden_layers=96", "--num_attention_heads=96",
@ -1111,7 +1111,8 @@ def prepare_llama_v2_13b_engine(tensorrt_llm_llama_example_root,
return engine_dir


def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
def prepare_llama_v3_8b_engine(tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_8b_model_root,
workers=8,
data_type="bfloat16"):
@ -1133,7 +1134,7 @@ def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,
elif data_type == "fp8":
convert_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={llama_v3_8b_model_root}",
"--dtype=float16",
"--qformat=fp8",
@ -1186,6 +1187,7 @@ def prepare_llama_v3_8b_engine(tensorrt_llm_llama_example_root,


def prepare_llama_v3_70b_engine(type,
tensorrt_llm_example_root,
tensorrt_llm_llama_example_root,
llama_v3_70b_model_root,
data_type="bfloat16"):
@ -1211,7 +1213,7 @@ def prepare_llama_v3_70b_engine(type,
elif data_type == "fp8":
convert_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={llama_v3_70b_model_root}",
"--dtype=float16",
"--qformat=fp8",
@ -1707,7 +1709,8 @@ def prepare_tiny_llama_1b_engine(type, tensorrt_llm_llama_example_root,
return engine_dir, xgrammar_tokenizer_info_path


def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_mixtral_example_root,
def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_example_root,
tensorrt_llm_mixtral_example_root,
mixtral_8x7b_v0_1_model_root,
llm_backend_root):
engine_dir = os.path.join(tensorrt_llm_mixtral_example_root, "engine_dir",
@ -1718,7 +1721,7 @@ def prepare_rcca_nvbug_4714193_engine(tensorrt_llm_mixtral_example_root,
# Quantize model
quantize_cmd = [
"python3",
"../quantization/quantize.py",
f"{tensorrt_llm_example_root}/quantization/quantize.py",
f"--model_dir={mixtral_8x7b_v0_1_model_root}",
"--dtype=float16",
"--qformat=fp8",

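Note: the engine-preparation helpers above stop relying on a cwd-relative ../quantization/quantize.py path and build the command from the examples root fixture instead. A sketch of the resulting command construction; all paths and the --output_dir flag are placeholders, since only the flags visible in the hunks above are confirmed:

# Placeholder values; in the real fixtures these come from pytest fixtures.
tensorrt_llm_example_root = "/workspace/TensorRT-LLM/examples"
model_root = "/models/Mixtral-8x7B-v0.1"
engine_dir = "/tmp/engine_dir"

quantize_cmd = [
    "python3",
    f"{tensorrt_llm_example_root}/quantization/quantize.py",  # absolute, cwd-independent
    f"--model_dir={model_root}",
    "--dtype=float16",
    "--qformat=fp8",
    f"--output_dir={engine_dir}",  # assumed flag, not shown in the hunk above
]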
@ -0,0 +1,394 @@
|
||||
#!/usr/bin/env python
|
||||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions
|
||||
# are met:
|
||||
# * Redistributions of source code must retain the above copyright
|
||||
# notice, this list of conditions and the following disclaimer.
|
||||
# * Redistributions in binary form must reproduce the above copyright
|
||||
# notice, this list of conditions and the following disclaimer in the
|
||||
# documentation and/or other materials provided with the distribution.
|
||||
# * Neither the name of NVIDIA CORPORATION nor the names of its
|
||||
# contributors may be used to endorse or promote products derived
|
||||
# from this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
|
||||
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
|
||||
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
|
||||
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
|
||||
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
|
||||
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
|
||||
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
|
||||
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
||||
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
|
||||
import argparse
|
||||
import queue
|
||||
import sys
|
||||
import time
|
||||
from functools import partial
|
||||
|
||||
import numpy as np
|
||||
import tritonclient.grpc as grpcclient
|
||||
from tritonclient.utils import InferenceServerException
|
||||
|
||||
#
|
||||
# Simple streaming client for TRT-LLM inflight bacthing backend
|
||||
#
|
||||
# In order for this code to work properly, config.pbtxt must contain these values:
|
||||
#
|
||||
# model_transaction_policy {
|
||||
# decoupled: True
|
||||
# }
|
||||
#
|
||||
# parameters: {
|
||||
# key: "gpt_model_type"
|
||||
# value: {
|
||||
# string_value: "inflight_batching"
|
||||
# }
|
||||
# }
|
||||
#
|
||||
# In order for gpt_model_type 'inflight_batching' to work, you must copy engine from
|
||||
#
|
||||
# tensorrt_llm/cpp/tests/resources/models/rt_engine/gpt2/fp16-inflight-batching-plugin/1-gpu/
|
||||
#
|
||||
|
||||
|
||||
class UserData:
|
||||
|
||||
def __init__(self):
|
||||
self._completed_requests = queue.Queue()
|
||||
|
||||
|
||||
def prepare_inputs(input_ids_data, input_lengths_data, request_output_len_data,
|
||||
beam_width_data, temperature_data, streaming_data, end_id):
|
||||
|
||||
inputs = [
|
||||
grpcclient.InferInput('input_ids', [1, 12], "INT32"),
|
||||
grpcclient.InferInput('input_lengths', [1, 1], "INT32"),
|
||||
grpcclient.InferInput('request_output_len', [1, 1], "UINT32"),
|
||||
grpcclient.InferInput('beam_width', [1, 1], "UINT32"),
|
||||
grpcclient.InferInput('temperature', [1, 1], "FP32"),
|
||||
grpcclient.InferInput('streaming', [1, 1], "BOOL"),
|
||||
grpcclient.InferInput('end_id', [1, 1], "UINT32"),
|
||||
]
|
||||
|
||||
inputs[0].set_data_from_numpy(input_ids_data)
|
||||
inputs[1].set_data_from_numpy(input_lengths_data)
|
||||
inputs[2].set_data_from_numpy(request_output_len_data)
|
||||
inputs[3].set_data_from_numpy(beam_width_data)
|
||||
inputs[4].set_data_from_numpy(temperature_data)
|
||||
inputs[5].set_data_from_numpy(streaming_data)
|
||||
inputs[6].set_data_from_numpy(end_id)
|
||||
|
||||
return inputs
|
||||
|
||||
|
||||
def prepare_stop_signals():
|
||||
|
||||
inputs = [
|
||||
grpcclient.InferInput('input_ids', [1, 1], "INT32"),
|
||||
grpcclient.InferInput('input_lengths', [1, 1], "INT32"),
|
||||
grpcclient.InferInput('request_output_len', [1, 1], "UINT32"),
|
||||
grpcclient.InferInput('stop', [1, 1], "BOOL"),
|
||||
]
|
||||
|
||||
inputs[0].set_data_from_numpy(np.empty([1, 1], dtype=np.int32))
|
||||
inputs[1].set_data_from_numpy(np.zeros([1, 1], dtype=np.int32))
|
||||
inputs[2].set_data_from_numpy(np.array([[0]], dtype=np.uint32))
|
||||
inputs[3].set_data_from_numpy(np.array([[True]], dtype='bool'))
|
||||
|
||||
return inputs
|
||||
|
||||
|
||||
# Define the callback function. Note the last two parameters should be
|
||||
# result and error. InferenceServerClient would povide the results of an
|
||||
# inference as grpcclient.InferResult in result. For successful
|
||||
# inference, error will be None, otherwise it will be an object of
|
||||
# tritonclientutils.InferenceServerException holding the error details
|
||||
def callback(user_data, result, error):
|
||||
if error:
|
||||
user_data._completed_requests.put(error)
|
||||
else:
|
||||
user_data._completed_requests.put(result)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"-v",
|
||||
"--verbose",
|
||||
action="store_true",
|
||||
required=False,
|
||||
default=False,
|
||||
help="Enable verbose output",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-u",
|
||||
"--url",
|
||||
type=str,
|
||||
required=False,
|
||||
default="localhost:8001",
|
||||
help="Inference server URL. Default is localhost:8001.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-s",
|
||||
"--ssl",
|
||||
action="store_true",
|
||||
required=False,
|
||||
default=False,
|
||||
help="Enable SSL encrypted channel to the server",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-t",
|
||||
"--stream-timeout",
|
||||
type=float,
|
||||
required=False,
|
||||
default=None,
|
||||
help="Stream timeout in seconds. Default is None.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-r",
|
||||
"--root-certificates",
|
||||
type=str,
|
||||
required=False,
|
||||
default=None,
|
||||
help="File holding PEM-encoded root certificates. Default is None.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-p",
|
||||
"--private-key",
|
||||
type=str,
|
||||
required=False,
|
||||
default=None,
|
||||
help="File holding PEM-encoded private key. Default is None.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-x",
|
||||
"--certificate-chain",
|
||||
type=str,
|
||||
required=False,
|
||||
default=None,
|
||||
help="File holding PEM-encoded certificate chain. Default is None.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-C",
|
||||
"--grpc-compression-algorithm",
|
||||
type=str,
|
||||
required=False,
|
||||
default=None,
|
||||
help=
|
||||
"The compression algorithm to be used when sending request to server. Default is None.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-S",
|
||||
"--streaming",
|
||||
action="store_true",
|
||||
required=False,
|
||||
default=False,
|
||||
help="Enable streaming mode. Default is False.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"-c",
|
||||
"--check-output",
|
||||
action="store_true",
|
||||
required=False,
|
||||
default=False,
|
||||
help="Enable check of output ids for CI",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-b",
|
||||
"--beam-width",
|
||||
required=False,
|
||||
type=int,
|
||||
default=1,
|
||||
help="Beam width value",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--temperature",
|
||||
type=float,
|
||||
required=False,
|
||||
default=1.0,
|
||||
help="temperature value",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--request-output-len",
|
||||
type=int,
|
||||
required=False,
|
||||
default=16,
|
||||
help="temperature value",
|
||||
)
|
||||
parser.add_argument(
|
||||
'--stop-after-ms',
|
||||
type=int,
|
||||
required=False,
|
||||
default=0,
|
||||
help='Early stop the generation after a few milliseconds')
|
||||
|
||||
FLAGS = parser.parse_args()
|
||||
|
||||
print('=========')
|
||||
input_ids = [[
|
||||
28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257
|
||||
]]
|
||||
input_ids_data = np.array(input_ids, dtype=np.int32)
|
||||
input_lengths = [[len(ii)] for ii in input_ids]
|
||||
input_lengths_data = np.array(input_lengths, dtype=np.int32)
|
||||
request_output_len = [[FLAGS.request_output_len]]
|
||||
request_output_len_data = np.array(request_output_len, dtype=np.uint32)
|
||||
beam_width = [[FLAGS.beam_width]]
|
||||
beam_width_data = np.array(beam_width, dtype=np.uint32)
|
||||
temperature = [[FLAGS.temperature]]
|
||||
temperature_data = np.array(temperature, dtype=np.float32)
|
||||
streaming = [[FLAGS.streaming]]
|
||||
streaming_data = np.array(streaming, dtype=bool)
|
||||
end_id = np.array([[6303]], dtype=np.uint32)
|
||||
|
||||
inputs = prepare_inputs(input_ids_data, input_lengths_data,
|
||||
request_output_len_data, beam_width_data,
|
||||
temperature_data, streaming_data, end_id)
|
||||
|
||||
if FLAGS.stop_after_ms > 0:
|
||||
stop_inputs = prepare_stop_signals()
|
||||
else:
|
||||
stop_inputs = None
|
||||
|
||||
request_id = "12345"
|
||||
import random
|
||||
request_id = str(random.randint(3, 9000))
|
||||
|
||||
expected_output_ids = [
|
||||
input_ids[0] + [
|
||||
21221, 290, 257, 4255, 379, 262, 1957, 7072, 11, 4689, 347, 2852,
|
||||
2564, 494, 13, 679
|
||||
]
|
||||
]
|
||||
if FLAGS.streaming:
|
||||
actual_output_ids = [input_ids[0]]
|
||||
else:
|
||||
actual_output_ids = []
|
||||
|
||||
user_data = UserData()
|
||||
with grpcclient.InferenceServerClient(
|
||||
url=FLAGS.url,
|
||||
verbose=FLAGS.verbose,
|
||||
ssl=FLAGS.ssl,
|
||||
root_certificates=FLAGS.root_certificates,
|
||||
private_key=FLAGS.private_key,
|
||||
certificate_chain=FLAGS.certificate_chain,
|
||||
) as triton_client:
|
||||
try:
|
||||
|
||||
if FLAGS.streaming:
|
||||
|
||||
# Establish stream
|
||||
triton_client.start_stream(
|
||||
callback=partial(callback, user_data),
|
||||
stream_timeout=FLAGS.stream_timeout,
|
||||
)
|
||||
# Send request
|
||||
triton_client.async_stream_infer(
|
||||
'tensorrt_llm',
|
||||
inputs,
|
||||
request_id=request_id,
|
||||
)
|
||||
|
||||
if stop_inputs is not None:
|
||||
|
||||
time.sleep(FLAGS.stop_after_ms / 1000.0)
|
||||
|
||||
triton_client.async_stream_infer(
|
||||
'tensorrt_llm',
|
||||
stop_inputs,
|
||||
request_id=request_id,
|
||||
parameters={'Streaming': FLAGS.streaming})
|
||||
|
||||
#Wait for server to close the stream
|
||||
triton_client.stop_stream()
|
||||
|
||||
# Parse the responses
|
||||
while True:
|
||||
try:
|
||||
result = user_data._completed_requests.get(block=False)
|
||||
except Exception:
|
||||
break
|
||||
|
||||
if type(result) == InferenceServerException:
|
||||
print("Received an error from server:")
|
||||
print(result)
|
||||
else:
|
||||
output_ids = result.as_numpy('output_ids')
|
||||
|
||||
if output_ids is not None:
|
||||
if (FLAGS.streaming):
|
||||
# Only one beam is supported
|
||||
tokens = list(output_ids[0][0])
|
||||
actual_output_ids[
|
||||
0] = actual_output_ids[0] + tokens
|
||||
else:
|
||||
for beam_output_ids in output_ids[0]:
|
||||
tokens = list(beam_output_ids)
|
||||
actual_output_ids.append(tokens)
|
||||
else:
|
||||
print("Got cancellation response from server")
|
||||
else:
|
||||
# Send request
|
||||
triton_client.async_infer(
|
||||
'tensorrt_llm',
|
||||
inputs,
|
||||
request_id=request_id,
|
||||
callback=partial(callback, user_data),
|
||||
parameters={'Streaming': FLAGS.streaming})
|
||||
|
||||
if stop_inputs is not None:
|
||||
|
||||
time.sleep(FLAGS.stop_after_ms / 1000.0)
|
||||
|
||||
triton_client.async_infer(
|
||||
'tensorrt_llm',
|
||||
stop_inputs,
|
||||
request_id=request_id,
|
||||
callback=partial(callback, user_data),
|
||||
parameters={'Streaming': FLAGS.streaming})
|
||||
|
||||
processed_count = 0
|
||||
expected_responses = 1 + (1 if stop_inputs is not None else 0)
|
||||
while processed_count < expected_responses:
|
||||
try:
|
||||
result = user_data._completed_requests.get()
|
||||
print("Got completed request", flush=True)
|
||||
except Exception:
|
||||
break
|
||||
|
||||
if type(result) == InferenceServerException:
|
||||
print("Received an error from server:")
|
||||
print(result)
|
||||
else:
|
||||
output_ids = result.as_numpy('output_ids')
|
||||
if output_ids is not None:
|
||||
for beam_output_ids in output_ids[0]:
|
||||
tokens = list(beam_output_ids)
|
||||
actual_output_ids.append(tokens)
|
||||
else:
|
||||
print("Got response for cancellation request")
|
||||
|
||||
processed_count = processed_count + 1
|
||||
except Exception as e:
|
||||
print("channel creation failed: " + str(e))
|
||||
sys.exit()
|
||||
|
||||
passed = True
|
||||
|
||||
print("output_ids = ", actual_output_ids)
|
||||
if (FLAGS.check_output):
|
||||
passed = (actual_output_ids == expected_output_ids)
|
||||
print("expected_output_ids = ", expected_output_ids)
|
||||
print("\n=====")
|
||||
print("PASS!" if passed else "FAIL!")
|
||||
print("=====")
|
||||
|
||||
sys.exit(not passed)
|
||||
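Note: the new gRPC client above drives the TensorRT-LLM Triton backend in decoupled (streaming) mode. A condensed sketch of its streaming path, assuming a Triton server on localhost:8001 serving a model named "tensorrt_llm"; the real script prepares the full input set (input_lengths, request_output_len, beam_width, temperature, streaming, end_id), which is omitted here:

from functools import partial
import queue

import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def on_response(result_queue, result, error):
    # Triton invokes the callback with (result, error); queue whichever is set.
    result_queue.put(error if error else result)

with grpcclient.InferenceServerClient(url="localhost:8001") as client:
    inputs = [grpcclient.InferInput('input_ids', [1, 3], "INT32")]
    inputs[0].set_data_from_numpy(np.array([[1, 2, 3]], dtype=np.int32))
    client.start_stream(callback=partial(on_response, results))
    client.async_stream_infer('tensorrt_llm', inputs, request_id="1")
    client.stop_stream()  # blocks until the server closes the stream
    while not results.empty():
        print(results.get())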
@ -2145,6 +2145,7 @@ def test_llama_v3_speculative_decoding_bls(
|
||||
tensorrt_llm_llama_example_root,
|
||||
llama_v3_8b_model_root,
|
||||
llama_v3_70b_model_root,
|
||||
tensorrt_llm_example_root,
|
||||
llm_backend_inflight_batcher_llm_root,
|
||||
llm_backend_dataset_root,
|
||||
llm_backend_venv,
|
||||
@ -2161,16 +2162,19 @@ def test_llama_v3_speculative_decoding_bls(
|
||||
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
|
||||
# Build engine
|
||||
DRAFT_ENGINE_DIR = prepare_llama_v3_8b_engine(
|
||||
tensorrt_llm_example_root,
|
||||
tensorrt_llm_llama_example_root,
|
||||
llama_v3_8b_model_root,
|
||||
data_type=DATA_TYPE)
|
||||
CONTROL_ENGINE_DIR = prepare_llama_v3_70b_engine(
|
||||
"control_ifb",
|
||||
tensorrt_llm_example_root,
|
||||
tensorrt_llm_llama_example_root,
|
||||
llama_v3_70b_model_root,
|
||||
data_type=DATA_TYPE)
|
||||
TARGET_ENGINE_DIR = prepare_llama_v3_70b_engine(
|
||||
"target_ifb",
|
||||
tensorrt_llm_example_root,
|
||||
tensorrt_llm_llama_example_root,
|
||||
llama_v3_70b_model_root,
|
||||
data_type=DATA_TYPE)
|
||||
@ -2310,6 +2314,7 @@ def test_gpt_175b_dummyWeights_ifb(
|
||||
EXCLUDE_INPUT_IN_OUTPUT,
|
||||
inflight_batcher_llm_client_root,
|
||||
tensorrt_llm_gpt_example_root,
|
||||
tensorrt_llm_example_root,
|
||||
gpt_tokenizer_model_root,
|
||||
llm_backend_venv,
|
||||
):
|
||||
@ -2321,7 +2326,8 @@ def test_gpt_175b_dummyWeights_ifb(
|
||||
|
||||
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
|
||||
# Build Engine
|
||||
ENGINE_PATH = prepare_gpt_175b_engine("ifb", tensorrt_llm_gpt_example_root)
|
||||
ENGINE_PATH = prepare_gpt_175b_engine("ifb", tensorrt_llm_gpt_example_root,
|
||||
tensorrt_llm_example_root)
|
||||
# Prepare model repo
|
||||
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
|
||||
prepare_ib_model_repo(llm_backend_repo_root, new_model_repo)
|
||||
|
||||
@ -86,7 +86,8 @@ def test_valgrind_llama_v2_13b(
|
||||
|
||||
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
|
||||
# Build engine
|
||||
ENGINE_PATH = prepare_llama_v2_13b_engine(tensorrt_llm_llama_example_root,
|
||||
ENGINE_PATH = prepare_llama_v2_13b_engine(tensorrt_llm_example_root,
|
||||
tensorrt_llm_llama_example_root,
|
||||
llama_v2_tokenizer_model_root)
|
||||
|
||||
# Prepare model repo
|
||||
|
||||
@ -10,6 +10,7 @@ from .common import *
|
||||
@pytest.mark.skip_less_device_memory(80000)
|
||||
def test_gpt175b_dummyWeights_multi_node_engine_config(
|
||||
tensorrt_llm_gpt_example_root,
|
||||
tensorrt_llm_example_root,
|
||||
gpt_tokenizer_model_root,
|
||||
):
|
||||
ACCUMULATE_TOKEN = "False"
|
||||
@ -36,7 +37,8 @@ def test_gpt175b_dummyWeights_multi_node_engine_config(
|
||||
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
|
||||
# Build Engine
|
||||
ENGINE_PATH = prepare_gpt_multi_node_engine("ifb",
|
||||
tensorrt_llm_gpt_example_root)
|
||||
tensorrt_llm_gpt_example_root,
|
||||
tensorrt_llm_example_root)
|
||||
# Prepare model repo
|
||||
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
|
||||
prepare_ib_model_repo(llm_backend_repo_root, new_model_repo)
|
||||
|
||||
@ -42,7 +42,7 @@ def get_rcca_path():
|
||||
@pytest.mark.parametrize("KV_CACHE_FREE_GPU_MEM_FRACTION", [""])
|
||||
@pytest.mark.parametrize("ENABLE_TRT_OVERLAP", ["False"],
|
||||
ids=["disableTrtOverlap"])
|
||||
@pytest.mark.parametrize("BATCHING_STRATEGY", ["V1"])
|
||||
@pytest.mark.parametrize("BATCHING_STRATEGY", ["inflight_fused_batching"])
|
||||
@pytest.mark.parametrize("DECOUPLED_MODE", ["False"],
|
||||
ids=["disableDecoupleMode"])
|
||||
@pytest.mark.parametrize("TRITON_MAX_BATCH_SIZE", ["128"])
|
||||
@ -618,6 +618,7 @@ def test_rcca_bug_4714193(
|
||||
TOP_K,
|
||||
TOP_P,
|
||||
TEMPERATURE,
|
||||
tensorrt_llm_example_root,
|
||||
tensorrt_llm_mixtral_example_root,
|
||||
mixtral_8x7b_v0_1_model_root,
|
||||
llm_backend_root,
|
||||
@ -631,8 +632,8 @@ def test_rcca_bug_4714193(
|
||||
llm_backend_repo_root = os.environ["LLM_BACKEND_ROOT"]
|
||||
# Build engine
|
||||
ENGINE_PATH = prepare_rcca_nvbug_4714193_engine(
|
||||
tensorrt_llm_mixtral_example_root, mixtral_8x7b_v0_1_model_root,
|
||||
llm_backend_root)
|
||||
tensorrt_llm_example_root, tensorrt_llm_mixtral_example_root,
|
||||
mixtral_8x7b_v0_1_model_root, llm_backend_root)
|
||||
|
||||
# Prepare model repo
|
||||
new_model_repo = os.path.join(llm_backend_repo_root, "triton_repo")
|
||||
|
||||
@ -6,7 +6,8 @@ import platform
import signal
import subprocess
import sys
import tempfile
import time
import warnings

import psutil

@ -68,7 +69,9 @@ if is_linux():

return pids

def cleanup_process_tree(p: subprocess.Popen, has_session=False):
def cleanup_process_tree(p: subprocess.Popen,
has_session=False,
verbose_message=False):
target_pids = set()
if has_session:
# Session ID is the pid of the leader process
@ -82,8 +85,30 @@ if is_linux():
except psutil.Error:
pass

print("Found leftover pids:", target_pids)
for pid in target_pids:
persist_pids = []
if target_pids:
# Grace period
time.sleep(5)

lines = []
for pid in sorted(target_pids):
try:
sp = psutil.Process(pid)
if verbose_message:
cmdline = sp.cmdline()
lines.append(f"{pid}: {cmdline}")
persist_pids.append(pid)
except psutil.Error:
pass

if persist_pids:
msg = f"Found leftover subprocesses: {persist_pids} launched by {p.args}"
if verbose_message:
detail = '\n'.join(lines)
msg = f"{msg}\n{detail}"
warnings.warn(msg)

for pid in persist_pids:
try:
os.kill(pid, signal.SIGKILL)
except (ProcessLookupError, PermissionError):
@ -148,6 +173,29 @@ elif is_windows():
p.kill()


@contextlib.contextmanager
def popen(*popenargs,
start_new_session=True,
suppress_output_info=False,
**kwargs):
if not suppress_output_info:
print(f"Start subprocess with popen({popenargs}, {kwargs})")

with Popen(*popenargs, start_new_session=start_new_session, **kwargs) as p:
try:
yield p
if start_new_session:
cleanup_process_tree(p, True, True)
except Exception as e:
cleanup_process_tree(p, start_new_session)
if isinstance(e, subprocess.TimeoutExpired):
print("Process timed out.")
stdout, stderr = p.communicate()
e.output = stdout
e.stderr = stderr
raise


def call(*popenargs,
timeout=None,
start_new_session=True,
@ -155,31 +203,11 @@ def call(*popenargs,
**kwargs):
if not suppress_output_info:
print(f"Start subprocess with call({popenargs}, {kwargs})")

running_log = None
if "running_log" in kwargs:
if isinstance(kwargs["running_log"], tempfile._TemporaryFileWrapper):
running_log = kwargs["running_log"]
kwargs.pop("running_log", 'Not Found')
with Popen(*popenargs,
with popen(*popenargs,
start_new_session=start_new_session,
stdout=running_log,
suppress_output_info=True,
**kwargs) as p:
try:
retcode = p.wait(timeout=timeout)
if retcode and start_new_session:
cleanup_process_tree(p, True)
return retcode
except Exception as e:
if isinstance(e, subprocess.TimeoutExpired):
print("Process timed out.")
stdout, stderr = p.communicate()
if stdout:
print("STDOUT:", stdout.decode('utf-8', errors='replace'))
if stderr:
print("STDERR:", stderr.decode('utf-8', errors='replace'))
cleanup_process_tree(p, start_new_session)
raise
return p.wait(timeout=timeout)


def check_call(*popenargs, **kwargs):
@ -212,9 +240,9 @@ def check_output(*popenargs, timeout=None, start_new_session=True, **kwargs):
cleanup_process_tree(process, start_new_session)
raise
retcode = process.poll()
if start_new_session:
cleanup_process_tree(process, True, True)
if retcode:
if start_new_session:
cleanup_process_tree(process, True)
raise subprocess.CalledProcessError(retcode,
process.args,
output=stdout,
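Note: with popen now a context manager and call/check_call layered on top of it, callers get leftover-process cleanup without any suite-specific running_log keyword. A small usage sketch of the three entry points, assuming the module is importable as defs.trt_test_alternative and using trivial placeholder commands:

import subprocess

from defs.trt_test_alternative import call, check_call, popen

# Context-managed: the child and any leftover session processes are reaped on exit,
# even if the body raises or the wait below times out.
with popen(["sleep", "5"], stdout=subprocess.DEVNULL) as proc:
    proc.wait(timeout=30)

# call()/check_call() wrap the same context manager, so stdout= can be any file object.
with open("cmd.log", "w") as log:
    retcode = call(["echo", "hello"], stdout=log)
check_call(["true"])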
|
||||
|
||||
@ -375,6 +375,14 @@ accuracy/test_cli_flow.py::TestLlama3_2_1B::test_fp8_rowwise
|
||||
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_weight_streaming[1.0]
|
||||
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache
|
||||
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache_beam_search
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_smooth_quant
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_smooth_quant_ootb
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_int4_awq
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_int4_awq_int8_kv_cache
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_pp2
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_fp8_rowwise
|
||||
accuracy/test_cli_flow.py::TestMistral7B::test_beam_search
|
||||
accuracy/test_cli_flow.py::TestMistral7B::test_fp8_tp4pp2
|
||||
accuracy/test_cli_flow.py::TestMistral7B::test_smooth_quant_tp4pp1
|
||||
@ -425,7 +433,7 @@ accuracy/test_llm_api.py::TestMixtral8x7B::test_tp2
|
||||
accuracy/test_llm_api.py::TestMixtral8x7B::test_smooth_quant_tp2pp2
|
||||
accuracy/test_llm_api.py::TestMixtral8x7BInstruct::test_awq_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8B::test_nvfp4
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_decoder
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_llm_sampler
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3_70BInstruct::test_fp8_tp4
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3_70BInstruct::test_nvfp4_tp4
|
||||
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
|
||||
@ -445,8 +453,16 @@ accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp_
|
||||
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
|
||||
accuracy/test_llm_api_pytorch.py::TestMinitron4BBaseInstruct::test_fp8_prequantized
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronNas::test_auto_dtype_tp8
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronSuper::test_auto_dtype_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2
|
||||
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
|
||||
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_fp8_prequantized_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronNano::test_auto_dtype
|
||||
accuracy/test_cli_flow.py::TestNemotronNano::test_auto_dtype
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
|
||||
accuracy/test_cli_flow.py::TestNemotronUltra::test_auto_dtype[tp8ep4-cuda_graph=True]
|
||||
accuracy/test_cli_flow.py::TestNemotronUltra::test_fp8_prequantized[tp8ep4-cuda_graph=True]
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronH::test_auto_dtype
|
||||
accuracy/test_llm_api_pytorch.py::TestQwen2_7BInstruct::test_auto_dtype
|
||||
accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_nvfp4_8gpus[latency]
|
||||
|
||||
@ -24,6 +24,7 @@ test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-FP8-llama-3.1-model/Llama-
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-8B]
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-70B]
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-FP8-llama-3.1-model/Llama-3.1-70B-Instruct-FP8]
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-NVFP4-nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf]
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-FP8-nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8]
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-NVFP4-nvfp4-quantized/Mixtral-8x7B-Instruct-v0.1]
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-FP8-Mixtral-8x7B-Instruct-v0.1-fp8]
|
||||
|
||||
@ -101,6 +101,7 @@ accuracy/test_cli_flow.py::TestLlama3_1_8B::test_fp8_rowwise_tp4[disable_gemm_al
|
||||
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_autoq
|
||||
accuracy/test_cli_flow.py::TestLlama3_1_8BInstruct::test_medusa_fp8_prequantized
|
||||
accuracy/test_cli_flow.py::TestLlama3_2_1B::test_auto_dtype
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_2_1B::test_auto_dtype
|
||||
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_fp8_prequantized_tp4
|
||||
accuracy/test_cli_flow.py::TestLlama3_3_70BInstruct::test_nvfp4_prequantized_tp4
|
||||
accuracy/test_cli_flow.py::TestMistral7B::test_fp8_tp4pp2
|
||||
@ -120,14 +121,21 @@ accuracy/test_llm_api_pytorch.py::TestLlama4ScoutInstruct::test_auto_dtype[tp8-c
|
||||
accuracy/test_llm_api_pytorch.py::TestMixtral8x7B::test_fp8_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestMixtral8x7B::test_nvfp4_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronNas::test_auto_dtype_tp8
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronSuper::test_auto_dtype_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
|
||||
accuracy/test_cli_flow.py::TestLlama3_3NemotronSuper49Bv1::test_auto_dtype_tp2
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronNano::test_auto_dtype
|
||||
accuracy/test_cli_flow.py::TestNemotronNano::test_auto_dtype
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_auto_dtype[tp8-cuda_graph=False]
|
||||
accuracy/test_llm_api_pytorch.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=False]
|
||||
accuracy/test_cli_flow.py::TestNemotronUltra::test_auto_dtype[tp8-cuda_graph=False]
|
||||
accuracy/test_cli_flow.py::TestNemotronUltra::test_fp8_prequantized[tp8-cuda_graph=False]
|
||||
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
|
||||
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
|
||||
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]
|
||||
accuracy/test_llm_api_pytorch.py::TestQwen3_8B::test_fp8_block_scales[latency]
|
||||
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8_block_scales[latency]
|
||||
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[latency_moe_cutlass]
|
||||
accuracy/test_llm_api_pytorch.py::TestPhi4MiniInstruct::test_auto_dtype
|
||||
|
||||
# Pivot to Pytorch test cases.
|
||||
test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-BF16-llama-3.1-model/Meta-Llama-3.1-8B]
|
||||
|
||||
@ -58,9 +58,6 @@ trt_llm_release_perf_sanity_test:
|
||||
# E2E gptManagerBenchmark IFB
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-cppmanager-exe-static_batching-plugin_ifb-float16-bs:8+64-input_output_len:128,128+512,32]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-cppmanager-exe-plugin_ifb-bfloat16-gwp:0.0-input_output_len:128,128+512,32]
|
||||
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:512,32]
|
||||
- perf/test_perf.py::test_perf[qwen2_7b_instruct-bench-float16-input_output_len:128,128]
|
||||
|
||||
|
||||
@ -49,8 +49,14 @@ trt_llm_release_perf_test:
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-bfloat16-maxbs:64-input_output_len:20000,2000-con:250]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-bfloat16-maxbs:64-input_output_len:20000,2000-quant:fp8-con:250]
|
||||
# pyt backend
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-input_output_len:128,128]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-input_output_len:2000,2000]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-maxnt:5000-input_output_len:5000,500-reqs:8-con:1]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:500,2000-reqs:8-con:1]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:1000,1000-reqs:8-con:1]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-maxnt:20000-input_output_len:20000,2000-reqs:8-con:1]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:5000,500-reqs:500-con:250]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:500,2000-reqs:500-con:250]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:1000,1000-reqs:500-con:250]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_nemotron_nano_8b-bench-pytorch-bfloat16-maxbs:512-input_output_len:20000,2000-reqs:500-con:250]
|
||||
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:128,128]
|
||||
- perf/test_perf.py::test_perf[llama_v3.1_8b_instruct-bench-bfloat16-input_output_len:512,32]
|
||||
|
||||
@ -19,5 +19,6 @@ l0_dgx_h200:
|
||||
- accuracy/test_llm_api_pytorch.py::TestDeepSeekR1::test_fp8_blockscale[latency] # 1h
|
||||
- accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype[True]
|
||||
- accuracy/test_disaggregated_serving.py::TestLlama4ScoutInstruct::test_auto_dtype[False]
|
||||
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-enable_graph-tp8-trtllm-scout]
|
||||
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-disable_adp-enable_graph-tp8-trtllm-scout]
|
||||
- unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep4-enable_adp-enable_graph-tp8-trtllm-scout]
|
||||
- unittest/llmapi/test_llm_pytorch.py::test_nemotron_nas_lora
|
||||
|
||||
@ -25,6 +25,7 @@ l0_rtx_pro_6000:
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-8B-FP8-llama-3.1-model/Llama-3.1-8B-Instruct-FP8]
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-NVFP4-nvfp4-quantized/Meta-Llama-3.1-70B]
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Llama3.1-70B-FP8-llama-3.1-model/Llama-3.1-70B-Instruct-FP8]
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-NVFP4-nvfp4-quantized/Llama-3_3-Nemotron-Super-49B-v1_nvfp4_hf]
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Nemotron-Super-49B-v1-FP8-nemotron-nas/Llama-3_3-Nemotron-Super-49B-v1-FP8]
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-NVFP4-nvfp4-quantized/Mixtral-8x7B-Instruct-v0.1]
|
||||
- test_e2e.py::test_ptp_quickstart_advanced[Mixtral-8x7B-FP8-Mixtral-8x7B-Instruct-v0.1-fp8]
|
||||
|
||||
@ -83,7 +83,6 @@ full:B200_PCIe/examples/test_llama.py::test_llm_llama_v2_lora_1gpu[chinese-llama
|
||||
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-mini-128k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-8k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/functional SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/quantization SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
|
||||
@ -174,7 +173,6 @@ full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-128k
|
||||
full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3-mini-128k-instruct-fp8-float16] SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3.5-mini-instruct-fp8-float16] SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/functional SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/quantization SKIP (Disable for Blackwell)
|
||||
full:B200/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
|
||||
@ -330,11 +328,6 @@ full:B200/test_e2e.py::test_ptp_quickstart_advanced[Nemotron4_4B-BF16-nemotron/M
|
||||
full:B200/test_e2e.py::test_ptp_scaffolding[DeepSeek-R1-Distill-Qwen-7B-DeepSeek-R1/DeepSeek-R1-Distill-Qwen-7B] SKIP (https://nvbugs/5136994)
|
||||
full:B200/test_e2e.py::test_trtllm_bench_pytorch_backend_sanity[meta-llama/Llama-3.1-8B-llama-3.1-8b-hf-nvfp4-False-False] SKIP (https://nvbugs/5136994)
|
||||
examples/test_multimodal.py::test_llm_multimodal_general[kosmos-2-pp:1-tp:1-float16-bs:8-cpp_e2e:True-nb:1] SKIP (https://nvbugs/5141288)
|
||||
examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2_vl_7b_instruct-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5141290)
|
||||
examples/test_qwen.py::test_llm_qwen_single_gpu_summary[qwen2_vl_7b_instruct-enable_paged_kv_cache-enable_remove_input_padding-disable_weight_only-disable_fmha] SKIP (https://nvbugs/5141290)
|
||||
examples/test_qwen.py::test_llm_qwen_single_gpu_summary[qwen2_vl_7b_instruct-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha_fp32_acc] SKIP (https://nvbugs/5141290)
|
||||
examples/test_qwen.py::test_llm_qwen_awq_single_gpu_summary[qwen2_vl_7b_instruct-nb:4] SKIP (https://nvbugs/5141290)
|
||||
examples/test_qwen.py::test_llm_hf_qwen_quantization_1gpu[qwen2_vl_7b_instruct-fp8-bfloat16] SKIP (https://nvbugs/5141290)
|
||||
unittest/_torch/auto_deploy/integration/test_lm_eval.py SKIP (https://nvbugs/5144854)
|
||||
examples/test_qwen.py::test_llm_qwen1_5_moe_plugin_single_gpu_lora[qwen1.5_moe_a2.7b_chat-Upcycled-Qwen1.5-MoE2.7B-LoRA] SKIP (https://nvbugs/5155141)
|
||||
|
||||
@ -368,7 +361,6 @@ full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:w4
|
||||
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:int8_sq_per_tensor] SKIP (https://nvbugspro.nvidia.com/bug/5161074)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[quant:int8_sq_per_token_channel] SKIP (https://nvbugspro.nvidia.com/bug/5161074)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_cpp_session-recurrentgemma-2b-use_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5174573)
examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (https://nvbugs/5180961)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-no_paged_cache-disable_quant-float16-disable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-no_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
examples/test_recurrentgemma.py::test_llm_recurrentgemma_1gpu[use_py_session-recurrentgemma-2b-use_paged_cache-disable_quant-float16-enable_attn_plugin-enable_gemm_plugin] SKIP (https://nvbugs/5214221)
@ -401,6 +393,9 @@ perf/test_perf.py::test_perf[t5-bench-float16-input_output_len:128,20-gpus:2] SK
perf/test_perf.py::test_perf[t5-bench-float16-maxbs:1-input_output_len:128,20-gpus:2] SKIP
perf/test_perf.py::test_perf[gpt_20b-bench-float16-maxbs:8-input_output_len:128,128-reqs:80-gpus:8] SKIP
perf/test_perf.py::test_perf[gpt_20b-bench-float16-maxbs:8-input_output_len:512,32-reqs:80-gpus:8] SKIP
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:512-input_output_len:128,128-ep:8-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:1-input_output_len:1000,2000-reqs:10-ep:4-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:B200/perf/test_perf.py::test_perf[deepseek_r1_fp8-bench-pytorch-float8-maxbs:384-maxnt:1536-input_output_len:1000,2000-reqs:49152-con:3072-ep:8-tp:8-gpus:8] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[deepseek_v3_lite_fp8-bench-pytorch-float8-input_output_len:128,128] SKIP (https://nvbugspro.nvidia.com/bug/5150255)
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[mixtral_8x7b_v0.1_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-tp:2-gpus:2] SKIP #https://docs.google.com/spreadsheets/d/1EvwCcJ5o2zmhVxFFxAAz-49UzswMlfN2y5K37Fkyw7A/edit?gid=907483661#gid=907483661
full:RTX_PRO_6000_Blackwell_Server_Edition/perf/test_perf.py::test_perf[llama_v3.3_nemotron_49b-bench-pytorch-bfloat16-input_output_len:128,128-tp:2-gpus:2] SKIP #https://docs.google.com/spreadsheets/d/1EvwCcJ5o2zmhVxFFxAAz-49UzswMlfN2y5K37Fkyw7A/edit?gid=907483661#gid=907483661
@ -413,6 +408,7 @@ accuracy/test_cli_flow.py::TestLlama3_2_1B::test_cyclic_kv_cache SKIP (https://n
test_e2e.py::test_ptp_quickstart_multimodal[NVILA-8B-FP16-vila/NVILA-8B-image] SKIP (https://nvbugs/5233423)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[tp4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16_4gpus[ep4-mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[tp4-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False] SKIP (https://nvbugs/5294983)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[tp4-mtp_nextn=2-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_4gpus[ep4-mtp_nextn=2-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] SKIP (https://nvbugs/5239087)
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False] SKIP (https://nvbugs/5234002)
@ -426,13 +422,13 @@ examples/test_bert.py::test_llm_bert_general[compare_hf-enable_remove_input_padd
examples/test_bert.py::test_llm_bert_general[compare_hf-enable_remove_input_padding-use_attention_plugin-enable_context_fmha-tp:2-pp:1-float16-RobertaForQuestionAnswering-bert/roberta-base-squad2] SKIP (https://nvbugs/5234058)
disaggregated/test_disaggregated.py::test_disaggregated_cuda_graph[TinyLlama-1.1B-Chat-v1.0] SKIP (https://nvbugs/5247271)
disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_tp1_attention_dp_overlap_one_mtp[DeepSeek-V3-Lite-fp8] SKIP (https://nvbugspro.nvidia.com/bug/5273945)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep1-disable_adp-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
unittest/_torch/multi_gpu_modeling/test_llama4.py::test_llama4[pp1-ep4-enable_adp-enable_graph-tp8-trtllm-scout] SKIP (https://nvbugs/5274229)
accuracy/test_cli_flow.py::TestLlama3_1_8B::test_tp4[enable_gemm_allreduce_plugin] SKIP (https://nvbugs/5247786)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen1.5_7b_chat-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2_7b_instruct-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_qwen.py::test_llm_qwen_7b_multi_gpus_summary[qwen2.5_7b_chat-enable_fmha_fp32_acc-enable_plugin-tp2pp2-nb:4] SKIP (https://nvbugs/5247837)
full:B200/examples/test_mixtral.py::test_llm_mixtral_pp_reduce_scatter_4gpus[Mixtral-8x7B-v0.1] SKIP (https://nvbugs/5247837)
examples/test_qwen.py::test_llm_qwen_smooth_quant_single_gpu_summary[qwen2_vl_7b_instruct-enable_ptpc-nb:4] SKIP (https://nvbugs/5273694)
accuracy/test_cli_flow.py::TestMixtral8x22B::test_int8_plugin_tp8[renormalize-tensor_parallel] SKIP (https://nvbugs/5273695)
test_e2e.py::test_ptp_quickstart_advanced_8gpus[Nemotron-Ultra-253B-nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1] SKIP (https://nvbugs/5273697)
examples/test_gpt.py::test_starcoder_fp8_quantization_2gpu[starcoder] SKIP (https://nvbugs/5144931)
@ -440,7 +436,6 @@ examples/test_gpt.py::test_starcoder_fp8_quantization_2gpu[starcoderplus] SKIP (
unittest/_torch -k "not (modeling or multi_gpu or auto_deploy)" SKIP (https://nvbugs/5280806)
examples/test_whisper.py::test_llm_whisper_general[large-v3-disable_gemm_plugin-disable_attention_plugin-disable_weight_only-float16-nb:1-use_python_runtime] SKIP (https://nvbugs/5244570)
unittest/_torch/speculative/test_eagle3.py SKIP (https://nvbugs/5280806)
test_e2e.py::test_ptp_quickstart_multimodal[qwen2-vl-7b-instruct-Qwen2-VL-7B-Instruct-image] SKIP (https://nvbugs/5226211)
triton_server/test_triton_rcca.py::test_mistral_beam_search[rcca_4714407-True-10-False-True-False-0-128-disableDecoupleMode-inflight_fused_batching-disableTrtOverlap-guaranteed_no_evict-1-1-1-False-ensemble] SKIP (https://nvbugs/5240060)
triton_server/test_triton.py::test_triton_extensive[triton-extensive] SKIP
triton_server/test_triton.py::test_gpt_speculative_decoding[gpt-speculative-decoding] SKIP

@ -19,15 +19,22 @@ from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
@pytest.mark.parametrize("tp_size", [1, 8], ids=["tp1", "tp8"])
@pytest.mark.parametrize("use_cuda_graph", [True, False],
                         ids=["enable_graph", "disable_graph"])
@pytest.mark.parametrize("enable_attention_dp", [True, False],
                         ids=["enable_adp", "disable_adp"])
@pytest.mark.parametrize("ep_size", [4, 1], ids=["ep4", "ep1"])
@pytest.mark.parametrize("pp_size", [1, 8], ids=["pp1", "pp8"])
def test_llama4(model_name, backend, tp_size, use_cuda_graph, ep_size, pp_size):
def test_llama4(model_name, backend, tp_size, use_cuda_graph,
                enable_attention_dp, ep_size, pp_size):
    if pp_size > 1 and (ep_size > 1 or tp_size > 1):
        return

    if pp_size == 1 and tp_size == 1:
        return

    if enable_attention_dp and not (tp_size == 8 and ep_size == 4
                                    and pp_size == 1):
        pytest.skip("Skip this attention DP test case to avoid too many tests")

    prompts = [{
        "prompt": "The president of the United States is"
    }, {
@ -52,6 +59,7 @@ def test_llama4(model_name, backend, tp_size, use_cuda_graph, ep_size, pp_size):
        moe_tensor_parallel_size=tp_size // ep_size,
        pytorch_backend_config=pytorch_config,
        pipeline_parallel_size=pp_size,
        enable_attention_dp=enable_attention_dp,
    )
    with llm:
        outputs = llm.generate(