chore: Remove deprecated Python runtime benchmark (#4171)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Kaiyu Xie 2025-05-14 18:41:05 +08:00 committed by GitHub
parent f4059c6e2e
commit 6c45586c51
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
26 changed files with 14 additions and 2430 deletions

View File

@@ -2,11 +2,9 @@
 ## Overview
-There are currently three workflows to benchmark TensorRT-LLM:
+There are currently two workflows to benchmark TensorRT-LLM:
+* [`trtllm-bench`](../docs/source/performance/perf-benchmarking.md)
+  - `trtllm-bench` is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
+  - _NOTE_: This benchmarking suite is a current work in progress and is prone to large changes.
 * [C++ benchmarks](./cpp)
   - The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
-* [Python benchmarks](./python)
-  - The Python benchmarking scripts can only benchmark the Python runtime, which do not support the latest features, such as in-flight batching.
-* [The Python benchmarking suite](../docs/source/performance/perf-benchmarking.md)
-  - This benchmarker is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
-  - _NOTE_: This benchmarking suite is a current work in progress and is prone to large changes.

View File

@@ -1,51 +0,0 @@
# Benchmark Python Runtime
> [!WARNING]
> The Python benchmark is not recommended for benchmarking; please use the C++ benchmark instead.
> The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
multiple GPUs, or multiple nodes with multiple GPUs, using the Python runtime.
## Overview
The benchmark implementation and entry point can be found in [`benchmarks/python/benchmark.py`](./benchmark.py). The directory also contains the following supporting scripts (a minimal sketch of how they fit together is shown after this list):
* [`benchmarks/python/base_benchmark.py`](./base_benchmark.py) implements the base class for the benchmarks.
* [`benchmarks/python/gpt_benchmark.py`](./gpt_benchmark.py) implements the benchmark for GPT and GPT-like (LLaMA/OPT/GPT-J/SmoothQuant-GPT) models.
* [`benchmarks/python/bert_benchmark.py`](./bert_benchmark.py) implements the benchmark for BERT models.
* [`benchmarks/python/enc_dec_benchmark.py`](./enc_dec_benchmark.py) implements the benchmark for encoder-decoder models.
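The entry point and the model-specific classes follow a simple contract: `benchmark.py` iterates over the configurations yielded by a `BaseBenchmark` subclass, prepares inputs, times `run`, and calls `report`. A minimal, self-contained sketch of that driver loop (toy class and timings, not the real implementation) looks like this:
```
import time


class ToyBenchmark:
    """Illustrative stand-in for a BaseBenchmark subclass; not part of the real suite."""

    def get_config(self):
        # Yield (batch_size, input_len, output_len) combinations, as the real benchmarks do.
        for batch_size in (1, 8):
            yield (batch_size, 128, 20)

    def prepare_inputs(self, config):
        batch_size, input_len, _ = config
        return [[0] * input_len for _ in range(batch_size)]  # placeholder inputs

    def run(self, inputs, config):
        time.sleep(0.001)  # stand-in for engine execution

    def report(self, config, latency_ms):
        print(f"[BENCHMARK] batch_size {config[0]} input_length {config[1]} latency(ms) {latency_ms:.3f}")


benchmarker = ToyBenchmark()
for config in benchmarker.get_config():
    inputs = benchmarker.prepare_inputs(config)
    start = time.perf_counter()
    benchmarker.run(inputs, config)
    benchmarker.report(config, (time.perf_counter() - start) * 1000.0)
```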
## Usage
Please use the `-h`/`--help` option for detailed usage:
```
python benchmark.py -h
```
### 1. Single GPU benchmark
Take LLaMA 7B as an example:
```
python benchmark.py \
-m dec \
--engine_dir llama_7b \
--batch_size "1;8;64" \
--input_output_len "60,20;128,20"
```
Expected outputs:
```
[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 170.77 percentile95(ms) 117.591 percentile99(ms) 124.262 latency(ms) 117.115 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 110.189 total_generated_tokens 19.0 generation_tokens_per_second 172.43
[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 8 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 1478.55 percentile95(ms) 108.641 percentile99(ms) 109.546 latency(ms) 108.214 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 98.194 total_generated_tokens 152.0 generation_tokens_per_second 1547.951
[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 64 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 8214.87 percentile95(ms) 156.748 percentile99(ms) 160.203 latency(ms) 155.815 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 111.078 total_generated_tokens 1216.0 generation_tokens_per_second 10947.303
...
```
*Please note that the expected outputs are for reference only; actual performance numbers depend on the GPU you are using.*
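Each `[BENCHMARK]` line above is a flat sequence of `key value` pairs, so it is straightforward to post-process. A small sketch (not part of the benchmark suite) that turns one such line into a dictionary:
```
def parse_benchmark_line(line: str) -> dict:
    """Parse a '[BENCHMARK] key value key value ...' line into a dict of strings."""
    tokens = line.split("[BENCHMARK]", 1)[1].split()
    # Tokens alternate between field names and values, e.g. 'latency(ms) 117.115'.
    return dict(zip(tokens[0::2], tokens[1::2]))


example = ("[BENCHMARK] model_name dec world_size 2 batch_size 1 "
           "input_length 60 output_length 20 latency(ms) 117.115")
print(parse_benchmark_line(example)["latency(ms)"])  # -> 117.115
```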
### 2. Multi-GPU benchmark
Take LLaMA 7B as an example:
```
mpirun -n 2 python benchmark.py \
-m dec \
--engine_dir llama_7b \
--batch_size "1;8;64" \
--input_output_len "60,20;128,20"
```

View File

@@ -1,139 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from argparse import ArgumentParser
# isort: off
import torch
# isort: on
from cuda import cuda, cudart
import tensorrt_llm as tllm
from tensorrt_llm import Mapping, Tensor
from tensorrt_llm._utils import local_mpi_rank, local_mpi_size
from tensorrt_llm.functional import (AllReduceParams, AllReduceStrategy,
allreduce)
from tensorrt_llm.plugin.plugin import (current_all_reduce_helper,
init_all_reduce_helper)
from tensorrt_llm.runtime import Session
def allreduce_benchmark(dtype: str,
test_range: str = "10,10000000,10",
no_header: bool = False):
tllm.logger.set_level('error')
world_size = tllm.mpi_world_size()
rank = tllm.mpi_rank()
local_rank = local_mpi_rank()
gpus_per_node = local_mpi_size()
torch.cuda.set_device(local_rank)
cudart.cudaSetDevice(local_rank)
mapping = Mapping(world_size, rank, gpus_per_node, tp_size=world_size)
if world_size == 1:
raise RuntimeError("Benchmark must run with mpi_world_size > 1")
torch_dtype = tllm._utils.str_dtype_to_torch(dtype)
min_size, max_size, ratio = [int(i) for i in test_range.split(",")]
inner_loop = 1000
size = min_size
dtype_size = torch.finfo(torch_dtype).bits // 8
if mapping.rank == 0 and not no_header:
print(
f"{'world_size':<15}, {'dtype':<10}, {'message size':<15}, {'strategy':<15}, {'duration (ms)':<10}"
)
while size < max_size:
input = torch.ones(size, dtype=torch_dtype, device="cuda")
for strategy in [
AllReduceStrategy.AUTO,
AllReduceStrategy.NCCL,
AllReduceStrategy.ONESHOT,
AllReduceStrategy.TWOSHOT,
]:
builder = tllm.Builder()
net = builder.create_network()
net.plugin_config.set_nccl_plugin(dtype)
init_all_reduce_helper()
_buffers, workspace = current_all_reduce_helper(
).allocate_workspace(mapping, size * dtype_size)
with tllm.net_guard(net):
tllm.default_trtnet()
x = Tensor(name='x',
shape=input.shape,
dtype=tllm.str_dtype_to_trt(dtype))
current_all_reduce_helper().set_workspace_tensor(mapping)
current = x
for _ in range(inner_loop):
current = allreduce(
current,
mapping.tp_group,
all_reduce_params=AllReduceParams(strategy=strategy))
current.mark_output('output', dtype)
feed_dict = {'x': input, 'all_reduce_workspace': workspace}
builder_config = builder.create_builder_config(precision=dtype)
engine = builder.build_engine(net, builder_config)
assert engine is not None, "Failed to build engine"
session = Session.from_serialized_engine(engine)
_, start = cuda.cuEventCreate(0)
_, stop = cuda.cuEventCreate(0)
runtimes = []
tllm.mpi_barrier()
output = torch.empty(input.shape, dtype=torch_dtype, device='cuda')
stream = torch.cuda.current_stream()
for _ in range(10):
cuda.cuEventRecord(start, stream.cuda_stream)
session.run(inputs=feed_dict,
outputs={"output": output},
stream=stream.cuda_stream)
cuda.cuEventRecord(stop, stream.cuda_stream)
torch.cuda.synchronize()
_, ms = cuda.cuEventElapsedTime(start, stop)
runtimes.append(ms)
median_ms = sorted(runtimes)[len(runtimes) // 2]
allreduce_ref = (input * world_size)**inner_loop
assert torch.allclose(output, allreduce_ref)
if mapping.rank == 0:
print(
f"{mapping.world_size:<15}, {dtype:<10}, {size:<15}, {strategy.name:<15}, {median_ms:<10.2f}"
)
size *= ratio
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("--dtype", "-t", default="float16")
parser.add_argument(
"--range",
"-r",
default="256,256000000,10", # 256 to 256M
help="min_size,max_size,multiplicative_ratio")
parser.add_argument("--no-header", action="store_true")
args = parser.parse_args()
allreduce_benchmark(args.dtype, args.range, args.no_header)

View File

@@ -1,211 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import subprocess
import time
from collections import OrderedDict
import torch
import tensorrt_llm
from tensorrt_llm.logger import logger
from tensorrt_llm.quantization import QuantMode
def get_compute_cap():
output = subprocess.check_output(
['nvidia-smi', "--query-gpu=compute_cap", "--format=csv"])
_, csv_value, *_ = output.splitlines()
return str(int(float(csv_value) * 10))
def get_csv_filename(model, dtype, tp_size, **kwargs):
sm = get_compute_cap()
if len(kwargs) == 0:
kw_pairs = ""
else:
kw_pairs = "_" + "_".join([str(k) + str(v) for k, v in kwargs.items()])
return f'{model}_{dtype}_tp{tp_size}_{kw_pairs}_sm{sm}.csv'
def get_engine_name(model, dtype, tp_size, rank):
return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)
def serialize_engine(engine, path):
logger.info(f'Serializing engine to {path}...')
tik = time.time()
with open(path, 'wb') as f:
# The engine object already complies with the Python buffer protocol, so there is no need to
# convert it to a bytearray before writing; the conversion would consume a lot of memory.
f.write(engine)
tok = time.time()
t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
logger.info(f'Engine serialized. Total time: {t}')
def get_last_path_component(path):
normalized_path = os.path.normpath(path)
last_component = os.path.basename(normalized_path)
return last_component
class BaseBenchmark(object):
def __init__(self, engine_dir, model_name, dtype, rank, world_size):
self.engine_dir = engine_dir
self.model_name = model_name
self.dtype = dtype
self.runtime_rank = rank
self.world_size = world_size
self.engine_model_name = model_name
self.quant_mode = QuantMode(0)
self.enable_fp8 = False
# Read config from engine directory
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
self.config = json.load(f)
# Sanity checks
if 'pretrained_config' in self.config: # new build api branch
config_dtype = self.config['pretrained_config']['dtype']
assert dtype == config_dtype, f"Engine dtype ({config_dtype}) != Runtime dtype ({dtype})"
world_size = self.config['pretrained_config']['mapping'][
'world_size']
assert world_size == self.world_size, \
(f'Engine world size ({world_size}) != Runtime world size ({self.world_size})')
# Load config into self
for key, value in self.config['pretrained_config'].items():
setattr(self, key, value)
self.quant_mode = QuantMode.from_quant_algo(
quant_algo=self.quantization['quant_algo'],
kv_cache_quant_algo=self.quantization['kv_cache_quant_algo'])
self.enable_fp8 = self.quant_mode.has_fp8_qdq()
self.fp8_kv_cache = self.quant_mode.has_fp8_kv_cache()
for key, value in self.config['build_config'].items():
setattr(self, key, value)
for key, value in self.plugin_config.items():
if "plugin" in key:
key = "use_" + key
setattr(self, key, value)
self.engine_name = f"rank{self.runtime_rank}.engine"
self.num_kv_heads = self.num_key_value_heads
self.num_layers = self.num_hidden_layers
self.num_heads = self.num_attention_heads
else:
# Read config from engine directory
config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
self.config = json.load(f)
# Sanity checks
config_dtype = self.config['builder_config']['precision']
assert dtype == config_dtype, f"Engine dtype ({config_dtype}) != Runtime dtype ({dtype})"
world_size = self.config['builder_config']['tensor_parallel']
assert world_size == self.world_size, \
(f'Engine world size ({world_size}) != Runtime world size ({self.world_size})')
# Load config into self
for key, value in self.config['builder_config'].items():
if key == "quant_mode":
self.quant_mode = QuantMode(value)
elif key in "name":
self.engine_model_name = value
else:
setattr(self, key, value)
self.enable_fp8 = self.quant_mode.has_fp8_qdq()
self.fp8_kv_cache = self.quant_mode.has_fp8_kv_cache()
for key, value in self.config['plugin_config'].items():
# Same effect as self.use_foo_plugin = config.json["foo_plugin"]
if "plugin" in key:
key = "use_" + key
setattr(self, key, value)
self.engine_name = get_engine_name(self.engine_model_name,
self.dtype, self.world_size,
self.runtime_rank)
self.runtime_mapping = tensorrt_llm.Mapping(world_size=self.world_size,
rank=self.runtime_rank,
tp_size=self.world_size)
torch.cuda.set_device(self.runtime_rank %
self.runtime_mapping.gpus_per_node)
self.csv_filename = "" # lazy init
def get_report_dict(self, benchmark_profiler=None):
report_fields = [
"engine_dir",
"world_size",
"num_heads",
"num_kv_heads",
"num_layers",
"hidden_size",
"vocab_size",
"precision",
"batch_size",
"gpu_weights_percent",
"input_length",
"output_length",
"gpu_peak_mem(gb)",
"build_time(s)",
"tokens_per_sec",
"percentile95(ms)",
"percentile99(ms)",
"latency(ms)",
"compute_cap",
]
report_dict = OrderedDict.fromkeys(report_fields)
report_dict["engine_dir"] = get_last_path_component(self.engine_dir)
report_dict["world_size"] = self.world_size
report_dict["precision"] = self.dtype
report_dict["quantization"] = str(self.quant_mode)
report_dict["compute_cap"] = "sm" + get_compute_cap()
return report_dict
def get_csv_filename(self):
if len(self.csv_filename) == 0:
self.csv_filename = get_csv_filename(get_last_path_component(
self.engine_dir),
self.dtype,
self.world_size,
fp8linear=int(self.enable_fp8))
return self.csv_filename
def print_report_header(self, csv=False, benchmark_profiler=None):
if csv and self.runtime_rank == 0:
report_dict = self.get_report_dict(benchmark_profiler)
line = ",".join(report_dict.keys())
print(line)
with open(self.get_csv_filename(), "a") as file:
file.write(line + "\n")
def get_config(self):
raise NotImplementedError
def prepare_inputs(self, config):
raise NotImplementedError
def run(self, inputs, config, benchmark_profiler=None):
raise NotImplementedError
def report(self, config, latency):
raise NotImplementedError
def set_weight_streaming(self, config):
raise NotImplementedError

View File

@@ -1,354 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import multiprocessing as mp
from time import time
import torch
def parse_arguments():
parser = argparse.ArgumentParser(
description='Benchmark TensorRT-LLM models.')
parser.add_argument('-m',
'--model',
type=str,
default="dec",
choices=["dec", "enc", "enc-dec"],
help='Specify type of the model you want to benchmark. '
'Choose model between dec/enc/enc-dec.')
parser.add_argument('--batch_size',
type=str,
default="8",
help=('Specify batch size(s) you want to benchmark. '
'Multiple batch sizes can be separated by \";\", '
'example: \"1;8;64\".'))
parser.add_argument(
'--input_len',
type=str,
default="128",
help=('Specify input length(s) you want to benchmark, '
'this option is mainly for BERT. '
'Multiple input lengths can be separated by \";\", '
'example: \"20;60;128\".'))
parser.add_argument(
'--input_output_len',
type=str,
default="128,20",
help=('Specify input-output length(s) you want to benchmark, '
'this option is mainly for GPT and GPT-like models. '
'Multiple input lengths can be separated by \";\", '
'example: \"60,20;128,20\".'))
parser.add_argument(
'--dtype',
type=str,
default='float16',
choices=['float16', 'bfloat16', 'float32'],
help='Choose data type between float16/bfloat16/float32.')
parser.add_argument('--num_beams',
type=int,
default="1",
help=('Specify number of beams you want to benchmark.'))
parser.add_argument('--top_k',
type=int,
default="1",
help=('Specify Top-K value of decoding.'))
parser.add_argument('--top_p',
type=float,
default="0",
help=('Specify Top-P value of decoding.'))
parser.add_argument(
'--input_timing_cache',
type=str,
default=None,
help=
'The path to read timing cache, will be ignored if the file does not exist'
)
parser.add_argument('--output_timing_cache',
type=str,
default='model.cache',
help='The path to write timing cache')
parser.add_argument(
'--log_level',
type=str,
default="error",
choices=['verbose', 'info', 'warning', 'error', 'internal_error'],
help=
'Choose log level between verbose/info/warning/error/internal_error.')
parser.add_argument(
'--warm_up',
type=int,
default=2,
help='Specify warm up iterations before benchmark starts.')
parser.add_argument(
'--num_runs',
type=int,
default=10,
help='Minimal number of iterations to run during benchmarking.')
parser.add_argument(
'--duration',
type=int,
default=60,
help='Minimal duration of iterations to measure in seconds.')
parser.add_argument(
'--engine_dir',
type=str,
default=None,
required=True,
help=
('If this option is specified, instead of building engines on-air before benchmarking, '
'the engines contained in the engine_dir will be used.'))
parser.add_argument(
'--gpu_weights_percent',
type=str,
default="1.0",
help='Specify the percentage of weights that reside on GPU (from 0 to 1).'
'Multiple percentages can be separated by \";\", '
'example: \"0;0.5;1\".')
parser.add_argument('--csv',
default=False,
action="store_true",
help='Output in CSV format.')
parser.add_argument('--enable_cuda_graph',
default=False,
action='store_true',
help='Execute GPT session with CUDA graph.')
parser.add_argument(
'--quantization',
type=str,
default=None,
choices=[
'fp8', 'fp8_gemm', 'fp8_kv_cache', 'int8_sq_per_tensor',
'int8_sq_per_token_channel', 'int8_weight_only', 'int4_weight_only',
'int4_weight_only_awq', 'int4_weight_only_gptq',
'int8_sq_per_channel_ootb'
],
help="Optimize the model with specified quantization recipe")
parser.add_argument(
'--dump_profile',
default=False,
action='store_true',
help="Print profile information per layer (default = disabled)")
parser.add_argument(
'--dump_layer_info',
default=False,
action='store_true',
help=
"Print layer information of the engine to console (default = disabled)")
return parser.parse_args()
def main(args):
# We import tensorrt_llm here because MPI is initialized when
# tensorrt_llm is imported, but mpi4py does not work well with
# the start method `spawn` of Python multiprocessing,
# so we set the start method first, then initialize MPI.
from benchmark_profiler import BenchmarkProfiler
from bert_benchmark import BERTBenchmark
from enc_dec_benchmark import EncDecBenchmark
from gpt_benchmark import GPTBenchmark
import tensorrt_llm
from tensorrt_llm.logger import logger
logger.set_level(args.log_level)
# Batch size
batch_size_options = args.batch_size.split(';')
batch_size_options = [int(i) for i in batch_size_options]
# Input length (for BERT-like models)
input_len_options = args.input_len.split(';')
input_len_options = [int(i) for i in input_len_options]
# Input-output length combination (for GPT-like models and enc_dec models)
in_out_len_options = args.input_output_len.split(';')
in_out_len_options = [[int(i) for i in io.split(',')]
for io in in_out_len_options]
# GPU weights percentage ratios
gpu_weights_percents = [
float(r) for r in args.gpu_weights_percent.split(";")
]
for percent in gpu_weights_percents:
if percent < 0 or percent > 1:
raise Exception(
f"--gpu_weights_percent only accepts values between 0.0 and 1.0."
)
rank = tensorrt_llm.mpi_rank()
world_size = tensorrt_llm.mpi_world_size()
# TODO: Re-enable memory monitor for multi-gpu benchmarks.
# Current Mem Monitor will cause benchmark script hang
# because MPI does not work well with multiprocessing.
disable_mem_monitor = world_size > 1
if not disable_mem_monitor:
from mem_monitor import MemoryMonitor
benchmark_profiler = None
if args.model == "dec":
benchmark_profiler = BenchmarkProfiler()
benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
gpu_weights_percents, rank, world_size)
elif args.model == "enc":
benchmarker = BERTBenchmark(args, batch_size_options, input_len_options,
gpu_weights_percents, rank, world_size)
elif args.model == "enc-dec":
benchmarker = EncDecBenchmark(args, batch_size_options,
in_out_len_options, gpu_weights_percents,
rank, world_size)
else:
raise Exception(f'Unexpected model: {args.model}')
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
benchmarker.print_report_header(args.csv,
benchmark_profiler=benchmark_profiler)
for config in benchmarker.get_config():
try:
# We pass in config instead of the gpu_weights_percent here to keep this benchmark script
# agnostic to the length and contents of the config.
benchmarker.set_weight_streaming(config)
inputs = benchmarker.prepare_inputs(config)
except torch.cuda.OutOfMemoryError as e:
logger.error(
f'Exception {e} caught while allocating memory; skipping {config}'
)
continue
torch.cuda.empty_cache()
latencies = []
# Disable Host memory monitor when cuda graph is enabled for cuda graph performance.
disable_host_mem_monitor = False
if args.enable_cuda_graph:
logger.warning(
'Disable host memory monitor when cuda graph is enabled.')
disable_host_mem_monitor = True
if not disable_mem_monitor:
memory_monitor = MemoryMonitor(
disable_host_mem_monitor=disable_host_mem_monitor)
memory_monitor.start()
iter_idx = 0
try:
# Warm up
for _ in range(args.warm_up):
benchmarker.run(inputs, config)
logger.info('Warm up done. Start benchmarking.')
if benchmark_profiler is not None:
benchmark_profiler.clean()
benchmark_profiler.start()
cur_duration = 0
start_time = time()
while iter_idx < args.num_runs or cur_duration < args.duration:
start.record()
benchmarker.run(inputs,
config,
benchmark_profiler=benchmark_profiler)
end.record()
torch.cuda.synchronize()
latencies.append(start.elapsed_time(end))
iter_idx += 1
cur_duration = round(time() - start_time, 3)
logger.info(
f'Benchmarking done. Iteration: {iter_idx}, duration: {cur_duration} sec.'
)
except Exception as e:
logger.error("Found exception during benchmarking",
e.with_traceback())
if not disable_mem_monitor:
memory_monitor.kill()
raise e
if not disable_mem_monitor:
memory_monitor.stop()
_, peak_gpu_used = memory_monitor.get_peak_memory_usage("GiB")
peak_gpu_used = round(peak_gpu_used, 3)
else:
peak_gpu_used = 0.0
if benchmark_profiler is not None:
benchmark_profiler.add_aux_info('iter_count', iter_idx)
benchmark_profiler.stop()
# Print latencies to make it easier to check perf stability.
if len(latencies) <= 20:
latencies_str = str(latencies)
else:
latencies_str = ("[" + ", ".join([str(l) for l in latencies[:10]]) +
"..." +
", ".join([str(l) for l in latencies[-10:]]) + "]")
logger.info(f"Latencies: {latencies_str}")
latency = round(sum(latencies) / iter_idx, 3)
latencies.sort()
percentile95 = round(latencies[int(iter_idx * 0.95)], 3)
percentile99 = round(latencies[int(iter_idx * 0.99)], 3)
benchmarker.report(config,
latency,
percentile95,
percentile99,
peak_gpu_used,
csv=args.csv,
benchmark_profiler=benchmark_profiler)
# Rerun for dumping profile per layer.
if args.dump_profile and benchmark_profiler is not None:
benchmark_profiler.set_recording_perf_profile(True)
logger.info(f'Dump profile information per layer')
iter_idx = 0
try:
# Warm up
for _ in range(args.warm_up):
benchmarker.run(inputs, config)
if benchmark_profiler is not None:
benchmark_profiler.clean()
benchmark_profiler.start()
cur_duration = 0
start_time = time()
while iter_idx < args.num_runs or cur_duration < args.duration:
start.record()
benchmarker.run(inputs,
config,
benchmark_profiler=benchmark_profiler)
end.record()
torch.cuda.synchronize()
latencies.append(start.elapsed_time(end))
iter_idx += 1
cur_duration = round(time() - start_time, 3)
benchmarker.report_profiler(
benchmark_profiler=benchmark_profiler)
except Exception as e:
logger.error("Found exception during benchmarking",
e.with_traceback())
if not disable_mem_monitor:
memory_monitor.kill()
raise e
if __name__ == '__main__':
mp.set_start_method('spawn')
args = parse_arguments()
main(args)
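The comment at the top of `main` captures the one ordering constraint in this script: importing `tensorrt_llm` initializes MPI (via mpi4py), so the multiprocessing start method must be set before that first import. A stripped-down sketch of the same pattern, assuming `tensorrt_llm` is installed in the environment:
```
import multiprocessing as mp


def main():
    # Deferred import: MPI is only initialized here, after the 'spawn' start
    # method is already in place, mirroring benchmark.py above.
    import tensorrt_llm
    print(f"rank {tensorrt_llm.mpi_rank()} of {tensorrt_llm.mpi_world_size()}")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # must happen before the first MPI-initializing import
    main()
```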

View File

@@ -1,82 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
class BenchmarkProfiler(object):
cuda_event_dict: dict
timer_dict: dict
aux_info: dict
started: bool
is_recording_perf_profile: bool
def __init__(self):
self.cuda_event_dict = {}
self.timer_dict = {}
self.aux_info = {}
self.started = False
self.is_recording_perf_profile = False
def clean(self):
self.cuda_event_dict = {}
self.timer_dict = {}
self.aux_info = {}
def start(self):
self.started = True
def stop(self):
self.started = False
def get_cuda_event(self, name: str):
if name not in self.cuda_event_dict.keys():
event = torch.cuda.Event(enable_timing=True)
self.cuda_event_dict[name] = event
return self.cuda_event_dict[name]
def record_cuda_event(self, name: str):
if not self.started:
return
event = self.get_cuda_event(name)
event.record()
def get_timer_value(self, timer_name: str):
# timer is in milliseconds
return self.timer_dict[timer_name]
def record_elapsed_time(self, start_event_name: str, end_event_name: str,
timer_name: str):
if timer_name not in self.timer_dict.keys():
self.timer_dict[timer_name] = 0.0
if not self.started:
return
self.get_cuda_event(start_event_name).synchronize()
self.get_cuda_event(end_event_name).synchronize()
self.timer_dict[timer_name] += self.get_cuda_event(
start_event_name).elapsed_time(self.get_cuda_event(end_event_name))
def get_aux_info(self, aux_name):
return self.aux_info[aux_name]
def add_aux_info(self, aux_name: str, add_value):
if aux_name not in self.aux_info.keys():
self.aux_info[aux_name] = 0
if not self.started:
return
self.aux_info[aux_name] += add_value
def set_recording_perf_profile(self, value: bool):
self.is_recording_perf_profile = value

View File

@@ -1,137 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
# isort: off
import torch
import tensorrt as trt
#isort: on
from base_benchmark import BaseBenchmark
import tensorrt_llm
from tensorrt_llm._utils import trt_dtype_to_torch
from tensorrt_llm.runtime import TensorInfo
class BERTBenchmark(BaseBenchmark):
def __init__(self, args, batch_sizes, in_lens, gpu_weights_percents, rank,
world_size):
super().__init__(args.engine_dir, args.model, args.dtype, rank,
world_size)
self.batch_sizes = batch_sizes
self.in_lens = in_lens
self.build_time = 0
self.gpu_weights_percents = gpu_weights_percents
# Deserialize engine from engine directory
self.serialize_path = os.path.join(args.engine_dir, self.engine_name)
with open(self.serialize_path, 'rb') as f:
engine_buffer = f.read()
assert engine_buffer is not None
self.session = tensorrt_llm.runtime.Session.from_serialized_engine(
engine_buffer)
# Print context memory size for CI/CD to track.
context_mem_size = self.session.context_mem_size
print(
f"Allocated {context_mem_size / 1048576.0:.2f} MiB for execution context memory."
)
def get_config(self):
for inlen in self.in_lens:
if inlen > self.max_input_len:
continue
for batch_size in self.batch_sizes:
if batch_size > self.max_batch_size:
continue
for gpu_weights_percent in self.gpu_weights_percents:
yield (batch_size, inlen, gpu_weights_percent)
def set_weight_streaming(self, config):
gpu_weights_percent = config[2]
self.session._set_weight_streaming(gpu_weights_percent)
def prepare_inputs(self, config):
batch_size, inlen = config[0], config[1]
input_ids = torch.randint(100, (batch_size, inlen)).int().cuda()
input_lengths = inlen * torch.ones(
(batch_size, ), dtype=torch.int32, device='cuda')
inputs = {'input_ids': input_ids, 'input_lengths': input_lengths}
output_info = self.session.infer_shapes([
TensorInfo('input_ids', trt.DataType.INT32, input_ids.shape),
TensorInfo('input_lengths', trt.DataType.INT32, input_lengths.shape)
])
outputs = {
t.name:
torch.empty(tuple(t.shape),
dtype=trt_dtype_to_torch(t.dtype),
device='cuda')
for t in output_info
}
stream = torch.cuda.current_stream().cuda_stream
return (inputs, outputs, stream)
def run(self, inputs, config, benchmark_profiler=None):
ok = self.session.run(*inputs)
assert ok, "Runtime execution failed"
torch.cuda.synchronize()
def report(self, config, latency, percentile95, percentile99,
peak_gpu_used):
if self.runtime_rank == 0:
line = '[BENCHMARK] ' + (
f'model_name {self.model_name} world_size {self.world_size} precision {self.dtype} '
f'batch_size {config[0]} input_length {config[1]} gpu_peak_mem(gb) {peak_gpu_used} '
f'build_time(s) {self.build_time} percentile95(ms) {percentile95} '
f'percentile99(ms) {percentile99} latency(ms) {latency}')
print(line)
def report(self,
config,
latency,
percentile95,
percentile99,
peak_gpu_used,
csv,
benchmark_profiler=None):
report_dict = super().get_report_dict()
batch_size, inlen = config[0], config[1]
report_dict["num_heads"] = self.num_heads
report_dict["num_kv_heads"] = self.num_heads
report_dict["num_layers"] = self.num_layers
report_dict["hidden_size"] = self.hidden_size
report_dict["vocab_size"] = self.vocab_size
report_dict["batch_size"] = batch_size
report_dict["input_length"] = inlen
report_dict["output_length"] = "n/a"
report_dict["gpu_weights_percent"] = config[2]
report_dict["latency(ms)"] = latency
report_dict["build_time(s)"] = self.build_time
report_dict["tokens_per_sec"] = "n/a"
report_dict["percentile95(ms)"] = percentile95
report_dict["percentile99(ms)"] = percentile99
report_dict["gpu_peak_mem(gb)"] = peak_gpu_used
if self.runtime_rank == 0:
if csv:
line = ",".join([str(v) for v in report_dict.values()])
print(line)
with open(self.get_csv_filename(), "a") as file:
file.write(line + "\n")
else:
kv_pairs = [f"{k} {v}" for k, v in report_dict.items()]
line = '[BENCHMARK] ' + " ".join(kv_pairs)
print(line)

View File

@@ -1,174 +0,0 @@
import json
import os
from enum import Enum
import evaluate
import nltk
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, LlamaTokenizerFast
nltk.download("punkt", quiet=False)
nltk.download('punkt_tab')
import argparse
class Model(Enum):
Llama_v2_70B = 1
GPT_J = 2
ACCURACY_TARGETS = {
Model.Llama_v2_70B: {
"rouge1": 44.4312 * 0.999,
"rouge2": 22.0352 * 0.999,
"rougeL": 28.6162 * 0.999,
"tokens_per_sample": 294.45 * 0.9
},
Model.GPT_J: {
"rouge1": 42.9865 * 0.99,
"rouge2": 20.1235 * 0.99,
"rougeL": 29.9881 * 0.99,
}
}
def get_reference_df(processed_dataset_file):
data = pd.read_pickle(processed_dataset_file)
return data["output"].tolist()
def get_reference_json(cnn_dailymail_valset):
# Load from CNN dailymail
with open(cnn_dailymail_valset, 'r') as fh:
list_data_dict = json.load(fh)
targets = [f"{example['output']}" for example in list_data_dict]
print(f"Loaded {len(targets)} samples from {cnn_dailymail_valset}")
return targets
def get_responses_json(response_file):
f = open(response_file)
responses = json.load(f)
ordered_responses = sorted(responses, key=lambda x: int(x['response_id']))
return ordered_responses
def postprocess_text(preds, targets):
# Post-process output texts for ROUGE evaluation
preds = [pred.strip() for pred in preds]
targets = [target.strip() for target in targets]
# rougeLSum expects newline after each sentence
preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
targets = ["\n".join(nltk.sent_tokenize(target)) for target in targets]
return preds, targets
def strip_eos(pred_toks, eos_id):
while len(pred_toks) > 0 and pred_toks[-1] == eos_id:
pred_toks.pop()
if len(pred_toks) == 0:
raise RuntimeError("Empty output sequence detected with EOS")
return pred_toks
def calculate_toks_per_sample(preds, eos_id):
preds = [strip_eos(pred, eos_id) for pred in preds]
avg_len = sum(len(pred) for pred in preds)
num_samples = len(preds)
return avg_len / num_samples
def calculate_rouge_score(preds, targets, rouge_dir=None):
print("Calculating ROUGE scores...")
rouge_dir = rouge_dir if rouge_dir and os.path.exists(
rouge_dir) else "rouge"
metric = evaluate.load(rouge_dir)
preds, targets = postprocess_text(preds, targets[0:len(preds)])
result = metric.compute(predictions=preds,
references=targets,
use_stemmer=True,
use_aggregator=False)
result = {k: round(np.mean(v) * 100, 4) for k, v in result.items()}
prediction_lens = [len(pred) for pred in preds]
result["gen_len"] = np.sum(prediction_lens)
result["gen_num"] = len(preds)
return result
def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument(
"--dataset",
type=str,
help=
"Path to the reference dataset against which the responses are evaluated for accuracy. MLPerf uses open-orca (pkl) and cnn-dailymail (np) for Llama2-70B and GPT-J respectively."
)
parser.add_argument(
"--responses",
type=str,
help="Path to the json file holding the responses from our benchmark run"
)
parser.add_argument("--base_model",
type=str,
help="Location of the model used (to create tokenizer)")
parser.add_argument(
'--rouge_dir',
default=None,
type=str,
help=
"evaluate.load('rouge') will attempt to pull rouge package from HF. Use cached rouge can avoid network outage of host or HF."
)
args = parser.parse_args()
return args
def main():
args = parse_arguments()
if args.dataset.lower().endswith(".pkl"):
target_texts = get_reference_df(args.dataset)
model = Model.Llama_v2_70B
tokenizer = LlamaTokenizerFast.from_pretrained(args.base_model)
elif args.dataset.lower().endswith(".json"):
target_texts = get_reference_json(args.dataset)
model = Model.GPT_J
tokenizer = AutoTokenizer.from_pretrained(args.base_model,
model_max_length=2047,
padding_side="left",
use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
else:
raise RuntimeError(
"Dataset expected to be pkl (open-orca) or json (cnn-dailymail)")
pred_out = get_responses_json(args.responses)
pred_toks = [x['response_tokens'] for x in pred_out]
tps_score = calculate_toks_per_sample(pred_toks, tokenizer.eos_token)
pred_texts = tokenizer.batch_decode(pred_toks, skip_special_tokens=True)
achieved_scores = calculate_rouge_score(pred_texts, target_texts,
args.rouge_dir)
achieved_scores['tokens_per_sample'] = tps_score
targets = ACCURACY_TARGETS[model]
print("Achieved rouge scores: ", achieved_scores)
print("Tokens per sample: ", tps_score)
print("Targets: ", targets)
for k, _ in targets.items():
assert targets[k] <= achieved_scores[k]
if __name__ == "__main__":
main()

View File

@@ -1,456 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
# isort: off
import torch
#isort: on
from base_benchmark import BaseBenchmark
import tensorrt_llm
from tensorrt_llm._utils import (trt_dtype_to_torch, str_dtype_to_trt)
from tensorrt_llm.quantization import QuantMode
from tensorrt_llm.runtime.session import TensorInfo
from tensorrt_llm.runtime import ModelConfig
from tensorrt_llm.models.modeling_utils import get_kv_cache_type_from_legacy
class EncDecBenchmark(BaseBenchmark):
def __init__(self, args, batch_sizes, in_out_lens, gpu_weights_percents,
rank, world_size):
self.engine_dir = args.engine_dir
self.model_name = args.model
self.enable_fp8 = False # hardcode for enc-dec models
self.dtype = args.dtype
self.runtime_rank = rank
self.world_size = world_size
self.csv_filename = "" # lazy init
self.batch_sizes = batch_sizes
self.in_out_lens = in_out_lens
self.num_beams = args.num_beams
self.build_time = 0
self.quant_mode = QuantMode(0)
# In current implementation, encoder and decoder have the same name,
# builder config, and plugin config. But they can be different in the future.
# So we use separate variables for encoder and decoder here.
self.encoder_engine_model_name = args.model
self.decoder_engine_model_name = args.model
self.gpu_weights_percents = gpu_weights_percents
# only for whisper parameter
self.n_mels = 0
if self.engine_dir is not None:
def read_config(component):
# almost same as enc_dec_model_runner.py::read_config()
config_path = os.path.join(self.engine_dir, component,
"config.json")
with open(config_path, "r") as f:
config = json.load(f)
builder_config = config['build_config']
plugin_config = builder_config['plugin_config']
pretrained_config = config['pretrained_config']
lora_config = builder_config['lora_config']
auto_parallel_config = builder_config['auto_parallel_config']
use_gpt_attention_plugin = plugin_config["gpt_attention_plugin"]
gemm_allreduce_plugin = plugin_config["gemm_allreduce_plugin"]
remove_input_padding = plugin_config["remove_input_padding"]
use_lora_plugin = plugin_config["lora_plugin"]
tp_size = pretrained_config['mapping']['tp_size']
pp_size = pretrained_config['mapping']['pp_size']
auto_parallel_config['gpus_per_node']
world_size = tp_size * pp_size
assert world_size == tensorrt_llm.mpi_world_size(), \
f'Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})'
num_heads = pretrained_config["num_attention_heads"]
hidden_size = pretrained_config["hidden_size"]
head_size = pretrained_config["head_size"]
vocab_size = pretrained_config["vocab_size"]
max_batch_size = builder_config["max_batch_size"]
max_beam_width = builder_config["max_beam_width"]
num_layers = pretrained_config["num_hidden_layers"]
num_kv_heads = pretrained_config.get('num_kv_heads', num_heads)
assert (num_heads % tp_size) == 0
num_heads = num_heads // tp_size
hidden_size = hidden_size // tp_size
num_kv_heads = (num_kv_heads + tp_size - 1) // tp_size
cross_attention = pretrained_config[
"architecture"] == "DecoderModel"
skip_cross_kv = pretrained_config.get('skip_cross_kv', False)
has_position_embedding = pretrained_config[
"has_position_embedding"]
has_token_type_embedding = hasattr(pretrained_config,
"type_vocab_size")
dtype = pretrained_config["dtype"]
paged_kv_cache = plugin_config['paged_kv_cache']
kv_cache_type = get_kv_cache_type_from_legacy(
True, paged_kv_cache)
tokens_per_block = plugin_config['tokens_per_block']
gather_context_logits = builder_config.get(
'gather_context_logits', False)
gather_generation_logits = builder_config.get(
'gather_generation_logits', False)
max_prompt_embedding_table_size = builder_config.get(
'max_prompt_embedding_table_size', 0)
model_config = ModelConfig(
num_heads=num_heads,
num_kv_heads=num_kv_heads,
hidden_size=hidden_size,
head_size=head_size,
max_batch_size=max_batch_size,
max_beam_width=max_beam_width,
vocab_size=vocab_size,
num_layers=num_layers,
gpt_attention_plugin=use_gpt_attention_plugin,
gemm_allreduce_plugin=gemm_allreduce_plugin,
remove_input_padding=remove_input_padding,
kv_cache_type=kv_cache_type,
tokens_per_block=tokens_per_block,
cross_attention=cross_attention,
has_position_embedding=has_position_embedding,
has_token_type_embedding=has_token_type_embedding,
dtype=dtype,
gather_context_logits=gather_context_logits,
gather_generation_logits=gather_generation_logits,
max_prompt_embedding_table_size=
max_prompt_embedding_table_size,
lora_plugin=use_lora_plugin,
lora_target_modules=lora_config.get('lora_target_modules'),
trtllm_modules_to_hf_modules=lora_config.get(
'trtllm_modules_to_hf_modules'),
skip_cross_kv=skip_cross_kv,
)
# additional info for benchmark
self.max_batch_size = config["build_config"]["max_batch_size"]
self.max_input_len = config["build_config"][
"max_encoder_input_len"]
self.max_seq_len = config["build_config"]["max_seq_len"]
if component == "decoder":
self.decoder_start_token_id = pretrained_config[
'decoder_start_token_id']
return model_config
self.encoder_model_config = read_config("encoder")
self.decoder_model_config = read_config("decoder")
self.encoder_engine_name = 'rank{}.engine'.format(self.runtime_rank)
self.decoder_engine_name = 'rank{}.engine'.format(self.runtime_rank)
self.encoder_runtime_mapping = tensorrt_llm.Mapping(
world_size=self.world_size,
rank=self.runtime_rank,
tp_size=self.world_size,
)
self.decoder_runtime_mapping = tensorrt_llm.Mapping(
world_size=self.world_size,
rank=self.runtime_rank,
tp_size=self.world_size,
)
torch.cuda.set_device(self.runtime_rank %
self.encoder_runtime_mapping.gpus_per_node)
self.device = torch.cuda.current_device()
# Deserialize engine from engine directory
self.encoder_serialize_path = os.path.join(self.engine_dir, "encoder",
self.encoder_engine_name)
with open(self.encoder_serialize_path, "rb") as f:
encoder_engine_buffer = f.read()
assert encoder_engine_buffer is not None
self.decoder_serialize_path = os.path.join(self.engine_dir, "decoder",
self.decoder_engine_name)
with open(self.decoder_serialize_path, "rb") as f:
decoder_engine_buffer = f.read()
assert decoder_engine_buffer is not None
# session setup
self.encoder_session = tensorrt_llm.runtime.Session.from_serialized_engine(
encoder_engine_buffer)
self.decoder_session = tensorrt_llm.runtime.GenerationSession(
self.decoder_model_config, decoder_engine_buffer,
self.decoder_runtime_mapping)
# Print context memory size for CI/CD to track.
context_mem_size = self.encoder_session.context_mem_size + self.decoder_session.context_mem_size
print(
f"Allocated {context_mem_size / 1048576.0:.2f} MiB for execution context memory."
)
def get_config(self):
if 'whisper' in self.model_name:
print(
f"[WARNING] whisper benchmark is input_len=1500, no text prompt, output_len=arbitrary"
)
for inlen, outlen in self.in_out_lens:
if (inlen > self.max_input_len or outlen > self.max_seq_len):
print(
f"[WARNING] check inlen({inlen}) <= max_inlen({self.max_input_len}) and "
f"outlen({outlen}) <= max_seqlen({self.max_seq_len}) failed, skipping."
)
continue
for batch_size in self.batch_sizes:
if batch_size > self.max_batch_size:
print(
f"[WARNING] check batch_size({batch_size}) "
f"<= max_batch_size({self.max_batch_size}) failed, skipping."
)
continue
for gpu_weights_percent in self.gpu_weights_percents:
yield (batch_size, inlen, outlen, gpu_weights_percent)
def set_weight_streaming(self, config):
gpu_weights_percent = config[3]
self.encoder_session._set_weight_streaming(gpu_weights_percent)
self.decoder_session.runtime._set_weight_streaming(gpu_weights_percent)
def prepare_inputs(self, config):
batch_size, encoder_input_len, output_len = config[0], config[
1], config[2]
attention_mask = None
whisper_decoder_encoder_input_lengths = None
outputs = {}
if 'whisper' in self.model_name:
# feature_len always fixed 3000 now
feature_len = 3000
encoder_input_ids = (torch.randint(
1, 100, (batch_size, self.n_mels, feature_len)).int().cuda())
encoder_input_lengths = torch.tensor([
encoder_input_ids.shape[2] // 2
for _ in range(encoder_input_ids.shape[0])
],
dtype=torch.int32,
device=self.device)
decoder_input_ids = (torch.randint(1, 100, (1, )).int().cuda())
decoder_input_ids = decoder_input_ids.repeat(
(encoder_input_ids.shape[0], 1))
output_list = [
TensorInfo('input_features', str_dtype_to_trt(self.dtype),
encoder_input_ids.shape),
TensorInfo('input_lengths', str_dtype_to_trt('int32'),
encoder_input_lengths.shape)
]
output_info = (self.encoder_session).infer_shapes(output_list)
outputs = {
t.name:
torch.empty(tuple(t.shape),
dtype=trt_dtype_to_torch(t.dtype),
device='cuda')
for t in output_info
}
whisper_decoder_encoder_input_lengths = torch.tensor(
[
outputs['encoder_output'].shape[1]
for x in range(outputs['encoder_output'].shape[0])
],
dtype=torch.int32,
device='cuda')
decoder_input_lengths = torch.tensor([
decoder_input_ids.shape[-1]
for _ in range(decoder_input_ids.shape[0])
],
dtype=torch.int32,
device='cuda')
cross_attention_mask = torch.ones([
outputs['encoder_output'].shape[0],
decoder_input_lengths.max() + output_len,
outputs['encoder_output'].shape[1]
]).int().cuda()
else:
encoder_input_ids = (torch.randint(
100, (batch_size, encoder_input_len)).int().cuda())
decoder_input_ids = torch.IntTensor([[self.decoder_start_token_id]
]).to(self.device)
decoder_input_ids = decoder_input_ids.repeat((batch_size, 1))
encoder_input_lengths = torch.tensor([encoder_input_len] *
batch_size,
dtype=torch.int32,
device=self.device)
decoder_input_lengths = torch.tensor([1] * batch_size,
dtype=torch.int32,
device=self.device)
if self.encoder_model_config.remove_input_padding:
encoder_input_ids = torch.flatten(encoder_input_ids)
decoder_input_ids = torch.flatten(decoder_input_ids)
# attention mask, always set 1 as if all are valid tokens
attention_mask = torch.ones(
(batch_size, encoder_input_len)).int().cuda()
# cross attention mask, always set 1 as if all are valid tokens
# [batch_size, query_len, encoder_input_len] currently, use query_len=1
cross_attention_mask = [
torch.ones(decoder_input_lengths.max() + output_len,
encoder_input_len).int().cuda()
for _ in range(batch_size)
]
hidden_size = (self.encoder_model_config.hidden_size *
self.world_size) # tp_size
hidden_states_shape = (
encoder_input_ids.shape[0],
hidden_size,
) if self.encoder_model_config.remove_input_padding else (
encoder_input_ids.shape[0],
encoder_input_ids.shape[1],
hidden_size,
)
hidden_states_dtype = lambda name: trt_dtype_to_torch(
self.encoder_session.engine.get_tensor_dtype(name))
outputs["encoder_output"] = torch.empty(
hidden_states_shape,
dtype=hidden_states_dtype("encoder_output"),
device=self.device,
).contiguous()
stream = torch.cuda.current_stream().cuda_stream
return (
encoder_input_ids,
encoder_input_lengths,
attention_mask,
decoder_input_ids,
decoder_input_lengths,
cross_attention_mask,
whisper_decoder_encoder_input_lengths,
outputs,
stream,
)
def run(self, inputs, config, benchmark_profiler=None):
output_len = config[2]
(
encoder_input_ids,
encoder_input_lengths,
attention_mask,
decoder_input_ids,
decoder_input_lengths,
cross_attention_mask,
whisper_decoder_encoder_input_lengths,
outputs,
stream,
) = inputs
hidden_states_dtype = lambda name: trt_dtype_to_torch(
self.encoder_session.engine.get_tensor_dtype(name))
# input tensors
inputs = {}
if 'whisper' in self.model_name:
inputs['input_features'] = encoder_input_ids.contiguous()
inputs["input_lengths"] = encoder_input_lengths
else:
inputs["input_ids"] = encoder_input_ids.contiguous()
inputs["input_lengths"] = encoder_input_lengths
inputs["max_input_length"] = torch.empty(
(self.max_input_len, ),
dtype=hidden_states_dtype("max_input_length"),
device=self.device,
).contiguous()
if not self.encoder_model_config.gpt_attention_plugin:
inputs["attention_mask"] = attention_mask.contiguous()
if self.encoder_model_config.has_position_embedding:
bsz, seq_len = encoder_input_ids.shape[:2]
position_ids = torch.arange(
seq_len, dtype=torch.int32,
device=encoder_input_ids.device).expand(bsz, -1)
inputs['position_ids'] = position_ids.contiguous()
# run encoder
self.encoder_session.set_shapes(inputs)
ok = self.encoder_session.run(inputs, outputs, stream)
assert ok, "Runtime execution failed"
torch.cuda.synchronize()
# run decoder
sampling_config = tensorrt_llm.runtime.SamplingConfig(
end_id=1, pad_id=0, num_beams=self.num_beams, min_length=output_len)
encoder_output = outputs["encoder_output"]
encoder_max_input_length = encoder_output.shape[
1] if 'whisper' in self.model_name else torch.max(
encoder_input_lengths).item()
self.decoder_session.setup(
decoder_input_lengths.size(0),
torch.max(decoder_input_lengths).item(),
output_len,
beam_width=self.num_beams,
max_attention_window_size=None,
encoder_max_input_length=encoder_max_input_length,
)
self.decoder_session.decode(
decoder_input_ids,
decoder_input_lengths,
sampling_config,
encoder_output=encoder_output,
encoder_input_lengths=whisper_decoder_encoder_input_lengths
if 'whisper' in self.model_name else encoder_input_lengths,
cross_attention_mask=cross_attention_mask,
)
def report(self,
config,
latency,
percentile95,
percentile99,
peak_gpu_used,
csv,
benchmark_profiler=None):
# Note: Theoretically, the encoder and decoder can have different configs.
# But for current implementation, we assume they are the same. In the future,
# we can have a special structure of report_dict for enc-dec models.
report_dict = super().get_report_dict()
batch_size, encoder_input_len, output_len = config[0], config[
1], config[2]
tokens_per_sec = round(batch_size * output_len / (latency / 1000), 2)
report_dict["num_heads"] = self.encoder_model_config.num_heads
report_dict["num_kv_heads"] = self.encoder_model_config.num_kv_heads
report_dict["num_layers"] = self.encoder_model_config.num_layers
report_dict["hidden_size"] = self.encoder_model_config.hidden_size
report_dict["vocab_size"] = self.encoder_model_config.vocab_size
report_dict["batch_size"] = batch_size
report_dict["input_length"] = encoder_input_len
report_dict["output_length"] = output_len
report_dict["gpu_weights_percent"] = config[3]
report_dict["latency(ms)"] = latency
report_dict["build_time(s)"] = self.build_time
report_dict["tokens_per_sec"] = tokens_per_sec
report_dict["percentile95(ms)"] = percentile95
report_dict["percentile99(ms)"] = percentile99
report_dict["gpu_peak_mem(gb)"] = peak_gpu_used
if self.runtime_rank == 0:
if csv:
line = ",".join([str(v) for v in report_dict.values()])
print(line)
with open(self.get_csv_filename(), "a") as file:
file.write(line + "\n")
else:
kv_pairs = [f"{k} {v}" for k, v in report_dict.items()]
line = "[BENCHMARK] " + " ".join(kv_pairs)
print(line)

View File

@@ -1,291 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from math import ceil
import pandas as pd
import tensorrt as trt
import torch
import tensorrt_llm
from tensorrt_llm.bindings import KVCacheType
from tensorrt_llm.builder import Engine
from tensorrt_llm.runtime import (ChatGLMGenerationSession, GenerationSession,
SamplingConfig)
from base_benchmark import BaseBenchmark # isort:skip
def element_size(dtype: str):
str_to_size_in_bytes = dict(float16=2,
float32=4,
int64=8,
int32=4,
int8=1,
bool=1,
bfloat16=2,
fp8=1)
return str_to_size_in_bytes[dtype]
class GPTBenchmark(BaseBenchmark):
def __init__(self, args, batch_sizes, in_out_lens, gpu_weights_percents,
rank, world_size):
super().__init__(args.engine_dir, args.model, args.dtype, rank,
world_size)
self.batch_sizes = batch_sizes
self.in_out_lens = in_out_lens
self.gpu_weights_percents = gpu_weights_percents
self.num_beams = args.num_beams
self.cuda_graph_mode = args.enable_cuda_graph
self.dump_layer_info = args.dump_layer_info
# Get build configs from engine directory is done in base class
# Deserialize engine from engine directory
engine = Engine.from_dir(args.engine_dir, rank)
engine_buffer = engine.engine
assert engine_buffer is not None
pretrained_config = engine.config.pretrained_config
if pretrained_config.architecture == 'ChatGLMForCausalLM' and pretrained_config.chatglm_version in [
'glm', 'chatglm'
]:
session_cls = ChatGLMGenerationSession
else:
session_cls = GenerationSession
if not hasattr(self, 'num_kv_heads') or self.num_kv_heads is None:
self.num_kv_heads = self.num_heads
rnn_config_items = [
'conv_kernel', 'layer_types', 'rnn_hidden_size', 'state_size',
'state_dtype', 'rnn_head_size', 'rnn_conv_dim_size'
]
rnn_configs_kwargs = {}
for item in rnn_config_items:
if hasattr(self, item):
rnn_configs_kwargs[item] = getattr(self, item)
kv_cache_type = KVCacheType.CONTINUOUS
if hasattr(self, 'kv_cache_type'):
kv_cache_type = KVCacheType(self.kv_cache_type)
else:
if hasattr(self, 'paged_kv_cache'):
kv_cache_type = KVCacheType.PAGED if self.paged_kv_cache == True else KVCacheType.CONTINUOUS
model_config = tensorrt_llm.runtime.ModelConfig(
max_batch_size=self.max_batch_size,
max_beam_width=self.num_beams,
vocab_size=self.vocab_size,
num_layers=self.num_layers,
num_heads=self.num_heads // self.world_size,
num_kv_heads=ceil(self.num_kv_heads / self.world_size),
hidden_size=self.hidden_size // self.world_size,
gpt_attention_plugin=self.use_gpt_attention_plugin,
kv_cache_type=kv_cache_type,
paged_state=self.paged_state
if hasattr(self, 'paged_state') else False,
dtype=self.dtype,
remove_input_padding=self.remove_input_padding,
quant_mode=self.quant_mode,
tokens_per_block=self.tokens_per_block if hasattr(
self, 'tokens_per_block') else 32,
mamba_conv1d_plugin=self.use_mamba_conv1d_plugin,
gpu_weights_percent=list(sorted(gpu_weights_percents))[0],
**rnn_configs_kwargs,
)
self.sampling_config = SamplingConfig(end_id=2, pad_id=0)
self.decoder = session_cls(model_config,
engine_buffer,
self.runtime_mapping,
cuda_graph_mode=self.cuda_graph_mode)
# Print context memory size for CI/CD to track.
context_mem_size = self.decoder.context_mem_size
print(
f"Allocated {context_mem_size / 1048576.0:.2f} MiB for execution context memory."
)
def get_config(self):
for inlen, outlen in self.in_out_lens:
if inlen > self.max_input_len or inlen + outlen > self.max_seq_len:
print(
f'[WARNING] check inlen({inlen}) <= max_inlen({self.max_input_len}) or '
f'seqlen({inlen + outlen}) <= max_seq_len({self.max_seq_len}) failed, skipping.'
)
continue
for batch_size in self.batch_sizes:
if batch_size > self.max_batch_size:
print(
f'[WARNING] check batch_size({batch_size}) '
f'<= max_batch_size({self.max_batch_size}) failed, skipping.'
)
continue
for gpu_weights_percent in self.gpu_weights_percents:
yield (batch_size, inlen, outlen, gpu_weights_percent)
def set_weight_streaming(self, config):
gpu_weights_percent = config[3]
self.decoder.runtime._set_weight_streaming(gpu_weights_percent)
def prepare_inputs(self, config):
batch_size, inlen, outlen = config[0], config[1], config[2]
input_ids = torch.randint(100, (batch_size, inlen)).int().cuda()
input_lengths = torch.tensor([inlen
for _ in range(batch_size)]).int().cuda()
self.decoder.setup(batch_size, inlen, outlen, beam_width=self.num_beams)
return (input_ids, input_lengths)
def get_report_dict(self, benchmark_profiler=None):
report_dict = super().get_report_dict(
benchmark_profiler=benchmark_profiler)
if benchmark_profiler is not None:
report_dict["generation_time(ms)"] = None
report_dict["total_generated_tokens"] = None
report_dict["generation_tokens_per_second"] = None
return report_dict
def run(self, inputs, config, benchmark_profiler=None):
batch_size, inlen, outlen = config[0], config[1], config[2]
self.decoder.setup(batch_size, inlen, outlen, beam_width=self.num_beams)
if self.remove_input_padding:
self.decoder.decode_batch(inputs[0],
self.sampling_config,
benchmark_profiler=benchmark_profiler)
else:
self.decoder.decode(inputs[0],
inputs[1],
self.sampling_config,
benchmark_profiler=benchmark_profiler)
torch.cuda.synchronize()
def report(self,
config,
latency,
percentile95,
percentile99,
peak_gpu_used,
csv,
benchmark_profiler=None):
report_dict = super().get_report_dict()
batch_size, inlen, outlen, gpu_weights_percent = config[0], config[
1], config[2], config[3]
tokens_per_sec = round(batch_size * outlen / (latency / 1000), 2)
report_dict["num_heads"] = self.num_heads
report_dict["num_kv_heads"] = self.num_kv_heads
report_dict["num_layers"] = self.num_layers
report_dict["hidden_size"] = self.hidden_size
report_dict["vocab_size"] = self.vocab_size
report_dict["batch_size"] = batch_size
report_dict["gpu_weights_percent"] = gpu_weights_percent
report_dict["input_length"] = inlen
report_dict["output_length"] = outlen
report_dict["latency(ms)"] = latency
report_dict["tokens_per_sec"] = tokens_per_sec
report_dict["percentile95(ms)"] = percentile95
report_dict["percentile99(ms)"] = percentile99
report_dict["gpu_peak_mem(gb)"] = peak_gpu_used
if benchmark_profiler is not None:
iter_count = benchmark_profiler.get_aux_info('iter_count')
generation_time_ms = benchmark_profiler.get_timer_value(
'generation_time')
generation_step_count = benchmark_profiler.get_aux_info(
'generation_step_count')
token_per_step = batch_size * self.num_beams
total_tokens = generation_step_count * token_per_step
report_dict["generation_time(ms)"] = round(
generation_time_ms / iter_count, 3)
report_dict["total_generated_tokens"] = total_tokens / iter_count
tokens_per_second = round(
total_tokens * 1000.0 / generation_time_ms, 3)
report_dict["generation_tokens_per_second"] = tokens_per_second
if self.runtime_rank == 0:
if csv:
line = ",".join([str(v) for v in report_dict.values()])
print(line)
with open(self.get_csv_filename(), "a") as file:
file.write(line + "\n")
else:
kv_pairs = [f"{k} {v}" for k, v in report_dict.items()]
line = '[BENCHMARK] ' + " ".join(kv_pairs)
print(line)
if self.dump_layer_info:
engine_inspector = self.decoder.engine_inspector
inspector_result = engine_inspector.get_engine_information(
trt.LayerInformationFormat.JSON)
json_result = json.loads(inspector_result)
layers = json_result["Layers"]
for layer_idx, _ in enumerate(layers):
layer_info = engine_inspector.get_layer_information(
layer_idx, trt.LayerInformationFormat.ONELINE)
print(layer_info)
def report_profiler(self, benchmark_profiler=None):
if benchmark_profiler is not None and benchmark_profiler.is_recording_perf_profile:
perf_profile_data = self.decoder.profiler.results
if not perf_profile_data:
tensorrt_llm.logger.error("profiler data is empty")
return
ctx_layers = list()
generation_layers = list()
start = 0
ctx_iter_cnt = 0
generation_iter_cnt = 0
# split context/generations layer information
for idx, layer_info in enumerate(perf_profile_data):
if layer_info[0] == "step":
if layer_info[1] == 0:
ctx_layers.extend(perf_profile_data[start:idx])
ctx_iter_cnt += 1
else:
generation_layers.extend(perf_profile_data[start:idx])
generation_iter_cnt += 1
start = idx + 1
# Reduce all data
def reduce_layer_data(layers):
layer_infos = dict()
for layer in layers:
if layer[0] in layer_infos:
layer_infos[layer[0]] += layer[1]
else:
layer_infos[layer[0]] = layer[1]
return layer_infos
# Dump kernel data
def dump_kernel_profile_table(name: str, profile_data: list,
iter_cnt: int):
table = pd.DataFrame(
[['{:0.3f}'.format(v), k]
for k, v in profile_data.items() if v != 0.0],
columns=['times (ms)', '{} Phase LayerName'.format(name)])
def ljust(s):
s = s.astype(str).str.strip()
return s.str.ljust(s.str.len().max())
print(table.apply(ljust).to_string(index=False, justify='left'))
print("{} phase step iter: {}".format(name, iter_cnt))
ctx_layer_infos = reduce_layer_data(ctx_layers)
generation_layer_infos = reduce_layer_data(generation_layers)
dump_kernel_profile_table("Context", ctx_layer_infos, ctx_iter_cnt)
dump_kernel_profile_table("Generation", generation_layer_infos,
generation_iter_cnt)

View File

@ -1,38 +0,0 @@
# Benchmark Multi-user Multi-round Serving with Llama-3.1-70B
## Overview
This benchmark models a multi-user, multi-round serving system: it handles multiple users concurrently, with each user exchanging a sequence of requests and responses over several rounds. It is suitable for applications like chatbots, customer support systems, or other interactive services where stateful conversations are required.
#### Application Setup
Each user is assigned a unique long context prompt consisting of 16,000 tokens with precomputed kv_cache.
* First Round: The input includes the 16,000-token context prompt and an additional 64 new input tokens. The output length is limited to 64 tokens.
* Subsequent Rounds: The input is formed by combining the previous input, the output tokens from the last round, and 64 new input tokens. The output length is again limited to 64 tokens (see the sketch below).
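The following sketch illustrates one user's request sequence under these rules. It is a simplified illustration only; the helper names (`new_tokens`, `generate`) are hypothetical stand-ins, and the real multi-user, round-robin loop lives in `benchmark.py` below.
```python
import random

def new_tokens(n: int = 64) -> list[int]:
    # Stand-in for fresh user input (benchmark.py uses generate_random_tokens).
    return [random.randint(0, 1000) for _ in range(n)]

def run_user_session(generate, context_prompt: list[int], rounds: int = 10) -> list[int]:
    # `generate` is a placeholder for a call into the executor that returns the
    # request's input tokens followed by up to 64 generated tokens.
    tokens = context_prompt + new_tokens()      # first round: 16,000-token context + 64 new tokens
    output = generate(tokens)
    for _ in range(rounds - 1):                 # subsequent rounds reuse the growing history
        output = generate(output + new_tokens())
    return output
```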
#### Benchmark Features
This benchmark leverages kv_cache reuse and allocates host (CPU) memory as a secondary pool for kv_cache blocks. It measures the end-to-end runtime of 10 rounds, with user requests processed in a round-robin fashion. As the number of users increases, the combined kv_cache footprint eventually exceeds the GPU memory capacity. In such cases, the least recently used cache blocks are offloaded to CPU memory and brought back to the GPU as needed for subsequent rounds.
Additionally, the benchmark tracks the Time to First Token (TTFT). Since each user's long context prompt has precomputed kv_cache, a new request can reuse this cache while processing the additional input tokens, so the first output token is generated efficiently.
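The host-memory kv_cache pool is configured through the executor API. A minimal sketch of the configuration used by `benchmark.py` below (the engine path is a placeholder):
```python
import tensorrt_llm.bindings.executor as trtllm

# Enable kv_cache block reuse and add a host (CPU) cache pool as secondary storage.
kv_cache_config = trtllm.KvCacheConfig(
    enable_block_reuse=True,           # reuse precomputed kv_cache blocks across rounds
    free_gpu_memory_fraction=0.9,      # fraction of free GPU memory reserved for kv_cache
    host_cache_size=55_000_000_000)    # bytes of CPU memory for offloaded blocks

executor_config = trtllm.ExecutorConfig(1, kv_cache_config=kv_cache_config)
executor = trtllm.Executor("/path/to/engine_dir", trtllm.ModelType.DECODER_ONLY,
                           executor_config)
```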
#### Comparing GH200 and H100
This benchmark highlights the potential of the NVIDIA GH200 in comparison to the H100. The GH200 utilizes NVIDIA NVLink-C2C to provide a CPU+GPU coherent memory model with 900 gigabytes per second (GB/s) memcpy throughput, which is 7x faster than the H100 connected via PCIe Gen5. The GH200 also has larger on-GPU memory: it offers configurations of 96 GB or 144 GB, while the H100 is equipped with 80 GB of GPU memory.
## Performance Comparison
> NOTE: A GH200 with 96 GB of GPU memory is used to generate the results below.
![result](./comparison.jpg)
#### On-GPU kv_cache Storage:
The H100 can support 2 concurrent users with on-GPU kv_cache storage, whereas the GH200 can support 7 concurrent users, leveraging its larger GPU memory capacity.
#### User Size = 2:
At a user size of 2, kv_cache is fully stored in GPU memory for both H100 and GH200. Performance improvements in this scenario are unrelated to NVLink-C2C or the larger GPU memory size of the GH200.
#### User Sizes 3 to 7:
The H100 must offload kv_cache to CPU memory and transfer precomputed blocks back to the GPU when needed. This additional memory transfer introduces latency due to the slower communication between CPU and GPU. The GH200 can keep the kv_cache entirely in GPU memory, eliminating the need for memory transfers. Thus, the GH200's performance improvement peaks at a user size of 7.
#### User Size > 7:
The GH200 now also needs to utilize the CPU memory pool for kv_cache. The added latency is much lower than on the H100 due to faster communication between CPU and GPU. The GH200 delivers a 1.9x improvement in Time to First Token (TTFT) and approximately a 3x improvement in end-to-end runtime over 10 rounds compared to the H100.
## Reproduction
Use **run.sh** to reproduce the benchmark.

View File

@ -1,191 +0,0 @@
import argparse
import datetime
import json
import random
import time
import tensorrt_llm.bindings.executor as trtllm
output_config = trtllm.OutputConfig()
output_config.exclude_input_from_output = False
sampling_config = trtllm.SamplingConfig(1)
def generate_random_tokens(rounds=10, count=64) -> list[list[int]]:
ret = []
for i in range(rounds):
ret.append([random.randint(0, 1000) for _ in range(count)])
return ret
# Read input tokens from json file
def read_input_json(input_dataset_path: str,
num_users) -> tuple[list[list[int]], list[int]]:
with open(input_dataset_path, "r") as f:
data = json.load(f)
input_tokens = []
output_lens = []
for n in range(num_users):
sample = data["samples"][n]
input_tokens.append(sample["input_ids"])
output_lens.append(sample["output_len"])
return input_tokens, output_lens
# Prepare and enqueue the requests
def enqueue_requests(args: argparse.Namespace, executor: trtllm.Executor,
input_tokens) -> list[int]:
request_ids = []
for tokens in input_tokens:
req = trtllm.Request(input_token_ids=tokens,
max_tokens=args.output_len,
streaming=False,
sampling_config=sampling_config,
output_config=output_config)
req_id = executor.enqueue_request(req)
request_ids.append(req_id)
return request_ids
def get_TTFT(stats_queue):
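    # TTFT is read from the executor's per-iteration stats: it is the latency of
    # the second iteration whose kv_cache hit rate exceeds 1%, i.e. an iteration
    # that reuses precomputed kv_cache blocks.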
iter_latency = []
cache_hit_rates = []
for stats in stats_queue:
iter_latency.append(stats.iter_latency_ms)
cache_hit_rates.append(stats.kv_cache_stats.cache_hit_rate)
TTFT_idx = [i for i, x in enumerate(cache_hit_rates) if x > 0.01][1]
return iter_latency[TTFT_idx]
# Wait for responses and store output tokens
def wait_for_responses(args: argparse.Namespace, request_ids: list[int],
executor: trtllm.Executor) -> list[list[int]]:
output_tokens = {req_id: [] for req_id in request_ids}
num_finished = 0
iterations = 0
    while num_finished < len(request_ids) and iterations < args.timeout_ms:
        iterations += 1
        responses = executor.await_responses(
            datetime.timedelta(milliseconds=args.timeout_ms))
for response in responses:
req_id = response.request_id
if not response.has_error():
result = response.result
num_finished += 1 if result.is_final else 0
for _, outTokens in enumerate(result.output_token_ids):
output_tokens[req_id].extend(outTokens)
else:
raise RuntimeError(
str(req_id) + " encountered error:" + response.error_msg)
return list(output_tokens.values())
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Executor Bindings Example")
parser.add_argument("--n", type=int, required=True, help="Number of users")
parser.add_argument("--free_gpu_memory_fraction",
required=False,
type=float,
default=0.9,
help="free_gpu_memory_fraction")
parser.add_argument("--kv_host_cache_bytes",
required=False,
type=int,
default=55000000000,
help="host_cache_size")
parser.add_argument("--model_path",
type=str,
required=True,
help="Directory containing model engine")
parser.add_argument("--input_dataset_path",
type=str,
required=True,
help="Text file containing the input tokens")
parser.add_argument("--beam_width",
type=int,
required=False,
default=1,
help="The beam width")
parser.add_argument("--streaming",
default=False,
action="store_true",
help="Operate in streaming mode")
parser.add_argument("--output_len",
type=int,
required=False,
default=64,
help="The number of tokens to be generated for output.")
parser.add_argument("--rounds",
type=int,
required=False,
default=10,
help="How many runs of user input to run.")
parser.add_argument(
"--timeout_ms",
type=int,
required=False,
default=10000,
help="The maximum time to wait for all responses, in milliseconds")
parser.add_argument(
"--log_iteration_data",
action='store_true',
help="Print the verbose iteration status data (default: False).")
args = parser.parse_args()
kv_cache_config = trtllm.KvCacheConfig(
enable_block_reuse=True,
free_gpu_memory_fraction=args.free_gpu_memory_fraction,
host_cache_size=args.kv_host_cache_bytes)
executor_config = trtllm.ExecutorConfig(args.beam_width,
kv_cache_config=kv_cache_config)
# Create the executor.
executor = trtllm.Executor(args.model_path, trtllm.ModelType.DECODER_ONLY,
executor_config)
new_inputs = [generate_random_tokens(args.rounds) for _ in range(args.n)]
stats_queue = []
if executor.can_enqueue_requests():
## Process long context to generate kvcache
context_tokens, _ = read_input_json(args.input_dataset_path, args.n)
# Enqueue the requests
request_ids = enqueue_requests(args, executor, context_tokens)
# Wait for the responses
output_tokens = wait_for_responses(args, request_ids, executor)
stats_queue.extend(executor.get_latest_iteration_stats())
# Start the multi-turn runs
## Start timing
start_time = time.time()
for r in range(args.rounds):
current_input_tokens = [
output_tokens[i] + new_inputs[i][r] for i in range(args.n)
]
# Enqueue the requests
request_ids = enqueue_requests(args, executor, current_input_tokens)
# Wait for the responses
output_tokens = wait_for_responses(args, request_ids, executor)
stats_queue.extend(executor.get_latest_iteration_stats())
## End timing
end_time = time.time()
elapsed_time = (end_time - start_time) * 1000
print(f"E2E TIME: {elapsed_time:.2f} (ms)")
print(f"TTFT: {get_TTFT(stats_queue)} (ms)")
if args.log_iteration_data:
for stats in stats_queue:
print(stats.to_json_str())

Binary file not shown.


View File

@ -1,49 +0,0 @@
#!/bin/bash
# Check if the environment variable is set
if [[ -z "${HUGGING_FACE_HUB_TOKEN}" ]]; then
echo "The environment variable HUGGING_FACE_HUB_TOKEN is not set."
exit 1
fi
# Get GPU name using nvidia-smi
gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader)
GPU="GH200"
# Check if the GPU is a GH200
if echo "$gpu_name" | grep -q "GH200"; then
GPU="GH200"
else
GPU="H100"
fi
echo "Running with ${GPU}."
# Generate context prompts of 16,000 tokens for each user
python3 $(pwd)/../../cpp/prepare_dataset.py \
--output=$(pwd)/dataset.json \
--tokenizer=meta-llama/Llama-3.1-70B token-norm-dist \
--num-requests=20 \
--input-mean=16000 \
--output-mean=64 \
--input-stdev=0 \
--output-stdev=0
# Build the model
trtllm-bench --workspace $(pwd)/${GPU} \
--model meta-llama/Llama-3.1-70B \
build \
--max_batch_size 16 \
--max_num_tokens 17800 \
--max_seq_len 17800 \
--quantization FP8
# Run the benchmark script
for user_size in $(seq 2 16); do
echo "Run benchmark with user size = ${user_size}."
python3 benchmark.py \
--model_path $(pwd)/${GPU}/meta-llama/Llama-3.1-70B/tp_1_pp_1 \
--input_dataset_path dataset.json \
--n ${user_size}
done

View File

@ -1,92 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
from multiprocessing import Event, Process, Queue
from queue import Empty
from tensorrt_llm.logger import logger
from tensorrt_llm.profiler import (MemUnitType, bytes_to_target_unit,
device_memory_info, host_memory_info)
class MemoryMonitor:
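    """Track peak host and device memory usage from a background process.

    A subprocess samples memory usage roughly every `query_interval` seconds
    until `stop()` is called, then reports the observed peaks back through a
    multiprocessing queue.
    """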
def __init__(self, query_interval=0.1, disable_host_mem_monitor=False):
self.query_interval = query_interval # second(s)
self.mem_monitor_process = None
# bytes
self._peak_host_memory = 0
self._peak_device_memory = 0
self.pid = os.getpid()
self.device_handles = {}
self.signal_event = Event() # Sending signal to subprocess
self.peak_mem_queue = Queue() # Receiving results from subprocess
self.disable_host_mem_monitor = disable_host_mem_monitor
def start(self):
self.mem_monitor_process = Process(target=self._upd_peak_memory_usage,
args=(self.signal_event,
self.peak_mem_queue))
self.mem_monitor_process.start()
logger.debug("Launched memory monitor subprocess.")
def kill(self):
if self.mem_monitor_process is not None:
self.mem_monitor_process.kill()
logger.debug("Memory monitor subprocess is killed.")
def stop(self):
self.signal_event.set()
logger.debug("Sent signal to stop memory monitor subprocess.")
try:
peak_mem_use = self.peak_mem_queue.get(timeout=20)
except Empty:
logger.warning("peak_mem_queue was empty.")
else:
self._peak_host_memory = max(self._peak_host_memory,
peak_mem_use[0])
self._peak_device_memory = max(self._peak_device_memory,
peak_mem_use[1])
self.mem_monitor_process.join(timeout=20)
self.mem_monitor_process = None
logger.debug("Memory monitor subprocess joined.")
self.peak_mem_queue.close()
self.peak_mem_queue.join_thread()
logger.debug("Peak memory queue closed and joined.")
def _upd_peak_memory_usage(self, signal_event, peak_mem_queue):
peak_host_used, peak_device_used = self.get_memory_usage()
        while not signal_event.is_set():
            host_used, device_used = self.get_memory_usage()
            peak_host_used = max(host_used, peak_host_used)
            peak_device_used = max(device_used, peak_device_used)
            # Throttle sampling to the configured query interval.
            time.sleep(self.query_interval)
peak_mem_queue.put((peak_host_used, peak_device_used))
def get_memory_usage(self):
if self.disable_host_mem_monitor:
host_used = 0
else:
host_used, _, _ = host_memory_info(self.pid)
device_used, _, _ = device_memory_info()
return host_used, device_used
def get_peak_memory_usage(self, unit: MemUnitType = 'GiB'):
return bytes_to_target_unit(self._peak_host_memory, unit), \
bytes_to_target_unit(self._peak_device_memory, unit)

View File

@ -38,18 +38,6 @@ python3 examples/summarize.py \
```
We can also benchmark the efficiency of Weight Streaming. Here is an example:
```bash
python3 benchmarks/python/benchmark.py \
--engine_dir /tmp/llama_7b/trt_engines/fp16/1-gpu/ \
--batch_size "1;32" \
--input_output_len "256,32" \
--gpu_weights_percent "0.0;0.3;0.6;1.0" \
--dtype float16 \
--csv \
--log_level verbose
```
### API Changes

View File

@ -18,12 +18,12 @@ This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in T
- [Run Python runtime](#run-python-runtime)
- [Benchmark](#benchmark)
- [Benchmark C++ runtime](#benchmark-c-runtime)
- [Benchmark Python runtime](#benchmark-python-runtime)
- [Run BART with LoRA](#run-bart-with-lora)
- [Reminders](#reminders)
- [Attention Scaling Factors](#attention-scaling-factors)
- [Run FairSeq NMT (Neural Machine Translation) models](#run-fairseq-nmt-neural-machine-translation-models)
- [FP8 Post-Training Quantization](#fp8-post-training-quantization)
- [Get quantized checkpoint with ModelOpt](#get-quantized-checkpoint-with-modelopt)
## Overview
@ -241,31 +241,6 @@ mpirun --allow-run-as-root -np ${WORLD_SIZE} python3 run.py --engine_dir tmp/trt
The tutorial for encoder-decoder C++ runtime benchmark can be found in [`benchmarks/cpp`](../../benchmarks/cpp/README.md#2-launch-c-benchmarking-inflightv1-batching)
#### Benchmark Python runtime
The benchmark implementation and entrypoint can be found in [`benchmarks/python/benchmark.py`](../../benchmarks/python/benchmark.py). Specifically, [`benchmarks/python/enc_dec_benchmark.py`](../../benchmarks/python/enc_dec_benchmark.py) is the benchmark script for Encoder-Decoder models.
In `benchmarks/python/`:
```bash
# Example 1: Single-GPU benchmark
python benchmark.py \
-m enc-dec \
--batch_size "1;8" \
--input_output_len "60,20;128,20" \
--engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION} \
--dtype float32 \
--csv # optional
# Example 2: Multi-GPU benchmark
mpirun --allow-run-as-root -np 4 python benchmark.py \
-m enc-dec \
--batch_size "1;8" \
--input_output_len "60,20;128,20" \
--engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION} \
--dtype float32 \
--csv # optional
```
### Run BART with LoRA

View File

@ -28,14 +28,9 @@
"accuracy/test_cli_flow.py::TestLlama3_8BInstruct::test_nvfp4": 286.4440165119886,
"perf/test_perf.py::test_perf[bert_base-cpp-ootb-float16-bs:32-input_len:32]": 111.37450777366757,
"perf/test_perf.py::test_perf[bert_base-cpp-plugin-float16-bs:32-input_len:32]": 95.00738414749503,
"perf/test_perf.py::test_perf[bert_base-ootb-float16-bs:32-input_len:32]": 132.52322902716696,
"perf/test_perf.py::test_perf[bert_base-plugin-float16-bs:32-input_len:32]": 114.33938522078097,
"perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-bs:32-input_output_len:60": 99.74059158749878,
"perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-gwp:0.0-bs:32-input_output_len:60": 98.94526879303157,
"perf/test_perf.py::test_perf[gpt_350m-cppmanager-static_batching-plugin_ifb-float16-bs:32-input_output_len:60": 100.77929892018437,
"perf/test_perf.py::test_perf[gpt_350m-ootb-float16-bs:32-input_output_len:60": 170.83428032323718,
"perf/test_perf.py::test_perf[gpt_350m-ootb-float16-gwp:0.5-bs:32-input_output_len:60": 173.8481143657118,
"perf/test_perf.py::test_perf[gpt_350m-plugin-float16-bs:32-input_output_len:60": 217.20630648359656,
"perf/test_perf.py::test_perf[roberta_base-cpp-plugin-float16-bs:32-input_len:128+512]": 140.2516261599958,
"accuracy/test_cli_flow.py::TestGemma2_9BIt::test_auto_dtype": 725.8308991710655,
"accuracy/test_cli_flow.py::TestGpt2::test_attention_ootb": 448.54090467840433,
@ -61,7 +56,6 @@
"examples/test_multimodal.py::test_llm_multimodal_general[fuyu-8b-pp:1-tp:1-float16-bs:1-cpp_e2e:True-nb:1]": 492.22362083010375,
"examples/test_multimodal.py::test_llm_multimodal_general[kosmos-2-pp:1-tp:1-float16-bs:1-cpp_e2e:True-nb:1]": 333.81485258904286,
"examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb5-bs8]": 411.88197461143136,
"test_e2e.py::test_benchmark_sanity_enable_fp8[gpt_350m]": 246.73502164706588,
"test_unittests.py::test_unittests_v2[unittest/trt/model_api/test_model_quantization.py]": 493.8186915554106,
"accuracy/test_cli_flow.py::TestGpt2::test_beam_search_large": 730.1395341157913,
"accuracy/test_cli_flow.py::TestVicuna7B::test_eagle[cuda_graph=False-chunked_context=False-typical_acceptance=False]": 422.75362031999975,
@ -118,7 +112,6 @@
"examples/test_redrafter.py::test_llm_redrafter_1gpu[use_cpp_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb5-bs8]": 386.68252966180444,
"examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb8-bs8]": 429.239758990705,
"examples/test_whisper.py::test_llm_whisper_general[large-v3-disable_gemm_plugin-disable_attention_plugin-disable_weight_only-float16-nb:1-use_python_runtime]": 327.95307156071067,
"test_e2e.py::test_benchmark_sanity_enable_fp8[llama_7b]": 253.08591708587483,
"test_e2e.py::test_build_time_benchmark_sanity": 165.71592589840293,
"test_unittests.py::test_unittests_v2[unittest/trt/attention/test_bert_attention.py]": 99.96196278184652,
"cpp/test_e2e.py::test_benchmarks[gpt-80]": 1376.0404928650241,

View File

@ -68,7 +68,7 @@ class SanityPerfCheck():
cleaned_options = []
for option in options:
# Truncate workspace dir
if "build.py" in option or "benchmark.py" in option or "SessionBenchmark.cpp" in option:
if "build.py" in option or "SessionBenchmark.cpp" in option:
cleaned_options.append("/".join(
option.split("/")[-5:]))
# Remove engine_dir as it is not useful

View File

@ -481,11 +481,9 @@ class PerfTestConfig:
labels = test_param_labels.split("-")
self.model_name = labels.pop(0)
self.runtime = "python" if labels[0] not in [
"cpp",
"cppmanager",
"bench",
] else labels.pop(0)
assert labels[0] in ["cpp", "cppmanager", "bench"], \
f"Invalid runtime {labels[0]}!"
self.runtime = labels.pop(0)
self.api = labels.pop(0) if labels[0] == "exe" else ""
self.backend = labels.pop(0) if labels[0] == "pytorch" else ""
self.streaming = labels.pop(0) if labels[0] == "streaming" else ""
@ -592,7 +590,7 @@ class PerfTestConfig:
assert self.model_name in allowed_models, f"model_name {self.model_name} is not in allowed_models!"
# Validate runtime type.
VALID_RUNTIMES = ["cpp", "cppmanager", "python", "bench"]
VALID_RUNTIMES = ["cpp", "cppmanager", "bench"]
assert self.runtime in VALID_RUNTIMES, f"Invalid runtime {self.runtime}!"
# Validate plugin mode.
@ -775,8 +773,7 @@ class MultiMetricPerfTest(AbstractPerfScriptTestClass):
elif self._config.runtime == "bench":
benchmark_script = "trtllm-bench"
else:
benchmark_script = os.path.join(llm_root, "benchmarks", "python",
"benchmark.py")
raise RuntimeError(f"Invalid runtime {self._config.runtime}.")
allowed_configs = import_allowed_perf_config()
allowed_models = allowed_configs.get_allowed_models()
if self._config.runtime == "bench":

View File

@ -25,13 +25,13 @@ import pytest
import yaml
from defs.common import convert_weights
from defs.trt_test_alternative import (check_call, check_call_negative_test,
check_output, exists, makedirs)
check_output)
from .common import (PluginOptions, convert_weights, prune_checkpoint,
quantize_data, refit_model, venv_check_call)
from .conftest import (llm_models_root, skip_nvlink_inactive,
skip_post_blackwell, skip_pre_ada, skip_pre_blackwell,
skip_pre_hopper, tests_path, unittest_path)
skip_post_blackwell, skip_pre_blackwell, skip_pre_hopper,
tests_path, unittest_path)
sys.path.append(os.path.join(str(tests_path()), '/../examples/apps'))
@ -742,79 +742,6 @@ def test_trtllm_bench_iteration_log(llm_root, llm_venv, model_name,
shutil.rmtree(engine_dir, ignore_errors=True)
@pytest.mark.parametrize("model_name", [
"gpt_350m", "gpt_350m_sq_per_tensor", "llama_70b", "bert_base",
"falcon_40b", "t5_base", "roberta_base"
],
ids=lambda x: x.strip("-"))
def test_benchmark_sanity(llm_root, llm_venv, model_name, engine_dir):
'''
sanity check on the benchmark script to make sure it works
- gpt_350m for gpt baseline.
- gpt_350m_sq_per_tensor for testing SQ
- llama_70b for GQA (num_kv_heads < num_heads) in gpt benchmark script.
- bert_base for bert baseline.
- t5_base for t5 baseline.
'''
build_script_root = os.path.join(llm_root, "tests/integration/defs/perf")
benchmark_root = os.path.join(llm_root, "benchmarks", "python")
engine_dir = os.path.join(engine_dir, model_name, "benchmark-sanity")
if not exists(engine_dir):
makedirs(engine_dir)
# max batch size 256 (default) is OOM on A30, changing to a smaller one to just test sanity
build_args = f"-m {model_name} --force_num_layer_1 --max_input_len 512 --max_batch_size 8"
# test OOTB path in one of the model
if model_name == "gpt_350m":
build_args += " --mode ootb"
build_cmd = f'{build_script_root}/build.py --output_dir {engine_dir} {build_args}'.split(
" ")
benchmark_args = f"--batch_size 1;2 --duration 0 --num_runs 1"
if 'bert' in model_name:
benchmark_args += " --input_len 20;60"
benchmark_args += " --m enc"
else:
benchmark_args += " --input_output_len 20,60;60,20"
if 't5' in model_name or 'roberta' in model_name:
benchmark_args += " --m enc-dec"
load_cmd = f'{benchmark_root}/benchmark.py --engine_dir {engine_dir} {benchmark_args}'.split(
" ")
venv_check_call(llm_venv, build_cmd)
venv_check_call(llm_venv, load_cmd)
@skip_pre_ada
@pytest.mark.parametrize("model_name",
["llama_7b", "gptj_6b", "gpt_350m", "falcon_40b"],
ids=lambda x: x.strip("-"))
def test_benchmark_sanity_enable_fp8(llm_root, llm_venv, model_name,
engine_dir):
'''
sanity check on the benchmark script to make sure it works
'''
build_script_root = os.path.join(llm_root, "tests/integration/defs/perf")
benchmark_root = os.path.join(llm_root, "benchmarks", "python")
engine_dir = os.path.join(engine_dir, model_name, "benchmark-sanity")
if not exists(engine_dir):
makedirs(engine_dir)
build_args = f"-m {model_name} --force_num_layer_1 --quantization fp8"
build_cmd = f'{build_script_root}/build.py --output_dir {engine_dir} {build_args}'.split(
" ")
benchmark_args = f"--batch_size 1;2 --duration 0 --num_runs 1 --quantization fp8"
if 'bert' in model_name:
benchmark_args += " --input_len 20;60"
benchmark_args += " --m enc"
else:
benchmark_args += " --input_output_len 20,60;60,20"
load_cmd = f'{benchmark_root}/benchmark.py --engine_dir {engine_dir} {benchmark_args}'.split(
" ")
venv_check_call(llm_venv, build_cmd)
venv_check_call(llm_venv, load_cmd)
def test_chatglm_6b_sanity(chatglm_6b_example_root, llm_venv, cmodel_dir,
engine_dir):
llm_models = llm_models_root()

View File

@ -455,14 +455,6 @@ accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8_block_scales[laten
accuracy/test_disaggregated_serving.py::TestLlama3_1_8B::test_auto_dtype[False]
accuracy/test_disaggregated_serving.py::TestLlama3_1_8B::test_auto_dtype[True]
test_e2e.py::test_benchmark_sanity[bert_base] # 127.18s
test_e2e.py::test_benchmark_sanity[gpt_350m] # 64.06s
test_e2e.py::test_benchmark_sanity[gpt_350m_sq_per_tensor] # 97.04s
test_e2e.py::test_benchmark_sanity[llama_70b] # 91.93s
test_e2e.py::test_benchmark_sanity[roberta_base]
test_e2e.py::test_benchmark_sanity[t5_base]
test_e2e.py::test_benchmark_sanity_enable_fp8[gpt_350m]
test_e2e.py::test_benchmark_sanity_enable_fp8[llama_7b]
test_e2e.py::test_llama_e2e[use_cpp_session-remove_input_padding-]
test_e2e.py::test_llama_e2e[use_py_session-remove_input_padding-]
test_e2e.py::test_llama_e2e[use_py_session--]

View File

@ -146,8 +146,6 @@ l0_a10:
- examples/test_redrafter.py::test_llm_redrafter_1gpu[use_cpp_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb8-bs8]
- examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb5-bs8]
- examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb8-bs8]
- test_e2e.py::test_benchmark_sanity[bert_base]
- test_e2e.py::test_benchmark_sanity[roberta_base]
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba-130m-float16-disable_gemm_plugin] # 3 mins
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba2-130m-float16-disable_gemm_plugin]
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba-codestral-7B-v0.1-float16-disable_gemm_plugin] # 4 mins
@ -162,7 +160,6 @@ l0_a10:
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba-130m-float16-enable_gemm_plugin] # 2 mins
- llmapi/test_llm_e2e.py::test_llmapi_load_engine_from_build_command[llama-codellama/CodeLlama-7b-Instruct-hf] # 5min
- llmapi/test_llm_e2e.py::test_llmapi_load_ckpt_from_convert_command # 5min
- test_e2e.py::test_benchmark_sanity[t5_base]
- examples/test_openai.py::test_llm_openai_triton_1gpu
- examples/test_openai.py::test_llm_openai_triton_plugingen_1gpu
- test_e2e.py::test_build_time_benchmark_sanity

View File

@ -277,7 +277,5 @@ l0_h100:
- examples/test_gpt.py::test_llm_minitron_fp8_with_pseudo_loras[4b]
- examples/test_chatglm.py::test_llm_glm_4_9b_single_gpu_summary[glm-4-9b-disable_weight_only]
- unittest/trt/model_api/test_model_quantization.py # 20 mins on H100
- test_e2e.py::test_benchmark_sanity_enable_fp8[llama_7b] # 55.77s H100 only
- test_e2e.py::test_benchmark_sanity_enable_fp8[gpt_350m] # 34.07s H100 only
- unittest/bindings # 8 mins on H100
- test_e2e.py::test_build_time_benchmark_sanity

View File

@ -14,19 +14,11 @@ l0_perf:
stage: pre_merge
backend: tensorrt
tests:
- perf/test_perf.py::test_perf[bert_base-plugin-float16-bs:32-input_len:32]
- perf/test_perf.py::test_perf[bert_base-cpp-plugin-float16-bs:32-input_len:32]
- perf/test_perf.py::test_perf[bert_base-ootb-float16-bs:32-input_len:32]
- perf/test_perf.py::test_perf[bert_base-cpp-ootb-float16-bs:32-input_len:32]
- perf/test_perf.py::test_perf[roberta_base-cpp-plugin-float16-bs:32-input_len:128+512]
- perf/test_perf.py::test_perf[gpt_350m-plugin-float16-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-ootb-float16-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-ootb-float16-gwp:0.5-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-gwp:0.0-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-static_batching-plugin_ifb-float16-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin-float16-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-static_batching-plugin-float16-bs:32-input_output_len:60,20]
- perf/test_perf.py::test_perf[t5_base-plugin-float16-bs:8-input_output_len:60,20]
- perf/test_perf.py::test_perf[flan_t5_base-plugin-float16-bs:8-input_output_len:60,20]
- perf/test_perf.py::test_perf[bart_large_cnn-plugin-float16-bs:8-input_output_len:60,20]

View File

@ -85,8 +85,6 @@ full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-mini-
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-8k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[roberta_base] SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/functional SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/quantization SKIP (Disable for Blackwell)
full:B200_PCIe/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
@ -102,7 +100,6 @@ full:B200_PCIe/examples/test_medusa.py::test_llm_medusa_with_qaunt_base_model_1g
full:B200_PCIe/unittest/bindings SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/attention/test_sage_attention.py unittest/llmapi/test_llm_download.py unittest/llmapi/test_llm_kv_cache_events.py unittest/llmapi/test_mpi_session.py unittest/trt/model/redrafter unittest/trt/model/test_phi.py unittest/trt/model/test_unet.py unittest/trt/python_plugin unittest/tools unittest/utils unittest/others SKIP (Disable for Blackwell)
full:B200_PCIe/test_e2e.py::test_bert_e2e SKIP (Disable for Blackwell)
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/quantization/test_weight_only_quant_matmul.py SKIP (Disable for Blackwell)
full:B200_PCIe/unittest/trt/quantization/test_weight_only_groupwise_quant_matmul.py SKIP (Disable for Blackwell)
full:B200_PCIe/examples/test_gpt.py::test_llm_gpt2_starcoder_weight_only[starcoder2-int8-float16] SKIP (Disable for Blackwell)
@ -137,7 +134,6 @@ full:B200_PCIe/examples/test_nemotron.py::test_llm_nemotron_3_8b_1gpu[bfloat16-f
full:B200_PCIe/accuracy/test_cli_flow.py::TestMixtral8x7B::test_fp4_plugin SKIP (Disable for Blackwell OOM)
full:B200_PCIe/examples/test_commandr.py::test_llm_commandr_v01_single_gpu_summary[disable_weight_only] SKIP (Disable for Blackwell OOM)
full:B200_PCIe/unittest/llmapi/test_llm_models.py -m "not (part0 or part1)" SKIP (Disable for Blackwell OOM)
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[t5_base] SKIP (Disable for Blackwell for custom mask input)
full:B200/examples/test_llama.py::test_llm_llama_v2_1gpu_auto_parallel[llama-v2-7b-hf] SKIP (Disable for Blackwell)
full:B200/examples/test_mamba.py::test_llm_mamba_1gpu[mamba2-130m-float16-enable_gemm_plugin] SKIP (Disable for Blackwell)
@ -180,8 +176,6 @@ full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-ins
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3-mini-128k-instruct-fp8-float16] SKIP (Disable for Blackwell)
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3.5-mini-instruct-fp8-float16] SKIP (Disable for Blackwell)
full:B200/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
full:B200/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
full:B200/test_e2e.py::test_benchmark_sanity[roberta_base] SKIP (Disable for Blackwell)
full:B200/unittest/trt/functional SKIP (Disable for Blackwell)
full:B200/unittest/trt/quantization SKIP (Disable for Blackwell)
full:B200/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
@ -197,7 +191,6 @@ full:B200/examples/test_medusa.py::test_llm_medusa_with_qaunt_base_model_1gpu[fp
full:B200/unittest/bindings SKIP (Disable for Blackwell)
full:B200/unittest/trt/attention/test_sage_attention.py unittest/llmapi/test_llm_download.py unittest/llmapi/test_llm_kv_cache_events.py unittest/llmapi/test_mpi_session.py unittest/trt/model/redrafter unittest/trt/model/test_phi.py unittest/trt/model/test_unet.py unittest/trt/python_plugin unittest/tools unittest/utils unittest/others SKIP (Disable for Blackwell)
full:B200/test_e2e.py::test_bert_e2e SKIP (Disable for Blackwell)
full:B200/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
full:B200/unittest/trt/quantization/test_weight_only_quant_matmul.py SKIP (Disable for Blackwell)
full:B200/unittest/trt/quantization/test_weight_only_groupwise_quant_matmul.py SKIP (Disable for Blackwell)
full:B200/examples/test_gpt.py::test_llm_gpt2_starcoder_weight_only[starcoder2-int8-float16] SKIP (Disable for Blackwell)
@ -233,7 +226,6 @@ full:B200/accuracy/test_cli_flow.py::TestMixtral8x7B::test_fp4_plugin SKIP (Disa
full:B200/accuracy/test_cli_flow.py::TestMixtral8x7B::test_int8_plugin_tp8 SKIP (INT8/INT4 quantization is not supported on SM>=100.)
full:B200/examples/test_commandr.py::test_llm_commandr_v01_single_gpu_summary[disable_weight_only] SKIP (Disable for Blackwell OOM)
full:B200/unittest/llmapi/test_llm_models.py -m "not (part0 or part1)" SKIP (Disable for Blackwell OOM)
full:B200/test_e2e.py::test_benchmark_sanity[t5_base] SKIP (Disable for Blackwell for custom mask input)
full:B200/examples/test_llama.py::test_llm_llama_code_llama_quantization_4gpus_summary[CodeLlama-34b-Instruct-tp2pp2-int4_awq-nb:4] SKIP (not support on B200)
full:B200/examples/test_llama.py::test_llm_llama_code_llama_quantization_4gpus_summary[CodeLlama-70b-hf-tp2pp2-int4_awq-nb:1] SKIP (not support on B200)
full:B200/examples/test_enc_dec.py::test_llm_enc_dec_general[compare_hf-t5-small-float16-enable_gemm_plugin-enable_attention_plugin-enable_paged_kv_cache-tp:1-pp:1-nb:1] SKIP (not support on B200)