mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-13 22:18:36 +08:00)

chore: Remove deprecated Python runtime benchmark (#4171)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

parent: f4059c6e2e
commit: 6c45586c51
@ -2,11 +2,9 @@

 ## Overview

-There are currently three workflows to benchmark TensorRT-LLM:
+There are currently two workflows to benchmark TensorRT-LLM:
+* [`trtllm-bench`](../docs/source/performance/perf-benchmarking.md)
+  - `trtllm-bench` is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
+  - _NOTE_: This benchmarking suite is a current work in progress and is prone to large changes.
 * [C++ benchmarks](./cpp)
   - The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
-* [Python benchmarks](./python)
-  - The Python benchmarking scripts can only benchmark the Python runtime, which do not support the latest features, such as in-flight batching.
-* [The Python benchmarking suite](../docs/source/performance/perf-benchmarking.md)
-  - This benchmarker is native to TensorRT-LLM and is a Python benchmarker for reproducing and testing the performance of TensorRT-LLM.
-  - _NOTE_: This benchmarking suite is a current work in progress and is prone to large changes.
@ -1,51 +0,0 @@
# Benchmark Python Runtime

> [!WARNING]
> The Python benchmarks are not recommended for benchmarking; please use the C++ benchmarks instead.
> The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with multiple GPUs, or multiple nodes with multiple GPUs using the Python runtime.

## Overview

The benchmark implementation and entrypoint can be found in [`benchmarks/python/benchmark.py`](./benchmark.py). There are some other scripts in the directory:

* [`benchmarks/python/base_benchmark.py`](./base_benchmark.py) implements the base class for benchmarks.
* [`benchmarks/python/gpt_benchmark.py`](./gpt_benchmark.py) implements benchmark scripts for GPT and GPT-like (LLaMA/OPT/GPT-J/SmoothQuant-GPT) models.
* [`benchmarks/python/bert_benchmark.py`](./bert_benchmark.py) implements benchmark scripts for BERT models.
* [`benchmarks/python/enc_dec_benchmark.py`](./enc_dec_benchmark.py) implements benchmark scripts for encoder-decoder models.

## Usage

Use the `--help` option for detailed usage:
```
python benchmark.py -h
```

### 1. Single-GPU benchmark
Take LLaMA 7B as an example:
```
python benchmark.py \
    -m dec \
    --engine_dir llama_7b \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"
```
Expected outputs:
```
[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 170.77 percentile95(ms) 117.591 percentile99(ms) 124.262 latency(ms) 117.115 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 110.189 total_generated_tokens 19.0 generation_tokens_per_second 172.43
[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 8 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 1478.55 percentile95(ms) 108.641 percentile99(ms) 109.546 latency(ms) 108.214 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 98.194 total_generated_tokens 152.0 generation_tokens_per_second 1547.951
[BENCHMARK] model_name dec world_size 2 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 64 gpu_weights_percent 1.0 input_length 60 output_length 20 gpu_peak_mem(gb) 0.0 build_time(s) None tokens_per_sec 8214.87 percentile95(ms) 156.748 percentile99(ms) 160.203 latency(ms) 155.815 compute_cap sm90 quantization QuantMode.FP8_QDQ|FP8_KV_CACHE generation_time(ms) 111.078 total_generated_tokens 1216.0 generation_tokens_per_second 10947.303
...
```
*Please note that the expected outputs are for reference only; the actual performance numbers depend on the GPU you're using.*

### 2. Multi-GPU benchmark
Take LLaMA 7B as an example:
```
mpirun -n 2 python benchmark.py \
    -m dec \
    --engine_dir llama_7b \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20"
```
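The `[BENCHMARK]` lines above are flat `name value` pairs, so they are easy to post-process when tracking results over time. A minimal sketch, assuming every field in the line is a single `name value` pair as in the example output (the helper name is illustrative):

```python
# Sketch: parse a "[BENCHMARK] key value key value ..." line into a dict.
def parse_benchmark_line(line: str) -> dict:
    assert line.startswith("[BENCHMARK] ")
    tokens = line[len("[BENCHMARK] "):].split()
    # Pair up consecutive tokens: ["model_name", "dec", "world_size", "2", ...]
    return {k: v for k, v in zip(tokens[0::2], tokens[1::2])}

line = ("[BENCHMARK] model_name dec world_size 2 batch_size 1 "
        "latency(ms) 117.115 tokens_per_sec 170.77")
print(parse_benchmark_line(line)["tokens_per_sec"])  # -> 170.77
```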
@ -1,139 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from argparse import ArgumentParser

# isort: off
import torch
# isort: on
from cuda import cuda, cudart

import tensorrt_llm as tllm
from tensorrt_llm import Mapping, Tensor
from tensorrt_llm._utils import local_mpi_rank, local_mpi_size
from tensorrt_llm.functional import (AllReduceParams, AllReduceStrategy,
                                     allreduce)
from tensorrt_llm.plugin.plugin import (current_all_reduce_helper,
                                        init_all_reduce_helper)
from tensorrt_llm.runtime import Session


def allreduce_benchmark(dtype: str,
                        test_range: str = "10,10000000,10",
                        no_header: bool = False):
    tllm.logger.set_level('error')
    world_size = tllm.mpi_world_size()
    rank = tllm.mpi_rank()
    local_rank = local_mpi_rank()
    gpus_per_node = local_mpi_size()

    torch.cuda.set_device(local_rank)
    cudart.cudaSetDevice(local_rank)

    mapping = Mapping(world_size, rank, gpus_per_node, tp_size=world_size)

    if world_size == 1:
        raise RuntimeError("Benchmark must run with mpi_world_size > 1")

    torch_dtype = tllm._utils.str_dtype_to_torch(dtype)
    min_size, max_size, ratio = [int(i) for i in test_range.split(",")]
    inner_loop = 1000

    size = min_size
    dtype_size = torch.finfo(torch_dtype).bits // 8
    if mapping.rank == 0 and not no_header:
        print(
            f"{'world_size':<15}, {'dtype':<10}, {'message size':<15}, {'strategy':<15}, {'duration (ms)':<10}"
        )
    while size < max_size:
        input = torch.ones(size, dtype=torch_dtype, device="cuda")

        for strategy in [
                AllReduceStrategy.AUTO,
                AllReduceStrategy.NCCL,
                AllReduceStrategy.ONESHOT,
                AllReduceStrategy.TWOSHOT,
        ]:
            builder = tllm.Builder()
            net = builder.create_network()
            net.plugin_config.set_nccl_plugin(dtype)
            init_all_reduce_helper()
            _buffers, workspace = current_all_reduce_helper(
            ).allocate_workspace(mapping, size * dtype_size)

            with tllm.net_guard(net):
                tllm.default_trtnet()

                x = Tensor(name='x',
                           shape=input.shape,
                           dtype=tllm.str_dtype_to_trt(dtype))

                current_all_reduce_helper().set_workspace_tensor(mapping)

                current = x
                for _ in range(inner_loop):
                    current = allreduce(
                        current,
                        mapping.tp_group,
                        all_reduce_params=AllReduceParams(strategy=strategy))
                current.mark_output('output', dtype)
            feed_dict = {'x': input, 'all_reduce_workspace': workspace}
            builder_config = builder.create_builder_config(precision=dtype)
            engine = builder.build_engine(net, builder_config)
            assert engine is not None, "Failed to build engine"
            session = Session.from_serialized_engine(engine)

            _, start = cuda.cuEventCreate(0)
            _, stop = cuda.cuEventCreate(0)
            runtimes = []

            tllm.mpi_barrier()
            output = torch.empty(input.shape, dtype=torch_dtype, device='cuda')
            stream = torch.cuda.current_stream()
            for _ in range(10):
                cuda.cuEventRecord(start, stream.cuda_stream)
                session.run(inputs=feed_dict,
                            outputs={"output": output},
                            stream=stream.cuda_stream)
                cuda.cuEventRecord(stop, stream.cuda_stream)
                torch.cuda.synchronize()
                _, ms = cuda.cuEventElapsedTime(start, stop)
                runtimes.append(ms)

            median_ms = sorted(runtimes)[len(runtimes) // 2]

            allreduce_ref = (input * world_size)**inner_loop
            assert torch.allclose(output, allreduce_ref)

            if mapping.rank == 0:
                print(
                    f"{mapping.world_size:<15}, {dtype:<10}, {size:<15}, {strategy.name:<15}, {median_ms:<10.2f}"
                )
        size *= ratio


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--dtype", "-t", default="float16")
    parser.add_argument(
        "--range",
        "-r",
        default="256,256000000,10",  # 256 to 256M
        help="min_size,max_size,multiplicative_ratio")
    parser.add_argument("--no-header", action="store_true")
    args = parser.parse_args()

    allreduce_benchmark(args.dtype, args.range, args.no_header)
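The script above prints the median time for `inner_loop` chained all-reduce ops over a message of `size` elements, so an approximate per-op algorithmic bandwidth can be derived from the printed duration. A minimal sketch under those assumptions (no bus-bandwidth correction factors are applied):

```python
# Sketch: convert the printed duration into an approximate per-op bandwidth.
# Assumes duration_ms covers `inner_loop` all-reduce calls on `size` elements
# of `dtype_size` bytes each, matching the benchmark loop above.
def approx_allreduce_gbps(size: int, dtype_size: int, duration_ms: float,
                          inner_loop: int = 1000) -> float:
    bytes_per_op = size * dtype_size
    seconds_per_op = (duration_ms / 1000.0) / inner_loop
    return bytes_per_op / seconds_per_op / 1e9

print(f"{approx_allreduce_gbps(size=1 << 20, dtype_size=2, duration_ms=110.0):.2f} GB/s")
```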
@ -1,211 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
import subprocess
import time
from collections import OrderedDict

import torch

import tensorrt_llm
from tensorrt_llm.logger import logger
from tensorrt_llm.quantization import QuantMode


def get_compute_cap():
    output = subprocess.check_output(
        ['nvidia-smi', "--query-gpu=compute_cap", "--format=csv"])
    _, csv_value, *_ = output.splitlines()
    return str(int(float(csv_value) * 10))


def get_csv_filename(model, dtype, tp_size, **kwargs):
    sm = get_compute_cap()
    if len(kwargs) == 0:
        kw_pairs = ""
    else:
        kw_pairs = "_" + "_".join([str(k) + str(v) for k, v in kwargs.items()])
    return f'{model}_{dtype}_tp{tp_size}_{kw_pairs}_sm{sm}.csv'


def get_engine_name(model, dtype, tp_size, rank):
    return '{}_{}_tp{}_rank{}.engine'.format(model, dtype, tp_size, rank)


def serialize_engine(engine, path):
    logger.info(f'Serializing engine to {path}...')
    tik = time.time()
    with open(path, 'wb') as f:
        # The engine object already complies with the Python buffer protocol, so there
        # is no need to convert it to a bytearray before writing; that conversion
        # would consume a lot of memory.
        f.write(engine)
    tok = time.time()
    t = time.strftime('%H:%M:%S', time.gmtime(tok - tik))
    logger.info(f'Engine serialized. Total time: {t}')


def get_last_path_component(path):
    normalized_path = os.path.normpath(path)
    last_component = os.path.basename(normalized_path)
    return last_component


class BaseBenchmark(object):

    def __init__(self, engine_dir, model_name, dtype, rank, world_size):
        self.engine_dir = engine_dir
        self.model_name = model_name
        self.dtype = dtype
        self.runtime_rank = rank
        self.world_size = world_size
        self.engine_model_name = model_name
        self.quant_mode = QuantMode(0)
        self.enable_fp8 = False
        # Read config from engine directory
        config_path = os.path.join(engine_dir, 'config.json')
        with open(config_path, 'r') as f:
            self.config = json.load(f)
        # Sanity checks
        if 'pretrained_config' in self.config:  # new build api branch
            config_dtype = self.config['pretrained_config']['dtype']
            assert dtype == config_dtype, f"Engine dtype ({config_dtype}) != Runtime dtype ({dtype})"
            world_size = self.config['pretrained_config']['mapping'][
                'world_size']
            assert world_size == self.world_size, \
                (f'Engine world size ({world_size}) != Runtime world size ({self.world_size})')
            # Load config into self
            for key, value in self.config['pretrained_config'].items():
                setattr(self, key, value)

            self.quant_mode = QuantMode.from_quant_algo(
                quant_algo=self.quantization['quant_algo'],
                kv_cache_quant_algo=self.quantization['kv_cache_quant_algo'])
            self.enable_fp8 = self.quant_mode.has_fp8_qdq()
            self.fp8_kv_cache = self.quant_mode.has_fp8_kv_cache()

            for key, value in self.config['build_config'].items():
                setattr(self, key, value)

            for key, value in self.plugin_config.items():
                if "plugin" in key:
                    key = "use_" + key
                setattr(self, key, value)

            self.engine_name = f"rank{self.runtime_rank}.engine"

            self.num_kv_heads = self.num_key_value_heads
            self.num_layers = self.num_hidden_layers
            self.num_heads = self.num_attention_heads
        else:
            # Read config from engine directory
            config_path = os.path.join(engine_dir, 'config.json')
            with open(config_path, 'r') as f:
                self.config = json.load(f)
            # Sanity checks
            config_dtype = self.config['builder_config']['precision']
            assert dtype == config_dtype, f"Engine dtype ({config_dtype}) != Runtime dtype ({dtype})"
            world_size = self.config['builder_config']['tensor_parallel']
            assert world_size == self.world_size, \
                (f'Engine world size ({world_size}) != Runtime world size ({self.world_size})')
            # Load config into self
            for key, value in self.config['builder_config'].items():
                if key == "quant_mode":
                    self.quant_mode = QuantMode(value)
                elif key in "name":
                    self.engine_model_name = value
                else:
                    setattr(self, key, value)
            self.enable_fp8 = self.quant_mode.has_fp8_qdq()
            self.fp8_kv_cache = self.quant_mode.has_fp8_kv_cache()
            for key, value in self.config['plugin_config'].items():
                # Same effect as self.use_foo_plugin = config.json["foo_plugin"]
                if "plugin" in key:
                    key = "use_" + key
                setattr(self, key, value)
            self.engine_name = get_engine_name(self.engine_model_name,
                                               self.dtype, self.world_size,
                                               self.runtime_rank)

        self.runtime_mapping = tensorrt_llm.Mapping(world_size=self.world_size,
                                                    rank=self.runtime_rank,
                                                    tp_size=self.world_size)

        torch.cuda.set_device(self.runtime_rank %
                              self.runtime_mapping.gpus_per_node)

        self.csv_filename = ""  # lazy init

    def get_report_dict(self, benchmark_profiler=None):
        report_fields = [
            "engine_dir",
            "world_size",
            "num_heads",
            "num_kv_heads",
            "num_layers",
            "hidden_size",
            "vocab_size",
            "precision",
            "batch_size",
            "gpu_weights_percent",
            "input_length",
            "output_length",
            "gpu_peak_mem(gb)",
            "build_time(s)",
            "tokens_per_sec",
            "percentile95(ms)",
            "percentile99(ms)",
            "latency(ms)",
            "compute_cap",
        ]
        report_dict = OrderedDict.fromkeys(report_fields)
        report_dict["engine_dir"] = get_last_path_component(self.engine_dir)
        report_dict["world_size"] = self.world_size
        report_dict["precision"] = self.dtype
        report_dict["quantization"] = str(self.quant_mode)
        report_dict["compute_cap"] = "sm" + get_compute_cap()
        return report_dict

    def get_csv_filename(self):
        if len(self.csv_filename) == 0:
            self.csv_filename = get_csv_filename(
                get_last_path_component(self.engine_dir),
                self.dtype,
                self.world_size,
                fp8linear=int(self.enable_fp8))
        return self.csv_filename

    def print_report_header(self, csv=False, benchmark_profiler=None):
        if csv and self.runtime_rank == 0:
            report_dict = self.get_report_dict(benchmark_profiler)
            line = ",".join(report_dict.keys())
            print(line)
            with open(self.get_csv_filename(), "a") as file:
                file.write(line + "\n")

    def get_config(self):
        raise NotImplementedError

    def prepare_inputs(self, config):
        raise NotImplementedError

    def run(self, inputs, config, benchmark_profiler=None):
        raise NotImplementedError

    def report(self, config, latency):
        raise NotImplementedError

    def set_weight_streaming(self, config):
        raise NotImplementedError
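The abstract methods at the bottom define the contract that each concrete benchmark (GPT, BERT, encoder-decoder) fills in. A minimal sketch of a subclass, assuming the class name and the config tuple layout shown here are illustrative rather than taken from the removed code:

```python
# Illustrative sketch of the BaseBenchmark contract; DummyBenchmark is hypothetical.
class DummyBenchmark(BaseBenchmark):

    def get_config(self):
        # Yield one config tuple per measurement point.
        for batch_size in (1, 8):
            yield (batch_size, 128, 1.0)  # (batch_size, input_len, gpu_weights_percent)

    def set_weight_streaming(self, config):
        pass  # no weight streaming in this sketch

    def prepare_inputs(self, config):
        return {}  # whatever run() needs for this config

    def run(self, inputs, config, benchmark_profiler=None):
        pass  # execute one iteration of the workload

    def report(self, config, latency):
        print(f"config={config} latency(ms)={latency}")
```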
@ -1,354 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import multiprocessing as mp
from time import time

import torch


def parse_arguments():
    parser = argparse.ArgumentParser(
        description='Benchmark TensorRT-LLM models.')
    parser.add_argument('-m',
                        '--model',
                        type=str,
                        default="dec",
                        choices=["dec", "enc", "enc-dec"],
                        help='Specify type of the model you want to benchmark. '
                        'Choose model between dec/enc/enc-dec.')

    parser.add_argument('--batch_size',
                        type=str,
                        default="8",
                        help=('Specify batch size(s) you want to benchmark. '
                              'Multiple batch sizes can be separated by \";\", '
                              'example: \"1;8;64\".'))
    parser.add_argument(
        '--input_len',
        type=str,
        default="128",
        help=('Specify input length(s) you want to benchmark, '
              'this option is mainly for BERT. '
              'Multiple input lengths can be separated by \";\", '
              'example: \"20;60;128\".'))
    parser.add_argument(
        '--input_output_len',
        type=str,
        default="128,20",
        help=('Specify input-output length(s) you want to benchmark, '
              'this option is mainly for GPT and GPT-like models. '
              'Multiple input lengths can be separated by \";\", '
              'example: \"60,20;128,20\".'))
    parser.add_argument(
        '--dtype',
        type=str,
        default='float16',
        choices=['float16', 'bfloat16', 'float32'],
        help='Choose data type between float16/bfloat16/float32.')

    parser.add_argument('--num_beams',
                        type=int,
                        default="1",
                        help=('Specify number of beams you want to benchmark.'))
    parser.add_argument('--top_k',
                        type=int,
                        default="1",
                        help=('Specify Top-K value of decoding.'))
    parser.add_argument('--top_p',
                        type=float,
                        default="0",
                        help=('Specify Top-P value of decoding.'))
    parser.add_argument(
        '--input_timing_cache',
        type=str,
        default=None,
        help=
        'The path to read timing cache, will be ignored if the file does not exist'
    )
    parser.add_argument('--output_timing_cache',
                        type=str,
                        default='model.cache',
                        help='The path to write timing cache')
    parser.add_argument(
        '--log_level',
        type=str,
        default="error",
        choices=['verbose', 'info', 'warning', 'error', 'internal_error'],
        help=
        'Choose log level between verbose/info/warning/error/internal_error.')
    parser.add_argument(
        '--warm_up',
        type=int,
        default=2,
        help='Specify warm up iterations before benchmark starts.')
    parser.add_argument(
        '--num_runs',
        type=int,
        default=10,
        help='Minimal number of iterations to run during benchmarking.')
    parser.add_argument(
        '--duration',
        type=int,
        default=60,
        help='Minimal duration of iterations to measure in seconds.')

    parser.add_argument(
        '--engine_dir',
        type=str,
        default=None,
        required=True,
        help=
        ('If this option is specified, instead of building engines on-air before benchmarking, '
         'the engines contained in the engine_dir will be used.'))
    parser.add_argument(
        '--gpu_weights_percent',
        type=str,
        default="1.0",
        help='Specify the percentage of weights that reside on GPU (from 0 to 1).'
        'Multiple percentages can be separated by \";\", '
        'example: \"0;0.5;1\".')

    parser.add_argument('--csv',
                        default=False,
                        action="store_true",
                        help='Output in CSV format.')
    parser.add_argument('--enable_cuda_graph',
                        default=False,
                        action='store_true',
                        help='Execute GPT session with CUDA graph.')
    parser.add_argument(
        '--quantization',
        type=str,
        default=None,
        choices=[
            'fp8', 'fp8_gemm', 'fp8_kv_cache', 'int8_sq_per_tensor',
            'int8_sq_per_token_channel', 'int8_weight_only', 'int4_weight_only',
            'int4_weight_only_awq', 'int4_weight_only_gptq',
            'int8_sq_per_channel_ootb'
        ],
        help="Optimize the model with specified quantization recipe")

    parser.add_argument(
        '--dump_profile',
        default=False,
        action='store_true',
        help="Print profile information per layer (default = disabled)")

    parser.add_argument(
        '--dump_layer_info',
        default=False,
        action='store_true',
        help=
        "Print layer information of the engine to console (default = disabled)")

    return parser.parse_args()


def main(args):
    # We import tensorrt_llm here because MPI is initialized when
    # tensorrt_llm is imported, but mpi4py does not work well with
    # the start method `spawn` of Python multiprocessing,
    # so we set the start method first, then initialize MPI.
    from benchmark_profiler import BenchmarkProfiler
    from bert_benchmark import BERTBenchmark
    from enc_dec_benchmark import EncDecBenchmark
    from gpt_benchmark import GPTBenchmark

    import tensorrt_llm
    from tensorrt_llm.logger import logger

    logger.set_level(args.log_level)

    # Batch size
    batch_size_options = args.batch_size.split(';')
    batch_size_options = [int(i) for i in batch_size_options]
    # Input length (for BERT-like models)
    input_len_options = args.input_len.split(';')
    input_len_options = [int(i) for i in input_len_options]
    # Input-output length combination (for GPT-like models and enc_dec models)
    in_out_len_options = args.input_output_len.split(';')
    in_out_len_options = [[int(i) for i in io.split(',')]
                          for io in in_out_len_options]

    # GPU weights percentage ratios
    gpu_weights_percents = [
        float(r) for r in args.gpu_weights_percent.split(";")
    ]
    for percent in gpu_weights_percents:
        if percent < 0 or percent > 1:
            raise Exception(
                f"--gpu_weights_percent only accepts values between 0.0 and 1.0."
            )

    rank = tensorrt_llm.mpi_rank()
    world_size = tensorrt_llm.mpi_world_size()

    # TODO: Re-enable memory monitor for multi-gpu benchmarks.
    # Current Mem Monitor will cause benchmark script hang
    # because MPI does not work well with multiprocessing.
    disable_mem_monitor = world_size > 1
    if not disable_mem_monitor:
        from mem_monitor import MemoryMonitor

    benchmark_profiler = None
    if args.model == "dec":
        benchmark_profiler = BenchmarkProfiler()
        benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
                                   gpu_weights_percents, rank, world_size)
    elif args.model == "enc":
        benchmarker = BERTBenchmark(args, batch_size_options, input_len_options,
                                    gpu_weights_percents, rank, world_size)
    elif args.model == "enc-dec":
        benchmarker = EncDecBenchmark(args, batch_size_options,
                                      in_out_len_options, gpu_weights_percents,
                                      rank, world_size)
    else:
        raise Exception(f'Unexpected model: {args.model}')

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    benchmarker.print_report_header(args.csv,
                                    benchmark_profiler=benchmark_profiler)
    for config in benchmarker.get_config():
        try:
            # We pass in config instead of the gpu_weights_percent here to keep this benchmark script
            # agnostic to the length and contents of the config.
            benchmarker.set_weight_streaming(config)
            inputs = benchmarker.prepare_inputs(config)
        except torch.cuda.OutOfMemoryError as e:
            logger.error(
                f'Exception {e} caught while allocating memory; skipping {config}'
            )
            continue

        torch.cuda.empty_cache()
        latencies = []
        # Disable Host memory monitor when cuda graph is enabled for cuda graph performance.
        disable_host_mem_monitor = False
        if args.enable_cuda_graph:
            logger.warning(
                'Disable host memory monitor when cuda graph is enabled.')
            disable_host_mem_monitor = True

        if not disable_mem_monitor:
            memory_monitor = MemoryMonitor(
                disable_host_mem_monitor=disable_host_mem_monitor)
            memory_monitor.start()

        iter_idx = 0
        try:
            # Warm up
            for _ in range(args.warm_up):
                benchmarker.run(inputs, config)
            logger.info('Warm up done. Start benchmarking.')
            if benchmark_profiler is not None:
                benchmark_profiler.clean()
                benchmark_profiler.start()
            cur_duration = 0
            start_time = time()
            while iter_idx < args.num_runs or cur_duration < args.duration:
                start.record()
                benchmarker.run(inputs,
                                config,
                                benchmark_profiler=benchmark_profiler)
                end.record()

                torch.cuda.synchronize()
                latencies.append(start.elapsed_time(end))

                iter_idx += 1
                cur_duration = round(time() - start_time, 3)
            logger.info(
                f'Benchmarking done. Iteration: {iter_idx}, duration: {cur_duration} sec.'
            )

        except Exception as e:
            logger.error("Found exception during benchmarking",
                         e.with_traceback())
            if not disable_mem_monitor:
                memory_monitor.kill()
            raise e

        if not disable_mem_monitor:
            memory_monitor.stop()
            _, peak_gpu_used = memory_monitor.get_peak_memory_usage("GiB")
            peak_gpu_used = round(peak_gpu_used, 3)
        else:
            peak_gpu_used = 0.0

        if benchmark_profiler is not None:
            benchmark_profiler.add_aux_info('iter_count', iter_idx)
            benchmark_profiler.stop()

        # Print latencies to make it easier to check perf stability.
        if len(latencies) <= 20:
            latencies_str = str(latencies)
        else:
            latencies_str = ("[" + ", ".join([str(l) for l in latencies[:10]]) +
                             "..." +
                             ", ".join([str(l) for l in latencies[-10:]]) + "]")
        logger.info(f"Latencies: {latencies_str}")

        latency = round(sum(latencies) / iter_idx, 3)
        latencies.sort()
        percentile95 = round(latencies[int(iter_idx * 0.95)], 3)
        percentile99 = round(latencies[int(iter_idx * 0.99)], 3)
        benchmarker.report(config,
                           latency,
                           percentile95,
                           percentile99,
                           peak_gpu_used,
                           csv=args.csv,
                           benchmark_profiler=benchmark_profiler)

        # Rerun for dumping profile per layer.
        if args.dump_profile and benchmark_profiler is not None:
            benchmark_profiler.set_recording_perf_profile(True)
            logger.info(f'Dump profile information per layer')
            iter_idx = 0
            try:
                # Warm up
                for _ in range(args.warm_up):
                    benchmarker.run(inputs, config)
                if benchmark_profiler is not None:
                    benchmark_profiler.clean()
                    benchmark_profiler.start()
                cur_duration = 0
                start_time = time()
                while iter_idx < args.num_runs or cur_duration < args.duration:
                    start.record()
                    benchmarker.run(inputs,
                                    config,
                                    benchmark_profiler=benchmark_profiler)
                    end.record()
                    torch.cuda.synchronize()
                    latencies.append(start.elapsed_time(end))
                    iter_idx += 1
                    cur_duration = round(time() - start_time, 3)
                benchmarker.report_profiler(
                    benchmark_profiler=benchmark_profiler)
            except Exception as e:
                logger.error("Found exception during benchmarking",
                             e.with_traceback())
                if not disable_mem_monitor:
                    memory_monitor.kill()
                raise e


if __name__ == '__main__':
    mp.set_start_method('spawn')
    args = parse_arguments()
    main(args)
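Stripped of the benchmark classes, the measurement pattern in `main()` reduces to warming up, then looping until both the minimum iteration count and the minimum wall-clock duration are met, timing each iteration with CUDA events. A minimal sketch of that pattern, where `step()` is a placeholder for the measured workload:

```python
import time

import torch


def measure(step, warm_up=2, num_runs=10, duration=60.0):
    # Warm up outside the timed region, as main() does.
    for _ in range(warm_up):
        step()
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    latencies, t0 = [], time.time()
    # Keep iterating until both the iteration count and wall-clock duration are met.
    while len(latencies) < num_runs or time.time() - t0 < duration:
        start_evt.record()
        step()
        end_evt.record()
        torch.cuda.synchronize()
        latencies.append(start_evt.elapsed_time(end_evt))  # milliseconds
    latencies.sort()
    mean = sum(latencies) / len(latencies)
    p95 = latencies[int(len(latencies) * 0.95)]
    return mean, p95
```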
@ -1,82 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch


class BenchmarkProfiler(object):
    cuda_event_dict: dict
    timer_dict: dict
    aux_info: dict
    started: bool
    is_recording_perf_profile: bool

    def __init__(self):
        self.cuda_event_dict = {}
        self.timer_dict = {}
        self.aux_info = {}
        self.started = False
        self.is_recording_perf_profile = False

    def clean(self):
        self.cuda_event_dict = {}
        self.timer_dict = {}
        self.aux_info = {}

    def start(self):
        self.started = True

    def stop(self):
        self.started = False

    def get_cuda_event(self, name: str):
        if name not in self.cuda_event_dict.keys():
            event = torch.cuda.Event(enable_timing=True)
            self.cuda_event_dict[name] = event
        return self.cuda_event_dict[name]

    def record_cuda_event(self, name: str):
        if not self.started:
            return
        event = self.get_cuda_event(name)
        event.record()

    def get_timer_value(self, timer_name: str):
        # timer is in milliseconds
        return self.timer_dict[timer_name]

    def record_elapsed_time(self, start_event_name: str, end_event_name: str,
                            timer_name: str):
        if timer_name not in self.timer_dict.keys():
            self.timer_dict[timer_name] = 0.0
        if not self.started:
            return
        self.get_cuda_event(start_event_name).synchronize()
        self.get_cuda_event(end_event_name).synchronize()
        self.timer_dict[timer_name] += self.get_cuda_event(
            start_event_name).elapsed_time(self.get_cuda_event(end_event_name))

    def get_aux_info(self, aux_name):
        return self.aux_info[aux_name]

    def add_aux_info(self, aux_name: str, add_value):
        if aux_name not in self.aux_info.keys():
            self.aux_info[aux_name] = 0
        if not self.started:
            return
        self.aux_info[aux_name] += add_value

    def set_recording_perf_profile(self, value: bool):
        self.is_recording_perf_profile = value
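A usage sketch of this class as a benchmark loop might drive it; the event and timer names are arbitrary labels, and `run_one_iteration()` is a placeholder for the measured work:

```python
# Usage sketch for BenchmarkProfiler; names are arbitrary, workload is a placeholder.
profiler = BenchmarkProfiler()
profiler.start()
for _ in range(10):
    profiler.record_cuda_event('step_begin')
    run_one_iteration()  # placeholder for the measured work
    profiler.record_cuda_event('step_end')
    # Accumulates elapsed milliseconds between the two events into 'step_time'.
    profiler.record_elapsed_time('step_begin', 'step_end', 'step_time')
    profiler.add_aux_info('iter_count', 1)
profiler.stop()
print(profiler.get_timer_value('step_time'), profiler.get_aux_info('iter_count'))
```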
@ -1,137 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

# isort: off
import torch
import tensorrt as trt
# isort: on
from base_benchmark import BaseBenchmark

import tensorrt_llm
from tensorrt_llm._utils import trt_dtype_to_torch
from tensorrt_llm.runtime import TensorInfo


class BERTBenchmark(BaseBenchmark):

    def __init__(self, args, batch_sizes, in_lens, gpu_weights_percents, rank,
                 world_size):
        super().__init__(args.engine_dir, args.model, args.dtype, rank,
                         world_size)
        self.batch_sizes = batch_sizes
        self.in_lens = in_lens
        self.build_time = 0
        self.gpu_weights_percents = gpu_weights_percents

        # Deserialize engine from engine directory
        self.serialize_path = os.path.join(args.engine_dir, self.engine_name)
        with open(self.serialize_path, 'rb') as f:
            engine_buffer = f.read()
        assert engine_buffer is not None

        self.session = tensorrt_llm.runtime.Session.from_serialized_engine(
            engine_buffer)

        # Print context memory size for CI/CD to track.
        context_mem_size = self.session.context_mem_size
        print(
            f"Allocated {context_mem_size / 1048576.0:.2f} MiB for execution context memory."
        )

    def get_config(self):
        for inlen in self.in_lens:
            if inlen > self.max_input_len:
                continue
            for batch_size in self.batch_sizes:
                if batch_size > self.max_batch_size:
                    continue
                for gpu_weights_percent in self.gpu_weights_percents:
                    yield (batch_size, inlen, gpu_weights_percent)

    def set_weight_streaming(self, config):
        gpu_weights_percent = config[2]
        self.session._set_weight_streaming(gpu_weights_percent)

    def prepare_inputs(self, config):
        batch_size, inlen = config[0], config[1]
        input_ids = torch.randint(100, (batch_size, inlen)).int().cuda()
        input_lengths = inlen * torch.ones(
            (batch_size, ), dtype=torch.int32, device='cuda')
        inputs = {'input_ids': input_ids, 'input_lengths': input_lengths}
        output_info = self.session.infer_shapes([
            TensorInfo('input_ids', trt.DataType.INT32, input_ids.shape),
            TensorInfo('input_lengths', trt.DataType.INT32, input_lengths.shape)
        ])
        outputs = {
            t.name:
            torch.empty(tuple(t.shape),
                        dtype=trt_dtype_to_torch(t.dtype),
                        device='cuda')
            for t in output_info
        }
        stream = torch.cuda.current_stream().cuda_stream
        return (inputs, outputs, stream)

    def run(self, inputs, config, benchmark_profiler=None):
        ok = self.session.run(*inputs)
        assert ok, "Runtime execution failed"
        torch.cuda.synchronize()

    def report(self, config, latency, percentile95, percentile99,
               peak_gpu_used):
        if self.runtime_rank == 0:
            line = '[BENCHMARK] ' + (
                f'model_name {self.model_name} world_size {self.world_size} precision {self.dtype} '
                f'batch_size {config[0]} input_length {config[1]} gpu_peak_mem(gb) {peak_gpu_used} '
                f'build_time(s) {self.build_time} percentile95(ms) {percentile95} '
                f'percentile99(ms) {percentile99} latency(ms) {latency}')
            print(line)

    # NOTE: this second definition of report() overrides the one above.
    def report(self,
               config,
               latency,
               percentile95,
               percentile99,
               peak_gpu_used,
               csv,
               benchmark_profiler=None):
        report_dict = super().get_report_dict()
        batch_size, inlen = config[0], config[1]
        report_dict["num_heads"] = self.num_heads
        report_dict["num_kv_heads"] = self.num_heads
        report_dict["num_layers"] = self.num_layers
        report_dict["hidden_size"] = self.hidden_size
        report_dict["vocab_size"] = self.vocab_size
        report_dict["batch_size"] = batch_size
        report_dict["input_length"] = inlen
        report_dict["output_length"] = "n/a"
        report_dict["gpu_weights_percent"] = config[2]
        report_dict["latency(ms)"] = latency
        report_dict["build_time(s)"] = self.build_time
        report_dict["tokens_per_sec"] = "n/a"
        report_dict["percentile95(ms)"] = percentile95
        report_dict["percentile99(ms)"] = percentile99
        report_dict["gpu_peak_mem(gb)"] = peak_gpu_used
        if self.runtime_rank == 0:
            if csv:
                line = ",".join([str(v) for v in report_dict.values()])
                print(line)
                with open(self.get_csv_filename(), "a") as file:
                    file.write(line + "\n")
            else:
                kv_pairs = [f"{k} {v}" for k, v in report_dict.items()]
                line = '[BENCHMARK] ' + " ".join(kv_pairs)
                print(line)
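`prepare_inputs` above packs the positional arguments for `Session.run(inputs, outputs, stream)`, which is why `run()` can simply unpack the tuple. A condensed sketch of that calling pattern, assuming a `session` built as in `__init__` and with placeholder shapes:

```python
# Condensed sketch of the Session calling pattern used above; shapes are placeholders.
inputs = {
    'input_ids': torch.randint(100, (8, 128)).int().cuda(),
    'input_lengths': torch.full((8, ), 128, dtype=torch.int32, device='cuda'),
}
output_info = session.infer_shapes([
    TensorInfo('input_ids', trt.DataType.INT32, inputs['input_ids'].shape),
    TensorInfo('input_lengths', trt.DataType.INT32, inputs['input_lengths'].shape),
])
outputs = {
    t.name: torch.empty(tuple(t.shape),
                        dtype=trt_dtype_to_torch(t.dtype),
                        device='cuda')
    for t in output_info
}
ok = session.run(inputs, outputs, torch.cuda.current_stream().cuda_stream)
assert ok, "Runtime execution failed"
torch.cuda.synchronize()
```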
@ -1,174 +0,0 @@
import json
import os
from enum import Enum

import evaluate
import nltk
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, LlamaTokenizerFast

nltk.download("punkt", quiet=False)
nltk.download('punkt_tab')
import argparse


class Model(Enum):
    Llama_v2_70B = 1
    GPT_J = 2


ACCURACY_TARGETS = {
    Model.Llama_v2_70B: {
        "rouge1": 44.4312 * 0.999,
        "rouge2": 22.0352 * 0.999,
        "rougeL": 28.6162 * 0.999,
        "tokens_per_sample": 294.45 * 0.9
    },
    Model.GPT_J: {
        "rouge1": 42.9865 * 0.99,
        "rouge2": 20.1235 * 0.99,
        "rougeL": 29.9881 * 0.99,
    }
}


def get_reference_df(processed_dataset_file):
    data = pd.read_pickle(processed_dataset_file)
    return data["output"].tolist()


def get_reference_json(cnn_dailymail_valset):
    # Load from CNN dailymail
    with open(cnn_dailymail_valset, 'r') as fh:
        list_data_dict = json.load(fh)

    targets = [f"{example['output']}" for example in list_data_dict]

    print(f"Loaded {len(targets)} samples from {cnn_dailymail_valset}")
    return targets


def get_responses_json(response_file):
    f = open(response_file)
    responses = json.load(f)
    ordered_responses = sorted(responses, key=lambda x: int(x['response_id']))
    return ordered_responses


def postprocess_text(preds, targets):
    # Post-process output texts for ROUGE evaluation
    preds = [pred.strip() for pred in preds]
    targets = [target.strip() for target in targets]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    targets = ["\n".join(nltk.sent_tokenize(target)) for target in targets]

    return preds, targets


def strip_eos(pred_toks, eos_id):
    while len(pred_toks) > 0 and pred_toks[-1] == eos_id:
        pred_toks.pop()
    if len(pred_toks) == 0:
        raise RuntimeError("Empty output sequence detected with EOS")
    return pred_toks


def calculate_toks_per_sample(preds, eos_id):
    preds = [strip_eos(pred, eos_id) for pred in preds]
    avg_len = sum(len(pred) for pred in preds)
    num_samples = len(preds)
    return avg_len / num_samples


def calculate_rouge_score(preds, targets, rouge_dir=None):
    print("Calculating ROUGE scores...")
    rouge_dir = rouge_dir if rouge_dir and os.path.exists(
        rouge_dir) else "rouge"
    metric = evaluate.load(rouge_dir)
    preds, targets = postprocess_text(preds, targets[0:len(preds)])
    result = metric.compute(predictions=preds,
                            references=targets,
                            use_stemmer=True,
                            use_aggregator=False)
    result = {k: round(np.mean(v) * 100, 4) for k, v in result.items()}
    prediction_lens = [len(pred) for pred in preds]
    result["gen_len"] = np.sum(prediction_lens)
    result["gen_num"] = len(preds)

    return result


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset",
        type=str,
        help=
        "Path to the reference dataset against which the responses are evaluated for accuracy. MLPerf uses open-orca (pkl) and cnn-dailymail (np) for Llama2-70B and GPT-J respectively."
    )
    parser.add_argument(
        "--responses",
        type=str,
        help="Path to the json file holding the responses from our benchmark run"
    )
    parser.add_argument("--base_model",
                        type=str,
                        help="Location of the model used (to create tokenizer)")

    parser.add_argument(
        '--rouge_dir',
        default=None,
        type=str,
        help=
        "evaluate.load('rouge') will attempt to pull rouge package from HF. Use cached rouge can avoid network outage of host or HF."
    )

    args = parser.parse_args()

    return args


def main():
    args = parse_arguments()

    if args.dataset.lower().endswith(".pkl"):
        target_texts = get_reference_df(args.dataset)
        model = Model.Llama_v2_70B
        tokenizer = LlamaTokenizerFast.from_pretrained(args.base_model)
    elif args.dataset.lower().endswith(".json"):
        target_texts = get_reference_json(args.dataset)
        model = Model.GPT_J
        tokenizer = AutoTokenizer.from_pretrained(args.base_model,
                                                  model_max_length=2047,
                                                  padding_side="left",
                                                  use_fast=False)
        tokenizer.pad_token = tokenizer.eos_token
    else:
        raise RuntimeError(
            "Dataset expected to be pkl (open-orca) or json (cnn-dailymail)")

    pred_out = get_responses_json(args.responses)
    pred_toks = [x['response_tokens'] for x in pred_out]

    tps_score = calculate_toks_per_sample(pred_toks, tokenizer.eos_token)

    pred_texts = tokenizer.batch_decode(pred_toks, skip_special_tokens=True)
    achieved_scores = calculate_rouge_score(pred_texts, target_texts,
                                            args.rouge_dir)

    achieved_scores['tokens_per_sample'] = tps_score
    targets = ACCURACY_TARGETS[model]

    print("Achieved rouge scores: ", achieved_scores)
    print("Tokens per sample: ", tps_score)
    print("Targets: ", targets)

    for k, _ in targets.items():
        assert targets[k] <= achieved_scores[k]


if __name__ == "__main__":
    main()
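The `--responses` file is expected to be a JSON list whose entries carry `response_id` and `response_tokens` fields, as implied by `get_responses_json` and the `pred_toks` extraction above. An illustrative sketch of that shape (the token ids are made up):

```python
# Illustrative shape of the --responses file consumed above; token ids are placeholders.
example_responses = [
    {"response_id": 0, "response_tokens": [318, 262, 3280, 50256]},
    {"response_id": 1, "response_tokens": [1212, 318, 257, 21678, 50256]},
]
```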
@ -1,456 +0,0 @@
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os

# isort: off
import torch
# isort: on
from base_benchmark import BaseBenchmark

import tensorrt_llm
from tensorrt_llm._utils import (trt_dtype_to_torch, str_dtype_to_trt)
from tensorrt_llm.quantization import QuantMode
from tensorrt_llm.runtime.session import TensorInfo
from tensorrt_llm.runtime import ModelConfig
from tensorrt_llm.models.modeling_utils import get_kv_cache_type_from_legacy


class EncDecBenchmark(BaseBenchmark):

    def __init__(self, args, batch_sizes, in_out_lens, gpu_weights_percents,
                 rank, world_size):
        self.engine_dir = args.engine_dir
        self.model_name = args.model
        self.enable_fp8 = False  # hardcode for enc-dec models
        self.dtype = args.dtype
        self.runtime_rank = rank
        self.world_size = world_size
        self.csv_filename = ""  # lazy init
        self.batch_sizes = batch_sizes
        self.in_out_lens = in_out_lens
        self.num_beams = args.num_beams
        self.build_time = 0
        self.quant_mode = QuantMode(0)
        # In current implementation, encoder and decoder have the same name,
        # builder config, and plugin config. But they can be different in the future.
        # So we use separate variables for encoder and decoder here.
        self.encoder_engine_model_name = args.model
        self.decoder_engine_model_name = args.model
        self.gpu_weights_percents = gpu_weights_percents

        # only for whisper parameter
        self.n_mels = 0

        if self.engine_dir is not None:

            def read_config(component):
                # almost same as enc_dec_model_runner.py::read_config()
                config_path = os.path.join(self.engine_dir, component,
                                           "config.json")
                with open(config_path, "r") as f:
                    config = json.load(f)

                builder_config = config['build_config']
                plugin_config = builder_config['plugin_config']
                pretrained_config = config['pretrained_config']
                lora_config = builder_config['lora_config']
                auto_parallel_config = builder_config['auto_parallel_config']
                use_gpt_attention_plugin = plugin_config["gpt_attention_plugin"]
                gemm_allreduce_plugin = plugin_config["gemm_allreduce_plugin"]
                remove_input_padding = plugin_config["remove_input_padding"]
                use_lora_plugin = plugin_config["lora_plugin"]
                tp_size = pretrained_config['mapping']['tp_size']
                pp_size = pretrained_config['mapping']['pp_size']
                auto_parallel_config['gpus_per_node']
                world_size = tp_size * pp_size
                assert world_size == tensorrt_llm.mpi_world_size(), \
                    f'Engine world size ({world_size}) != Runtime world size ({tensorrt_llm.mpi_world_size()})'
                num_heads = pretrained_config["num_attention_heads"]
                hidden_size = pretrained_config["hidden_size"]
                head_size = pretrained_config["head_size"]
                vocab_size = pretrained_config["vocab_size"]
                max_batch_size = builder_config["max_batch_size"]
                max_beam_width = builder_config["max_beam_width"]
                num_layers = pretrained_config["num_hidden_layers"]
                num_kv_heads = pretrained_config.get('num_kv_heads', num_heads)

                assert (num_heads % tp_size) == 0
                num_heads = num_heads // tp_size
                hidden_size = hidden_size // tp_size
                num_kv_heads = (num_kv_heads + tp_size - 1) // tp_size

                cross_attention = pretrained_config[
                    "architecture"] == "DecoderModel"
                skip_cross_kv = pretrained_config.get('skip_cross_kv', False)
                has_position_embedding = pretrained_config[
                    "has_position_embedding"]
                has_token_type_embedding = hasattr(pretrained_config,
                                                   "type_vocab_size")
                dtype = pretrained_config["dtype"]

                paged_kv_cache = plugin_config['paged_kv_cache']
                kv_cache_type = get_kv_cache_type_from_legacy(
                    True, paged_kv_cache)

                tokens_per_block = plugin_config['tokens_per_block']

                gather_context_logits = builder_config.get(
                    'gather_context_logits', False)
                gather_generation_logits = builder_config.get(
                    'gather_generation_logits', False)
                max_prompt_embedding_table_size = builder_config.get(
                    'max_prompt_embedding_table_size', 0)

                model_config = ModelConfig(
                    num_heads=num_heads,
                    num_kv_heads=num_kv_heads,
                    hidden_size=hidden_size,
                    head_size=head_size,
                    max_batch_size=max_batch_size,
                    max_beam_width=max_beam_width,
                    vocab_size=vocab_size,
                    num_layers=num_layers,
                    gpt_attention_plugin=use_gpt_attention_plugin,
                    gemm_allreduce_plugin=gemm_allreduce_plugin,
                    remove_input_padding=remove_input_padding,
                    kv_cache_type=kv_cache_type,
                    tokens_per_block=tokens_per_block,
                    cross_attention=cross_attention,
                    has_position_embedding=has_position_embedding,
                    has_token_type_embedding=has_token_type_embedding,
                    dtype=dtype,
                    gather_context_logits=gather_context_logits,
                    gather_generation_logits=gather_generation_logits,
                    max_prompt_embedding_table_size=
                    max_prompt_embedding_table_size,
                    lora_plugin=use_lora_plugin,
                    lora_target_modules=lora_config.get('lora_target_modules'),
                    trtllm_modules_to_hf_modules=lora_config.get(
                        'trtllm_modules_to_hf_modules'),
                    skip_cross_kv=skip_cross_kv,
                )

                # additional info for benchmark
                self.max_batch_size = config["build_config"]["max_batch_size"]
                self.max_input_len = config["build_config"][
                    "max_encoder_input_len"]
                self.max_seq_len = config["build_config"]["max_seq_len"]
                if component == "decoder":
                    self.decoder_start_token_id = pretrained_config[
                        'decoder_start_token_id']

                return model_config

            self.encoder_model_config = read_config("encoder")
            self.decoder_model_config = read_config("decoder")

        self.encoder_engine_name = 'rank{}.engine'.format(self.runtime_rank)
        self.decoder_engine_name = 'rank{}.engine'.format(self.runtime_rank)
        self.encoder_runtime_mapping = tensorrt_llm.Mapping(
            world_size=self.world_size,
            rank=self.runtime_rank,
            tp_size=self.world_size,
        )
        self.decoder_runtime_mapping = tensorrt_llm.Mapping(
            world_size=self.world_size,
            rank=self.runtime_rank,
            tp_size=self.world_size,
        )

        torch.cuda.set_device(self.runtime_rank %
                              self.encoder_runtime_mapping.gpus_per_node)
        self.device = torch.cuda.current_device()

        # Deserialize engine from engine directory
        self.encoder_serialize_path = os.path.join(self.engine_dir, "encoder",
                                                   self.encoder_engine_name)
        with open(self.encoder_serialize_path, "rb") as f:
            encoder_engine_buffer = f.read()
        assert encoder_engine_buffer is not None
        self.decoder_serialize_path = os.path.join(self.engine_dir, "decoder",
                                                   self.decoder_engine_name)
        with open(self.decoder_serialize_path, "rb") as f:
            decoder_engine_buffer = f.read()
        assert decoder_engine_buffer is not None

        # session setup
        self.encoder_session = tensorrt_llm.runtime.Session.from_serialized_engine(
            encoder_engine_buffer)
        self.decoder_session = tensorrt_llm.runtime.GenerationSession(
            self.decoder_model_config, decoder_engine_buffer,
            self.decoder_runtime_mapping)

        # Print context memory size for CI/CD to track.
        context_mem_size = self.encoder_session.context_mem_size + self.decoder_session.context_mem_size
        print(
            f"Allocated {context_mem_size / 1048576.0:.2f} MiB for execution context memory."
        )

    def get_config(self):
        if 'whisper' in self.model_name:
            print(
                f"[WARNING] whisper benchmark is input_len=1500, no text prompt, output_len=arbitrary"
            )
        for inlen, outlen in self.in_out_lens:
            if (inlen > self.max_input_len or outlen > self.max_seq_len):
                print(
                    f"[WARNING] check inlen({inlen}) <= max_inlen({self.max_input_len}) and "
                    f"outlen({outlen}) <= max_seqlen({self.max_seq_len}) failed, skipping."
                )
                continue
            for batch_size in self.batch_sizes:
                if batch_size > self.max_batch_size:
                    print(
                        f"[WARNING] check batch_size({batch_size}) "
                        f"<= max_batch_size({self.max_batch_size}) failed, skipping."
                    )
                    continue
                for gpu_weights_percent in self.gpu_weights_percents:
                    yield (batch_size, inlen, outlen, gpu_weights_percent)

    def set_weight_streaming(self, config):
        gpu_weights_percent = config[3]
        self.encoder_session._set_weight_streaming(gpu_weights_percent)
        self.decoder_session.runtime._set_weight_streaming(gpu_weights_percent)

    def prepare_inputs(self, config):
        batch_size, encoder_input_len, output_len = config[0], config[
            1], config[2]
        attention_mask = None
        whisper_decoder_encoder_input_lengths = None
        outputs = {}
        if 'whisper' in self.model_name:
            # feature_len always fixed 3000 now
            feature_len = 3000
            encoder_input_ids = (torch.randint(
                1, 100, (batch_size, self.n_mels, feature_len)).int().cuda())
            encoder_input_lengths = torch.tensor(
                [
                    encoder_input_ids.shape[2] // 2
                    for _ in range(encoder_input_ids.shape[0])
                ],
                dtype=torch.int32,
                device=self.device)
            decoder_input_ids = (torch.randint(1, 100, (1, )).int().cuda())
            decoder_input_ids = decoder_input_ids.repeat(
                (encoder_input_ids.shape[0], 1))
            output_list = [
                TensorInfo('input_features', str_dtype_to_trt(self.dtype),
                           encoder_input_ids.shape),
                TensorInfo('input_lengths', str_dtype_to_trt('int32'),
                           encoder_input_lengths.shape)
            ]
            output_info = (self.encoder_session).infer_shapes(output_list)
            outputs = {
                t.name:
                torch.empty(tuple(t.shape),
                            dtype=trt_dtype_to_torch(t.dtype),
                            device='cuda')
                for t in output_info
            }
            whisper_decoder_encoder_input_lengths = torch.tensor(
                [
                    outputs['encoder_output'].shape[1]
                    for x in range(outputs['encoder_output'].shape[0])
                ],
                dtype=torch.int32,
                device='cuda')

            decoder_input_lengths = torch.tensor(
                [
                    decoder_input_ids.shape[-1]
                    for _ in range(decoder_input_ids.shape[0])
                ],
                dtype=torch.int32,
                device='cuda')
            cross_attention_mask = torch.ones([
                outputs['encoder_output'].shape[0],
                decoder_input_lengths.max() + output_len,
                outputs['encoder_output'].shape[1]
            ]).int().cuda()
        else:
            encoder_input_ids = (torch.randint(
                100, (batch_size, encoder_input_len)).int().cuda())
            decoder_input_ids = torch.IntTensor(
                [[self.decoder_start_token_id]]).to(self.device)
            decoder_input_ids = decoder_input_ids.repeat((batch_size, 1))
            encoder_input_lengths = torch.tensor([encoder_input_len] *
                                                 batch_size,
                                                 dtype=torch.int32,
                                                 device=self.device)
            decoder_input_lengths = torch.tensor([1] * batch_size,
                                                 dtype=torch.int32,
                                                 device=self.device)

            if self.encoder_model_config.remove_input_padding:
                encoder_input_ids = torch.flatten(encoder_input_ids)
                decoder_input_ids = torch.flatten(decoder_input_ids)

            # attention mask, always set 1 as if all are valid tokens
            attention_mask = torch.ones(
                (batch_size, encoder_input_len)).int().cuda()
            # cross attention mask, always set 1 as if all are valid tokens
            # [batch_size, query_len, encoder_input_len] currently, use query_len=1
            cross_attention_mask = [
                torch.ones(decoder_input_lengths.max() + output_len,
                           encoder_input_len).int().cuda()
                for _ in range(batch_size)
            ]

        hidden_size = (self.encoder_model_config.hidden_size *
                       self.world_size)  # tp_size
        hidden_states_shape = (
            encoder_input_ids.shape[0],
            hidden_size,
        ) if self.encoder_model_config.remove_input_padding else (
            encoder_input_ids.shape[0],
            encoder_input_ids.shape[1],
            hidden_size,
        )
        hidden_states_dtype = lambda name: trt_dtype_to_torch(
            self.encoder_session.engine.get_tensor_dtype(name))

        outputs["encoder_output"] = torch.empty(
            hidden_states_shape,
            dtype=hidden_states_dtype("encoder_output"),
            device=self.device,
        ).contiguous()

        stream = torch.cuda.current_stream().cuda_stream
        return (
            encoder_input_ids,
            encoder_input_lengths,
            attention_mask,
            decoder_input_ids,
            decoder_input_lengths,
            cross_attention_mask,
            whisper_decoder_encoder_input_lengths,
            outputs,
            stream,
        )

    def run(self, inputs, config, benchmark_profiler=None):
        output_len = config[2]
        (
            encoder_input_ids,
            encoder_input_lengths,
            attention_mask,
            decoder_input_ids,
            decoder_input_lengths,
            cross_attention_mask,
            whisper_decoder_encoder_input_lengths,
            outputs,
            stream,
        ) = inputs

        hidden_states_dtype = lambda name: trt_dtype_to_torch(
            self.encoder_session.engine.get_tensor_dtype(name))

        # input tensors
        inputs = {}
        if 'whisper' in self.model_name:
            inputs['input_features'] = encoder_input_ids.contiguous()
            inputs["input_lengths"] = encoder_input_lengths
        else:
            inputs["input_ids"] = encoder_input_ids.contiguous()
            inputs["input_lengths"] = encoder_input_lengths
            inputs["max_input_length"] = torch.empty(
                (self.max_input_len, ),
                dtype=hidden_states_dtype("max_input_length"),
                device=self.device,
            ).contiguous()

            if not self.encoder_model_config.gpt_attention_plugin:
                inputs["attention_mask"] = attention_mask.contiguous()

            if self.encoder_model_config.has_position_embedding:
                bsz, seq_len = encoder_input_ids.shape[:2]
                position_ids = torch.arange(
                    seq_len, dtype=torch.int32,
                    device=encoder_input_ids.device).expand(bsz, -1)
                inputs['position_ids'] = position_ids.contiguous()
|
||||
|
||||
# run encoder
|
||||
self.encoder_session.set_shapes(inputs)
|
||||
ok = self.encoder_session.run(inputs, outputs, stream)
|
||||
assert ok, "Runtime execution failed"
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# run decoder
|
||||
sampling_config = tensorrt_llm.runtime.SamplingConfig(
|
||||
end_id=1, pad_id=0, num_beams=self.num_beams, min_length=output_len)
|
||||
encoder_output = outputs["encoder_output"]
|
||||
encoder_max_input_length = encoder_output.shape[
|
||||
1] if 'whisper' in self.model_name else torch.max(
|
||||
encoder_input_lengths).item()
|
||||
|
||||
self.decoder_session.setup(
|
||||
decoder_input_lengths.size(0),
|
||||
torch.max(decoder_input_lengths).item(),
|
||||
output_len,
|
||||
beam_width=self.num_beams,
|
||||
max_attention_window_size=None,
|
||||
encoder_max_input_length=encoder_max_input_length,
|
||||
)
|
||||
|
||||
self.decoder_session.decode(
|
||||
decoder_input_ids,
|
||||
decoder_input_lengths,
|
||||
sampling_config,
|
||||
encoder_output=encoder_output,
|
||||
encoder_input_lengths=whisper_decoder_encoder_input_lengths
|
||||
if 'whisper' in self.model_name else encoder_input_lengths,
|
||||
cross_attention_mask=cross_attention_mask,
|
||||
)
|
||||
|
||||
def report(self,
|
||||
config,
|
||||
latency,
|
||||
percentile95,
|
||||
percentile99,
|
||||
peak_gpu_used,
|
||||
csv,
|
||||
benchmark_profiler=None):
|
||||
# Note: Theoretically, the encoder and decoder can have different configs.
|
||||
# But for current implementation, we assume they are the same. In the future,
|
||||
# we can have a special structure of report_dict for enc-dec models.
|
||||
report_dict = super().get_report_dict()
|
||||
batch_size, encoder_input_len, output_len = config[0], config[
|
||||
1], config[2]
|
||||
tokens_per_sec = round(batch_size * output_len / (latency / 1000), 2)
|
||||
report_dict["num_heads"] = self.encoder_model_config.num_heads
|
||||
report_dict["num_kv_heads"] = self.encoder_model_config.num_kv_heads
|
||||
report_dict["num_layers"] = self.encoder_model_config.num_layers
|
||||
report_dict["hidden_size"] = self.encoder_model_config.hidden_size
|
||||
report_dict["vocab_size"] = self.encoder_model_config.vocab_size
|
||||
report_dict["batch_size"] = batch_size
|
||||
report_dict["input_length"] = encoder_input_len
|
||||
report_dict["output_length"] = output_len
|
||||
report_dict["gpu_weights_percent"] = config[3]
|
||||
report_dict["latency(ms)"] = latency
|
||||
report_dict["build_time(s)"] = self.build_time
|
||||
report_dict["tokens_per_sec"] = tokens_per_sec
|
||||
report_dict["percentile95(ms)"] = percentile95
|
||||
report_dict["percentile99(ms)"] = percentile99
|
||||
report_dict["gpu_peak_mem(gb)"] = peak_gpu_used
|
||||
if self.runtime_rank == 0:
|
||||
if csv:
|
||||
line = ",".join([str(v) for v in report_dict.values()])
|
||||
print(line)
|
||||
with open(self.get_csv_filename(), "a") as file:
|
||||
file.write(line + "\n")
|
||||
else:
|
||||
kv_pairs = [f"{k} {v}" for k, v in report_dict.items()]
|
||||
line = "[BENCHMARK] " + " ".join(kv_pairs)
|
||||
print(line)
|
||||
@ -1,291 +0,0 @@
|
||||
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import json
|
||||
from math import ceil
|
||||
|
||||
import pandas as pd
|
||||
import tensorrt as trt
|
||||
import torch
|
||||
|
||||
import tensorrt_llm
|
||||
from tensorrt_llm.bindings import KVCacheType
|
||||
from tensorrt_llm.builder import Engine
|
||||
from tensorrt_llm.runtime import (ChatGLMGenerationSession, GenerationSession,
|
||||
SamplingConfig)
|
||||
|
||||
from base_benchmark import BaseBenchmark # isort:skip
|
||||
|
||||
|
||||
def element_size(dtype: str):
|
||||
str_to_size_in_bytes = dict(float16=2,
|
||||
float32=4,
|
||||
int64=8,
|
||||
int32=4,
|
||||
int8=1,
|
||||
bool=1,
|
||||
bfloat16=2,
|
||||
fp8=1)
|
||||
return str_to_size_in_bytes[dtype]
|
||||
|
||||
|
||||
class GPTBenchmark(BaseBenchmark):
|
||||
|
||||
def __init__(self, args, batch_sizes, in_out_lens, gpu_weights_percents,
|
||||
rank, world_size):
|
||||
super().__init__(args.engine_dir, args.model, args.dtype, rank,
|
||||
world_size)
|
||||
self.batch_sizes = batch_sizes
|
||||
self.in_out_lens = in_out_lens
|
||||
self.gpu_weights_percents = gpu_weights_percents
|
||||
self.num_beams = args.num_beams
|
||||
self.cuda_graph_mode = args.enable_cuda_graph
|
||||
self.dump_layer_info = args.dump_layer_info
|
||||
|
||||
# Get build configs from engine directory is done in base class
|
||||
# Deserialize engine from engine directory
|
||||
engine = Engine.from_dir(args.engine_dir, rank)
|
||||
engine_buffer = engine.engine
|
||||
assert engine_buffer is not None
|
||||
pretrained_config = engine.config.pretrained_config
|
||||
if pretrained_config.architecture == 'ChatGLMForCausalLM' and pretrained_config.chatglm_version in [
|
||||
'glm', 'chatglm'
|
||||
]:
|
||||
session_cls = ChatGLMGenerationSession
|
||||
else:
|
||||
session_cls = GenerationSession
|
||||
|
||||
if not hasattr(self, 'num_kv_heads') or self.num_kv_heads is None:
|
||||
self.num_kv_heads = self.num_heads
|
||||
|
||||
rnn_config_items = [
|
||||
'conv_kernel', 'layer_types', 'rnn_hidden_size', 'state_size',
|
||||
'state_dtype', 'rnn_head_size', 'rnn_conv_dim_size'
|
||||
]
|
||||
rnn_configs_kwargs = {}
|
||||
for item in rnn_config_items:
|
||||
if hasattr(self, item):
|
||||
rnn_configs_kwargs[item] = getattr(self, item)
|
||||
|
||||
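# Prefer the engine's explicit kv_cache_type; older engine configs only expose
# the legacy paged_kv_cache flag, so fall back to it when kv_cache_type is absent.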
kv_cache_type = KVCacheType.CONTINUOUS
|
||||
if hasattr(self, 'kv_cache_type'):
|
||||
kv_cache_type = KVCacheType(self.kv_cache_type)
|
||||
else:
|
||||
if hasattr(self, 'paged_kv_cache'):
|
||||
kv_cache_type = KVCacheType.PAGED if self.paged_kv_cache == True else KVCacheType.CONTINUOUS
|
||||
|
||||
model_config = tensorrt_llm.runtime.ModelConfig(
|
||||
max_batch_size=self.max_batch_size,
|
||||
max_beam_width=self.num_beams,
|
||||
vocab_size=self.vocab_size,
|
||||
num_layers=self.num_layers,
|
||||
num_heads=self.num_heads // self.world_size,
|
||||
num_kv_heads=ceil(self.num_kv_heads / self.world_size),
|
||||
hidden_size=self.hidden_size // self.world_size,
|
||||
gpt_attention_plugin=self.use_gpt_attention_plugin,
|
||||
kv_cache_type=kv_cache_type,
|
||||
paged_state=self.paged_state
|
||||
if hasattr(self, 'paged_state') else False,
|
||||
dtype=self.dtype,
|
||||
remove_input_padding=self.remove_input_padding,
|
||||
quant_mode=self.quant_mode,
|
||||
tokens_per_block=self.tokens_per_block if hasattr(
|
||||
self, 'tokens_per_block') else 32,
|
||||
mamba_conv1d_plugin=self.use_mamba_conv1d_plugin,
|
||||
gpu_weights_percent=list(sorted(gpu_weights_percents))[0],
|
||||
**rnn_configs_kwargs,
|
||||
)
|
||||
self.sampling_config = SamplingConfig(end_id=2, pad_id=0)
|
||||
self.decoder = session_cls(model_config,
|
||||
engine_buffer,
|
||||
self.runtime_mapping,
|
||||
cuda_graph_mode=self.cuda_graph_mode)
|
||||
|
||||
# Print context memory size for CI/CD to track.
|
||||
context_mem_size = self.decoder.context_mem_size
|
||||
print(
|
||||
f"Allocated {context_mem_size / 1048576.0:.2f} MiB for execution context memory."
|
||||
)
|
||||
|
||||
def get_config(self):
|
||||
for inlen, outlen in self.in_out_lens:
|
||||
if inlen > self.max_input_len or inlen + outlen > self.max_seq_len:
|
||||
print(
|
||||
f'[WARNING] check inlen({inlen}) <= max_inlen({self.max_input_len}) or '
|
||||
f'seqlen({inlen + outlen}) <= max_seq_len({self.max_seq_len}) failed, skipping.'
|
||||
)
|
||||
continue
|
||||
for batch_size in self.batch_sizes:
|
||||
if batch_size > self.max_batch_size:
|
||||
print(
|
||||
f'[WARNING] check batch_size({batch_size}) '
|
||||
f'<= max_batch_size({self.max_batch_size}) failed, skipping.'
|
||||
)
|
||||
continue
|
||||
for gpu_weights_percent in self.gpu_weights_percents:
|
||||
yield (batch_size, inlen, outlen, gpu_weights_percent)
|
||||
|
||||
def set_weight_streaming(self, config):
|
||||
gpu_weights_percent = config[3]
|
||||
self.decoder.runtime._set_weight_streaming(gpu_weights_percent)
|
||||
|
||||
def prepare_inputs(self, config):
|
||||
batch_size, inlen, outlen = config[0], config[1], config[2]
|
||||
input_ids = torch.randint(100, (batch_size, inlen)).int().cuda()
|
||||
input_lengths = torch.tensor([inlen
|
||||
for _ in range(batch_size)]).int().cuda()
|
||||
|
||||
self.decoder.setup(batch_size, inlen, outlen, beam_width=self.num_beams)
|
||||
return (input_ids, input_lengths)
|
||||
|
||||
def get_report_dict(self, benchmark_profiler=None):
|
||||
report_dict = super().get_report_dict(
|
||||
benchmark_profiler=benchmark_profiler)
|
||||
if benchmark_profiler is not None:
|
||||
report_dict["generation_time(ms)"] = None
|
||||
report_dict["total_generated_tokens"] = None
|
||||
report_dict["generation_tokens_per_second"] = None
|
||||
return report_dict
|
||||
|
||||
def run(self, inputs, config, benchmark_profiler=None):
|
||||
batch_size, inlen, outlen = config[0], config[1], config[2]
|
||||
self.decoder.setup(batch_size, inlen, outlen, beam_width=self.num_beams)
|
||||
if self.remove_input_padding:
|
||||
self.decoder.decode_batch(inputs[0],
|
||||
self.sampling_config,
|
||||
benchmark_profiler=benchmark_profiler)
|
||||
else:
|
||||
self.decoder.decode(inputs[0],
|
||||
inputs[1],
|
||||
self.sampling_config,
|
||||
benchmark_profiler=benchmark_profiler)
|
||||
torch.cuda.synchronize()
|
||||
|
||||
def report(self,
|
||||
config,
|
||||
latency,
|
||||
percentile95,
|
||||
percentile99,
|
||||
peak_gpu_used,
|
||||
csv,
|
||||
benchmark_profiler=None):
|
||||
report_dict = super().get_report_dict()
|
||||
batch_size, inlen, outlen, gpu_weights_percent = config[0], config[
|
||||
1], config[2], config[3]
|
||||
tokens_per_sec = round(batch_size * outlen / (latency / 1000), 2)
|
||||
report_dict["num_heads"] = self.num_heads
|
||||
report_dict["num_kv_heads"] = self.num_kv_heads
|
||||
report_dict["num_layers"] = self.num_layers
|
||||
report_dict["hidden_size"] = self.hidden_size
|
||||
report_dict["vocab_size"] = self.vocab_size
|
||||
report_dict["batch_size"] = batch_size
|
||||
report_dict["gpu_weights_percent"] = gpu_weights_percent
|
||||
report_dict["input_length"] = inlen
|
||||
report_dict["output_length"] = outlen
|
||||
report_dict["latency(ms)"] = latency
|
||||
report_dict["tokens_per_sec"] = tokens_per_sec
|
||||
report_dict["percentile95(ms)"] = percentile95
|
||||
report_dict["percentile99(ms)"] = percentile99
|
||||
report_dict["gpu_peak_mem(gb)"] = peak_gpu_used
|
||||
if benchmark_profiler is not None:
|
||||
iter_count = benchmark_profiler.get_aux_info('iter_count')
|
||||
generation_time_ms = benchmark_profiler.get_timer_value(
|
||||
'generation_time')
|
||||
generation_step_count = benchmark_profiler.get_aux_info(
|
||||
'generation_step_count')
|
||||
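# Each generation step emits one token per beam for every sequence in the batch.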
token_per_step = batch_size * self.num_beams
|
||||
total_tokens = generation_step_count * token_per_step
|
||||
report_dict["generation_time(ms)"] = round(
|
||||
generation_time_ms / iter_count, 3)
|
||||
report_dict["total_generated_tokens"] = total_tokens / iter_count
|
||||
tokens_per_second = round(
|
||||
total_tokens * 1000.0 / generation_time_ms, 3)
|
||||
report_dict["generation_tokens_per_second"] = tokens_per_second
|
||||
|
||||
if self.runtime_rank == 0:
|
||||
if csv:
|
||||
line = ",".join([str(v) for v in report_dict.values()])
|
||||
print(line)
|
||||
with open(self.get_csv_filename(), "a") as file:
|
||||
file.write(line + "\n")
|
||||
else:
|
||||
kv_pairs = [f"{k} {v}" for k, v in report_dict.items()]
|
||||
line = '[BENCHMARK] ' + " ".join(kv_pairs)
|
||||
print(line)
|
||||
|
||||
if self.dump_layer_info:
|
||||
engine_inspector = self.decoder.engine_inspector
|
||||
inspector_result = engine_inspector.get_engine_information(
|
||||
trt.LayerInformationFormat.JSON)
|
||||
json_result = json.loads(inspector_result)
|
||||
layers = json_result["Layers"]
|
||||
for layer_idx, _ in enumerate(layers):
|
||||
layer_info = engine_inspector.get_layer_information(
|
||||
layer_idx, trt.LayerInformationFormat.ONELINE)
|
||||
print(layer_info)
|
||||
|
||||
def report_profiler(self, benchmark_profiler=None):
|
||||
if benchmark_profiler is not None and benchmark_profiler.is_recording_perf_profile:
|
||||
perf_profile_data = self.decoder.profiler.results
|
||||
if not perf_profile_data:
|
||||
tensorrt_llm.logger.error("profiler data is empty")
|
||||
return
|
||||
|
||||
ctx_layers = list()
|
||||
generation_layers = list()
|
||||
start = 0
|
||||
ctx_iter_cnt = 0
|
||||
generation_iter_cnt = 0
|
||||
|
||||
# split context/generations layer information
|
||||
for idx, layer_info in enumerate(perf_profile_data):
|
||||
if layer_info[0] == "step":
|
||||
if layer_info[1] == 0:
|
||||
ctx_layers.extend(perf_profile_data[start:idx])
|
||||
ctx_iter_cnt += 1
|
||||
else:
|
||||
generation_layers.extend(perf_profile_data[start:idx])
|
||||
generation_iter_cnt += 1
|
||||
start = idx + 1
|
||||
|
||||
# Reduce all data
|
||||
def reduce_layer_data(layers):
|
||||
layer_infos = dict()
|
||||
for layer in layers:
|
||||
if layer[0] in layer_infos:
|
||||
layer_infos[layer[0]] += layer[1]
|
||||
else:
|
||||
layer_infos[layer[0]] = layer[1]
|
||||
return layer_infos
|
||||
|
||||
# Dump kernel data
|
||||
def dump_kernel_profile_table(name: str, profile_data: list,
|
||||
iter_cnt: int):
|
||||
table = pd.DataFrame(
|
||||
[['{:0.3f}'.format(v), k]
|
||||
for k, v in profile_data.items() if v != 0.0],
|
||||
columns=['times (ms)', '{} Phase LayerName'.format(name)])
|
||||
|
||||
def ljust(s):
|
||||
s = s.astype(str).str.strip()
|
||||
return s.str.ljust(s.str.len().max())
|
||||
|
||||
print(table.apply(ljust).to_string(index=False, justify='left'))
|
||||
print("{} phase step iter: {}".format(name, iter_cnt))
|
||||
|
||||
ctx_layer_infos = reduce_layer_data(ctx_layers)
|
||||
generation_layer_infos = reduce_layer_data(generation_layers)
|
||||
dump_kernel_profile_table("Context", ctx_layer_infos, ctx_iter_cnt)
|
||||
dump_kernel_profile_table("Generation", generation_layer_infos,
|
||||
generation_iter_cnt)
|
||||
@ -1,38 +0,0 @@
|
||||
# Benchmark Multi-user Multi-round Serving with Llama-3.1-70B
|
||||
|
||||
## Overview
|
||||
This benchmark models a multi-user, multi-round serving system: it handles multiple users concurrently, with each user issuing a sequence of requests and responses over several rounds. It is representative of chatbots, customer support systems, and other interactive services that require stateful conversations.
|
||||
|
||||
#### Application Setup
|
||||
Each user is assigned a unique long context prompt consisting of 16,000 tokens with precomputed kv_cache.
|
||||
|
||||
* First Round: The input includes the 16,000-token context prompt and an additional 64 new input tokens. The output length is limited to 64 tokens.
|
||||
* Subsequent Rounds: The input is formed by combining the previous input, the output tokens from the last round, and 64 new input tokens. The output length is limited to 64 tokens.
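A minimal sketch of this composition rule (illustrative names only; the actual loop appears in the benchmark script later in this diff):

```python
# Sketch only. Because exclude_input_from_output is False in the benchmark,
# the response for round r already contains the full transcript (context plus
# all prior inputs and outputs), so round r+1 simply appends 64 fresh tokens.
def next_round_input(previous_response_tokens: list[int],
                     new_tokens: list[int]) -> list[int]:
    return previous_response_tokens + new_tokens
```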
|
||||
|
||||
#### Benchmark Features
|
||||
This benchmark leverages kv_cache reuse and allocates host (CPU) memory as a secondary pool for kv_cache blocks. It measures the end-to-end runtime of 10 rounds, with user requests processed in round-robin fashion. As the number of users increases, the kv_cache footprint exceeds the GPU memory capacity; in that case, less recently used cache blocks are offloaded to CPU memory and brought back to the GPU as needed for subsequent rounds.
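The executor configuration that enables this behavior is shown in the benchmark script later in this diff; a minimal sketch (the values mirror the script's defaults and are only placeholders):

```python
import tensorrt_llm.bindings.executor as trtllm

# Reuse KV cache blocks across rounds and give the executor a host (CPU)
# pool to which less recently used blocks can be offloaded.
kv_cache_config = trtllm.KvCacheConfig(
    enable_block_reuse=True,
    free_gpu_memory_fraction=0.9,      # script default
    host_cache_size=55_000_000_000,    # ~55 GB host pool, script default
)
executor_config = trtllm.ExecutorConfig(1, kv_cache_config=kv_cache_config)
```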
|
||||
|
||||
Additionally, the benchmark tracks the Time to First Token (TTFT). Since each user’s long context prompt has precomputed kv_cache, a new request can reuse this cache while processing the additional input tokens, ensuring efficient response generation of the first output token.
|
||||
|
||||
#### Comparing GH200 and H100
|
||||
This benchmark highlights the potential of the NVIDIA GH200 in comparison to the H100. The GH200 uses NVIDIA NVLink-C2C to provide a coherent CPU+GPU memory model with 900 GB/s of memcpy throughput, about 7x faster than an H100 connected via PCIe Gen5. The GH200 also has more on-GPU memory: it is offered in 96 GB and 144 GB configurations, while the H100 is equipped with 80 GB.
|
||||
|
||||
## Performance Comparison
|
||||
> NOTE: A GH200 with 96 GB of GPU memory is used to generate the results below.
|
||||
|
||||

|
||||
|
||||
#### On-GPU kv_cache Storage:
|
||||
The H100 can support 2 concurrent users with on-GPU kv_cache storage, whereas the GH200 can support 7 concurrent users, leveraging its larger GPU memory capacity.
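A rough back-of-envelope estimate makes these numbers plausible. It assumes the published Llama-3.1-70B configuration (80 layers, 8 KV heads, head size 128), FP8 weights as built by `run.sh`, and an FP8 (1 byte/element) KV cache; with an FP16 KV cache the per-user figure doubles. This is a sketch, not a reproduction of the allocator's exact behavior:

```python
# Per-token KV cache: K and V, for every layer and KV head (FP8 = 1 byte/elem).
bytes_per_token = 2 * 80 * 8 * 128          # ~160 KiB per token
tokens_per_user = 16_000 + 10 * (64 + 64)   # context + 10 rounds of input/output
kv_per_user_gb = bytes_per_token * tokens_per_user / 1e9   # ~2.8 GB per user

weights_gb = 70                              # ~70B parameters at FP8
h100_headroom_gb = 80 - weights_gb
gh200_headroom_gb = 96 - weights_gb
print(f"{kv_per_user_gb:.1f} GB/user, "
      f"H100 fits ~{h100_headroom_gb / kv_per_user_gb:.0f}, "
      f"GH200 fits ~{gh200_headroom_gb / kv_per_user_gb:.0f} (before overheads)")
```

Activation and runtime buffers, plus the script's `free_gpu_memory_fraction=0.9`, take several more gigabytes off these upper bounds, bringing them down toward the measured 2 and 7 users.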
|
||||
|
||||
#### User Size = 2:
|
||||
At a user size of 2, kv_cache is fully stored in GPU memory for both H100 and GH200. Performance improvements in this scenario are unrelated to NVLink-C2C or the larger GPU memory size of the GH200.
|
||||
|
||||
#### User Sizes 3 to 7:
|
||||
The H100 must offload kv_cache to CPU memory and transfer precomputed blocks back to GPU when needed. This additional memory transfer introduces latency due to the slower communication between CPU and GPU. GH200 can handle kv_cache entirely in GPU memory, eliminating the need for memory transfers. Thus, GH200’s performance improvement peaks at a user size of 7.
|
||||
|
||||
#### User Size > 7:
|
||||
Beyond 7 users, the GH200 must also use the CPU memory pool for kv_cache, but the added latency is much lower than on the H100 thanks to the faster CPU-GPU link. The GH200 delivers a 1.9x improvement in Time to First Token (TTFT) and roughly a 3x improvement in end-to-end runtime over 10 rounds compared to the H100.
|
||||
|
||||
## Reproduction
|
||||
Use **run.sh** to reproduce the benchmark.
|
||||
@ -1,191 +0,0 @@
|
||||
import argparse
|
||||
import datetime
|
||||
import json
|
||||
import random
|
||||
import time
|
||||
|
||||
import tensorrt_llm.bindings.executor as trtllm
|
||||
|
||||
output_config = trtllm.OutputConfig()
|
||||
output_config.exclude_input_from_output = False
|
||||
sampling_config = trtllm.SamplingConfig(1)
|
||||
|
||||
|
||||
def generate_random_tokens(rounds=10, count=64) -> list[list[int]]:
|
||||
ret = []
|
||||
for i in range(rounds):
|
||||
ret.append([random.randint(0, 1000) for _ in range(count)])
|
||||
return ret
|
||||
|
||||
|
||||
# Read input tokens from json file
|
||||
def read_input_json(input_dataset_path: str,
|
||||
num_users) -> tuple[list[list[int]], list[int]]:
|
||||
with open(input_dataset_path, "r") as f:
|
||||
data = json.load(f)
|
||||
|
||||
input_tokens = []
|
||||
output_lens = []
|
||||
for n in range(num_users):
|
||||
sample = data["samples"][n]
|
||||
input_tokens.append(sample["input_ids"])
|
||||
output_lens.append(sample["output_len"])
|
||||
|
||||
return input_tokens, output_lens
|
||||
|
||||
|
||||
# Prepare and enqueue the requests
|
||||
def enqueue_requests(args: argparse.Namespace, executor: trtllm.Executor,
|
||||
input_tokens) -> list[int]:
|
||||
request_ids = []
|
||||
for tokens in input_tokens:
|
||||
req = trtllm.Request(input_token_ids=tokens,
|
||||
max_tokens=args.output_len,
|
||||
streaming=False,
|
||||
sampling_config=sampling_config,
|
||||
output_config=output_config)
|
||||
req_id = executor.enqueue_request(req)
|
||||
request_ids.append(req_id)
|
||||
|
||||
return request_ids
|
||||
|
||||
|
||||
def get_TTFT(stats_queue):
|
||||
iter_latency = []
|
||||
cache_hit_rates = []
|
||||
for stats in stats_queue:
|
||||
iter_latency.append(stats.iter_latency_ms)
|
||||
cache_hit_rates.append(stats.kv_cache_stats.cache_hit_rate)
|
||||
|
||||
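# TTFT is read from an iteration whose KV cache hit rate exceeds 1%, i.e. an
# iteration that reuses the precomputed context blocks ([1] skips the first
# such iteration).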
TTFT_idx = [i for i, x in enumerate(cache_hit_rates) if x > 0.01][1]
|
||||
return iter_latency[TTFT_idx]
|
||||
|
||||
|
||||
# Wait for responses and store output tokens
|
||||
def wait_for_responses(args: argparse.Namespace, request_ids: list[int],
|
||||
executor: trtllm.Executor) -> list[list[int]]:
|
||||
|
||||
output_tokens = {req_id: [] for req_id in request_ids}
|
||||
num_finished = 0
|
||||
iterations = 0
|
||||
while (num_finished < len(request_ids) and iterations < args.timeout_ms):
|
||||
responses = executor.await_responses(
|
||||
datetime.timedelta(milliseconds=args.timeout_ms))
|
||||
for response in responses:
|
||||
req_id = response.request_id
|
||||
if not response.has_error():
|
||||
result = response.result
|
||||
num_finished += 1 if result.is_final else 0
|
||||
for _, outTokens in enumerate(result.output_token_ids):
|
||||
output_tokens[req_id].extend(outTokens)
|
||||
else:
|
||||
raise RuntimeError(
|
||||
str(req_id) + " encountered error:" + response.error_msg)
|
||||
|
||||
return list(output_tokens.values())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Executor Bindings Example")
|
||||
parser.add_argument("--n", type=int, required=True, help="Number of users")
|
||||
parser.add_argument("--free_gpu_memory_fraction",
|
||||
required=False,
|
||||
type=float,
|
||||
default=0.9,
|
||||
help="free_gpu_memory_fraction")
|
||||
parser.add_argument("--kv_host_cache_bytes",
|
||||
required=False,
|
||||
type=int,
|
||||
default=55000000000,
|
||||
help="host_cache_size")
|
||||
parser.add_argument("--model_path",
|
||||
type=str,
|
||||
required=True,
|
||||
help="Directory containing model engine")
|
||||
parser.add_argument("--input_dataset_path",
|
||||
type=str,
|
||||
required=True,
|
||||
help="Text file containing the input tokens")
|
||||
parser.add_argument("--beam_width",
|
||||
type=int,
|
||||
required=False,
|
||||
default=1,
|
||||
help="The beam width")
|
||||
parser.add_argument("--streaming",
|
||||
default=False,
|
||||
action="store_true",
|
||||
help="Operate in streaming mode")
|
||||
parser.add_argument("--output_len",
|
||||
type=int,
|
||||
required=False,
|
||||
default=64,
|
||||
help="The number of tokens to be generated for output.")
|
||||
parser.add_argument("--rounds",
|
||||
type=int,
|
||||
required=False,
|
||||
default=10,
|
||||
help="How many runs of user input to run.")
|
||||
parser.add_argument(
|
||||
"--timeout_ms",
|
||||
type=int,
|
||||
required=False,
|
||||
default=10000,
|
||||
help="The maximum time to wait for all responses, in milliseconds")
|
||||
parser.add_argument(
|
||||
"--log_iteration_data",
|
||||
action='store_true',
|
||||
help="Print the verbose iteration status data (default: False).")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
kv_cache_config = trtllm.KvCacheConfig(
|
||||
enable_block_reuse=True,
|
||||
free_gpu_memory_fraction=args.free_gpu_memory_fraction,
|
||||
host_cache_size=args.kv_host_cache_bytes)
|
||||
|
||||
executor_config = trtllm.ExecutorConfig(args.beam_width,
|
||||
kv_cache_config=kv_cache_config)
|
||||
|
||||
# Create the executor.
|
||||
executor = trtllm.Executor(args.model_path, trtllm.ModelType.DECODER_ONLY,
|
||||
executor_config)
|
||||
|
||||
new_inputs = [generate_random_tokens(args.rounds) for _ in range(args.n)]
|
||||
stats_queue = []
|
||||
|
||||
if executor.can_enqueue_requests():
|
||||
## Process long context to generate kvcache
|
||||
context_tokens, _ = read_input_json(args.input_dataset_path, args.n)
|
||||
|
||||
# Enqueue the requests
|
||||
request_ids = enqueue_requests(args, executor, context_tokens)
|
||||
|
||||
# Wait for the responses
|
||||
output_tokens = wait_for_responses(args, request_ids, executor)
|
||||
|
||||
stats_queue.extend(executor.get_latest_iteration_stats())
|
||||
|
||||
# Start the multi-turn runs
|
||||
## Start timing
|
||||
start_time = time.time()
|
||||
|
||||
for r in range(args.rounds):
|
||||
current_input_tokens = [
|
||||
output_tokens[i] + new_inputs[i][r] for i in range(args.n)
|
||||
]
|
||||
# Enqueue the requests
|
||||
request_ids = enqueue_requests(args, executor, current_input_tokens)
|
||||
|
||||
# Wait for the responses
|
||||
output_tokens = wait_for_responses(args, request_ids, executor)
|
||||
|
||||
stats_queue.extend(executor.get_latest_iteration_stats())
|
||||
## End timing
|
||||
end_time = time.time()
|
||||
elapsed_time = (end_time - start_time) * 1000
|
||||
print(f"E2E TIME: {elapsed_time:.2f} (ms)")
|
||||
print(f"TTFT: {get_TTFT(stats_queue)} (ms)")
|
||||
|
||||
if args.log_iteration_data:
|
||||
for stats in stats_queue:
|
||||
print(stats.to_json_str())
|
||||
Binary file not shown (removed image, 94 KiB).
@ -1,49 +0,0 @@
|
||||
#!/bin/bash
|
||||
|
||||
# Check if the environment variable is set
|
||||
if [[ -z "${HUGGING_FACE_HUB_TOKEN}" ]]; then
|
||||
echo "The environment variable HUGGING_FACE_HUB_TOKEN is not set."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Get GPU name using nvidia-smi
|
||||
gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader)
|
||||
|
||||
GPU="GH200"
|
||||
|
||||
# Check if the GPU is a GH200
|
||||
if echo "$gpu_name" | grep -q "GH200"; then
|
||||
GPU="GH200"
|
||||
else
|
||||
GPU="H100"
|
||||
fi
|
||||
|
||||
echo "Running with ${GPU}."
|
||||
|
||||
# Generate context prompts of 16,000 tokens for each user
|
||||
python3 $(pwd)/../../cpp/prepare_dataset.py \
|
||||
--output=$(pwd)/dataset.json \
|
||||
--tokenizer=meta-llama/Llama-3.1-70B token-norm-dist \
|
||||
--num-requests=20 \
|
||||
--input-mean=16000 \
|
||||
--output-mean=64 \
|
||||
--input-stdev=0 \
|
||||
--output-stdev=0
|
||||
|
||||
# Build the model
|
||||
trtllm-bench --workspace $(pwd)/${GPU} \
|
||||
--model meta-llama/Llama-3.1-70B \
|
||||
build \
|
||||
--max_batch_size 16 \
|
||||
--max_num_tokens 17800 \
|
||||
--max_seq_len 17800 \
|
||||
--quantization FP8
|
||||
|
||||
# Run the benchmark script
|
||||
for user_size in $(seq 2 16); do
|
||||
echo "Run benchmark with user size = ${user_size}."
|
||||
python3 benchmark.py \
|
||||
--model_path $(pwd)/${GPU}/meta-llama/Llama-3.1-70B/tp_1_pp_1 \
|
||||
--input_dataset_path dataset.json \
|
||||
--n ${user_size}
|
||||
done
|
||||
@ -1,92 +0,0 @@
|
||||
# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import os
|
||||
from multiprocessing import Event, Process, Queue
|
||||
from queue import Empty
|
||||
|
||||
from tensorrt_llm.logger import logger
|
||||
from tensorrt_llm.profiler import (MemUnitType, bytes_to_target_unit,
|
||||
device_memory_info, host_memory_info)
|
||||
|
||||
|
||||
class MemoryMonitor:
|
||||
|
||||
def __init__(self, query_interval=0.1, disable_host_mem_monitor=False):
|
||||
self.query_interval = query_interval # second(s)
|
||||
self.mem_monitor_process = None
|
||||
# bytes
|
||||
self._peak_host_memory = 0
|
||||
self._peak_device_memory = 0
|
||||
|
||||
self.pid = os.getpid()
|
||||
self.device_handles = {}
|
||||
|
||||
self.signal_event = Event() # Sending signal to subprocess
|
||||
self.peak_mem_queue = Queue() # Receiving results from subprocess
|
||||
|
||||
self.disable_host_mem_monitor = disable_host_mem_monitor
|
||||
|
||||
def start(self):
|
||||
self.mem_monitor_process = Process(target=self._upd_peak_memory_usage,
|
||||
args=(self.signal_event,
|
||||
self.peak_mem_queue))
|
||||
self.mem_monitor_process.start()
|
||||
logger.debug("Launched memory monitor subprocess.")
|
||||
|
||||
def kill(self):
|
||||
if self.mem_monitor_process is not None:
|
||||
self.mem_monitor_process.kill()
|
||||
logger.debug("Memory monitor subprocess is killed.")
|
||||
|
||||
def stop(self):
|
||||
self.signal_event.set()
|
||||
logger.debug("Sent signal to stop memory monitor subprocess.")
|
||||
|
||||
try:
|
||||
peak_mem_use = self.peak_mem_queue.get(timeout=20)
|
||||
except Empty:
|
||||
logger.warning("peak_mem_queue was empty.")
|
||||
else:
|
||||
self._peak_host_memory = max(self._peak_host_memory,
|
||||
peak_mem_use[0])
|
||||
self._peak_device_memory = max(self._peak_device_memory,
|
||||
peak_mem_use[1])
|
||||
|
||||
self.mem_monitor_process.join(timeout=20)
|
||||
self.mem_monitor_process = None
|
||||
logger.debug("Memory monitor subprocess joined.")
|
||||
self.peak_mem_queue.close()
|
||||
self.peak_mem_queue.join_thread()
|
||||
logger.debug("Peak memory queue closed and joined.")
|
||||
|
||||
def _upd_peak_memory_usage(self, signal_event, peak_mem_queue):
|
||||
peak_host_used, peak_device_used = self.get_memory_usage()
|
||||
while not signal_event.is_set():
|
||||
host_used, device_used = self.get_memory_usage()
|
||||
peak_host_used = max(host_used, peak_host_used)
|
||||
peak_device_used = max(device_used, peak_device_used)
|
||||
peak_mem_queue.put((peak_host_used, peak_device_used))
|
||||
|
||||
def get_memory_usage(self):
|
||||
if self.disable_host_mem_monitor:
|
||||
host_used = 0
|
||||
else:
|
||||
host_used, _, _ = host_memory_info(self.pid)
|
||||
device_used, _, _ = device_memory_info()
|
||||
return host_used, device_used
|
||||
|
||||
def get_peak_memory_usage(self, unit: MemUnitType = 'GiB'):
|
||||
return bytes_to_target_unit(self._peak_host_memory, unit), \
|
||||
bytes_to_target_unit(self._peak_device_memory, unit)
|
||||
@ -38,18 +38,6 @@ python3 examples/summarize.py \
|
||||
|
||||
```
|
||||
|
||||
We can also benchmark the efficiency of Weight Streaming. Here is an example:
|
||||
```bash
|
||||
python3 benchmarks/python/benchmark.py \
|
||||
--engine_dir /tmp/llama_7b/trt_engines/fp16/1-gpu/ \
|
||||
--batch_size "1;32" \
|
||||
--input_output_len "256,32" \
|
||||
--gpu_weights_percent "0.0;0.3;0.6;1.0" \
|
||||
--dtype float16 \
|
||||
--csv \
|
||||
--log_level verbose
|
||||
```
|
||||
|
||||
|
||||
### API Changes
|
||||
|
||||
|
||||
@ -18,12 +18,12 @@ This document shows how to build and run an Encoder-Decoder (Enc-Dec) model in T
|
||||
- [Run Python runtime](#run-python-runtime)
|
||||
- [Benchmark](#benchmark)
|
||||
- [Benchmark C++ runtime](#benchmark-c-runtime)
|
||||
- [Benchmark Python runtime](#benchmark-python-runtime)
|
||||
- [Run BART with LoRA](#run-bart-with-lora)
|
||||
- [Reminders](#reminders)
|
||||
- [Attention Scaling Factors](#attention-scaling-factors)
|
||||
- [Run FairSeq NMT (Neural Machine Translation) models](#run-fairseq-nmt-neural-machine-translation-models)
|
||||
- [FP8 Post-Training Quantization](#fp8-post-training-quantization)
|
||||
- [Get quantized checkpoint with ModelOpt](#get-quantized-checkpoint-with-modelopt)
|
||||
|
||||
## Overview
|
||||
|
||||
@ -241,31 +241,6 @@ mpirun --allow-run-as-root -np ${WORLD_SIZE} python3 run.py --engine_dir tmp/trt
|
||||
|
||||
The tutorial for encoder-decoder C++ runtime benchmark can be found in [`benchmarks/cpp`](../../benchmarks/cpp/README.md#2-launch-c-benchmarking-inflightv1-batching)
|
||||
|
||||
#### Benchmark Python runtime
|
||||
|
||||
The benchmark implementation and entrypoint can be found in [`benchmarks/python/benchmark.py`](../../benchmarks/python/benchmark.py). Specifically, [`benchmarks/python/enc_dec_benchmark.py`](../../benchmarks/python/enc_dec_benchmark.py) is the benchmark script for Encoder-Decoder models.
|
||||
|
||||
In `benchmarks/python/`:
|
||||
|
||||
```bash
|
||||
# Example 1: Single-GPU benchmark
|
||||
python benchmark.py \
|
||||
-m enc-dec \
|
||||
--batch_size "1;8" \
|
||||
--input_output_len "60,20;128,20" \
|
||||
--engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION} \
|
||||
--dtype float32 \
|
||||
--csv # optional
|
||||
|
||||
# Example 2: Multi-GPU benchmark
|
||||
mpirun --allow-run-as-root -np 4 python benchmark.py \
|
||||
-m enc-dec \
|
||||
--batch_size "1;8" \
|
||||
--input_output_len "60,20;128,20" \
|
||||
--engine_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION} \
|
||||
--dtype float32 \
|
||||
--csv # optional
|
||||
```
|
||||
|
||||
### Run BART with LoRA
|
||||
|
||||
|
||||
@ -28,14 +28,9 @@
|
||||
"accuracy/test_cli_flow.py::TestLlama3_8BInstruct::test_nvfp4": 286.4440165119886,
|
||||
"perf/test_perf.py::test_perf[bert_base-cpp-ootb-float16-bs:32-input_len:32]": 111.37450777366757,
|
||||
"perf/test_perf.py::test_perf[bert_base-cpp-plugin-float16-bs:32-input_len:32]": 95.00738414749503,
|
||||
"perf/test_perf.py::test_perf[bert_base-ootb-float16-bs:32-input_len:32]": 132.52322902716696,
|
||||
"perf/test_perf.py::test_perf[bert_base-plugin-float16-bs:32-input_len:32]": 114.33938522078097,
|
||||
"perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-bs:32-input_output_len:60": 99.74059158749878,
|
||||
"perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-gwp:0.0-bs:32-input_output_len:60": 98.94526879303157,
|
||||
"perf/test_perf.py::test_perf[gpt_350m-cppmanager-static_batching-plugin_ifb-float16-bs:32-input_output_len:60": 100.77929892018437,
|
||||
"perf/test_perf.py::test_perf[gpt_350m-ootb-float16-bs:32-input_output_len:60": 170.83428032323718,
|
||||
"perf/test_perf.py::test_perf[gpt_350m-ootb-float16-gwp:0.5-bs:32-input_output_len:60": 173.8481143657118,
|
||||
"perf/test_perf.py::test_perf[gpt_350m-plugin-float16-bs:32-input_output_len:60": 217.20630648359656,
|
||||
"perf/test_perf.py::test_perf[roberta_base-cpp-plugin-float16-bs:32-input_len:128+512]": 140.2516261599958,
|
||||
"accuracy/test_cli_flow.py::TestGemma2_9BIt::test_auto_dtype": 725.8308991710655,
|
||||
"accuracy/test_cli_flow.py::TestGpt2::test_attention_ootb": 448.54090467840433,
|
||||
@ -61,7 +56,6 @@
|
||||
"examples/test_multimodal.py::test_llm_multimodal_general[fuyu-8b-pp:1-tp:1-float16-bs:1-cpp_e2e:True-nb:1]": 492.22362083010375,
|
||||
"examples/test_multimodal.py::test_llm_multimodal_general[kosmos-2-pp:1-tp:1-float16-bs:1-cpp_e2e:True-nb:1]": 333.81485258904286,
|
||||
"examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb5-bs8]": 411.88197461143136,
|
||||
"test_e2e.py::test_benchmark_sanity_enable_fp8[gpt_350m]": 246.73502164706588,
|
||||
"test_unittests.py::test_unittests_v2[unittest/trt/model_api/test_model_quantization.py]": 493.8186915554106,
|
||||
"accuracy/test_cli_flow.py::TestGpt2::test_beam_search_large": 730.1395341157913,
|
||||
"accuracy/test_cli_flow.py::TestVicuna7B::test_eagle[cuda_graph=False-chunked_context=False-typical_acceptance=False]": 422.75362031999975,
|
||||
@ -118,7 +112,6 @@
|
||||
"examples/test_redrafter.py::test_llm_redrafter_1gpu[use_cpp_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb5-bs8]": 386.68252966180444,
|
||||
"examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb8-bs8]": 429.239758990705,
|
||||
"examples/test_whisper.py::test_llm_whisper_general[large-v3-disable_gemm_plugin-disable_attention_plugin-disable_weight_only-float16-nb:1-use_python_runtime]": 327.95307156071067,
|
||||
"test_e2e.py::test_benchmark_sanity_enable_fp8[llama_7b]": 253.08591708587483,
|
||||
"test_e2e.py::test_build_time_benchmark_sanity": 165.71592589840293,
|
||||
"test_unittests.py::test_unittests_v2[unittest/trt/attention/test_bert_attention.py]": 99.96196278184652,
|
||||
"cpp/test_e2e.py::test_benchmarks[gpt-80]": 1376.0404928650241,
|
||||
|
||||
@ -68,7 +68,7 @@ class SanityPerfCheck():
|
||||
cleaned_options = []
|
||||
for option in options:
|
||||
# Truncate workspace dir
|
||||
if "build.py" in option or "benchmark.py" in option or "SessionBenchmark.cpp" in option:
|
||||
if "build.py" in option or "SessionBenchmark.cpp" in option:
|
||||
cleaned_options.append("/".join(
|
||||
option.split("/")[-5:]))
|
||||
# Remove engine_dir as it is not useful
|
||||
|
||||
@ -481,11 +481,9 @@ class PerfTestConfig:
|
||||
labels = test_param_labels.split("-")
|
||||
|
||||
self.model_name = labels.pop(0)
|
||||
self.runtime = "python" if labels[0] not in [
|
||||
"cpp",
|
||||
"cppmanager",
|
||||
"bench",
|
||||
] else labels.pop(0)
|
||||
assert labels[0] in ["cpp", "cppmanager", "bench"], \
|
||||
f"Invalid runtime {labels[0]}!"
|
||||
self.runtime = labels.pop(0)
|
||||
self.api = labels.pop(0) if labels[0] == "exe" else ""
|
||||
self.backend = labels.pop(0) if labels[0] == "pytorch" else ""
|
||||
self.streaming = labels.pop(0) if labels[0] == "streaming" else ""
|
||||
@ -592,7 +590,7 @@ class PerfTestConfig:
|
||||
assert self.model_name in allowed_models, f"model_name {self.model_name} is not in allowed_models!"
|
||||
|
||||
# Validate runtime type.
|
||||
VALID_RUNTIMES = ["cpp", "cppmanager", "python", "bench"]
|
||||
VALID_RUNTIMES = ["cpp", "cppmanager", "bench"]
|
||||
assert self.runtime in VALID_RUNTIMES, f"Invalid runtime {self.runtime}!"
|
||||
|
||||
# Validate plugin mode.
|
||||
@ -775,8 +773,7 @@ class MultiMetricPerfTest(AbstractPerfScriptTestClass):
|
||||
elif self._config.runtime == "bench":
|
||||
benchmark_script = "trtllm-bench"
|
||||
else:
|
||||
benchmark_script = os.path.join(llm_root, "benchmarks", "python",
|
||||
"benchmark.py")
|
||||
raise RuntimeError(f"Invalid runtime {self._config.runtime}.")
|
||||
allowed_configs = import_allowed_perf_config()
|
||||
allowed_models = allowed_configs.get_allowed_models()
|
||||
if self._config.runtime == "bench":
|
||||
|
||||
@ -25,13 +25,13 @@ import pytest
|
||||
import yaml
|
||||
from defs.common import convert_weights
|
||||
from defs.trt_test_alternative import (check_call, check_call_negative_test,
|
||||
check_output, exists, makedirs)
|
||||
check_output)
|
||||
|
||||
from .common import (PluginOptions, convert_weights, prune_checkpoint,
|
||||
quantize_data, refit_model, venv_check_call)
|
||||
from .conftest import (llm_models_root, skip_nvlink_inactive,
|
||||
skip_post_blackwell, skip_pre_ada, skip_pre_blackwell,
|
||||
skip_pre_hopper, tests_path, unittest_path)
|
||||
skip_post_blackwell, skip_pre_blackwell, skip_pre_hopper,
|
||||
tests_path, unittest_path)
|
||||
|
||||
sys.path.append(os.path.join(str(tests_path()), '/../examples/apps'))
|
||||
|
||||
@ -742,79 +742,6 @@ def test_trtllm_bench_iteration_log(llm_root, llm_venv, model_name,
|
||||
shutil.rmtree(engine_dir, ignore_errors=True)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("model_name", [
|
||||
"gpt_350m", "gpt_350m_sq_per_tensor", "llama_70b", "bert_base",
|
||||
"falcon_40b", "t5_base", "roberta_base"
|
||||
],
|
||||
ids=lambda x: x.strip("-"))
|
||||
def test_benchmark_sanity(llm_root, llm_venv, model_name, engine_dir):
|
||||
'''
|
||||
sanity check on the benchmark script to make sure it works
|
||||
- gpt_350m for gpt baseline.
|
||||
- gpt_350m_sq_per_tensor for testing SQ
|
||||
- llama_70b for GQA (num_kv_heads < num_heads) in gpt benchmark script.
|
||||
- bert_base for bert baseline.
|
||||
- t5_base for t5 baseline.
|
||||
'''
|
||||
build_script_root = os.path.join(llm_root, "tests/integration/defs/perf")
|
||||
benchmark_root = os.path.join(llm_root, "benchmarks", "python")
|
||||
engine_dir = os.path.join(engine_dir, model_name, "benchmark-sanity")
|
||||
if not exists(engine_dir):
|
||||
makedirs(engine_dir)
|
||||
|
||||
# The default max batch size of 256 causes OOM on A30; use a smaller value just for this sanity test
|
||||
build_args = f"-m {model_name} --force_num_layer_1 --max_input_len 512 --max_batch_size 8"
|
||||
# test OOTB path in one of the model
|
||||
if model_name == "gpt_350m":
|
||||
build_args += " --mode ootb"
|
||||
build_cmd = f'{build_script_root}/build.py --output_dir {engine_dir} {build_args}'.split(
|
||||
" ")
|
||||
|
||||
benchmark_args = f"--batch_size 1;2 --duration 0 --num_runs 1"
|
||||
if 'bert' in model_name:
|
||||
benchmark_args += " --input_len 20;60"
|
||||
benchmark_args += " --m enc"
|
||||
else:
|
||||
benchmark_args += " --input_output_len 20,60;60,20"
|
||||
if 't5' in model_name or 'roberta' in model_name:
|
||||
benchmark_args += " --m enc-dec"
|
||||
load_cmd = f'{benchmark_root}/benchmark.py --engine_dir {engine_dir} {benchmark_args}'.split(
|
||||
" ")
|
||||
|
||||
venv_check_call(llm_venv, build_cmd)
|
||||
venv_check_call(llm_venv, load_cmd)
|
||||
|
||||
|
||||
@skip_pre_ada
|
||||
@pytest.mark.parametrize("model_name",
|
||||
["llama_7b", "gptj_6b", "gpt_350m", "falcon_40b"],
|
||||
ids=lambda x: x.strip("-"))
|
||||
def test_benchmark_sanity_enable_fp8(llm_root, llm_venv, model_name,
|
||||
engine_dir):
|
||||
'''
|
||||
sanity check on the benchmark script to make sure it works
|
||||
'''
|
||||
build_script_root = os.path.join(llm_root, "tests/integration/defs/perf")
|
||||
benchmark_root = os.path.join(llm_root, "benchmarks", "python")
|
||||
engine_dir = os.path.join(engine_dir, model_name, "benchmark-sanity")
|
||||
if not exists(engine_dir):
|
||||
makedirs(engine_dir)
|
||||
build_args = f"-m {model_name} --force_num_layer_1 --quantization fp8"
|
||||
build_cmd = f'{build_script_root}/build.py --output_dir {engine_dir} {build_args}'.split(
|
||||
" ")
|
||||
|
||||
benchmark_args = f"--batch_size 1;2 --duration 0 --num_runs 1 --quantization fp8"
|
||||
if 'bert' in model_name:
|
||||
benchmark_args += " --input_len 20;60"
|
||||
benchmark_args += " --m enc"
|
||||
else:
|
||||
benchmark_args += " --input_output_len 20,60;60,20"
|
||||
load_cmd = f'{benchmark_root}/benchmark.py --engine_dir {engine_dir} {benchmark_args}'.split(
|
||||
" ")
|
||||
venv_check_call(llm_venv, build_cmd)
|
||||
venv_check_call(llm_venv, load_cmd)
|
||||
|
||||
|
||||
def test_chatglm_6b_sanity(chatglm_6b_example_root, llm_venv, cmodel_dir,
|
||||
engine_dir):
|
||||
llm_models = llm_models_root()
|
||||
|
||||
@ -455,14 +455,6 @@ accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8_block_scales[laten
|
||||
accuracy/test_disaggregated_serving.py::TestLlama3_1_8B::test_auto_dtype[False]
|
||||
accuracy/test_disaggregated_serving.py::TestLlama3_1_8B::test_auto_dtype[True]
|
||||
|
||||
test_e2e.py::test_benchmark_sanity[bert_base] # 127.18s
|
||||
test_e2e.py::test_benchmark_sanity[gpt_350m] # 64.06s
|
||||
test_e2e.py::test_benchmark_sanity[gpt_350m_sq_per_tensor] # 97.04s
|
||||
test_e2e.py::test_benchmark_sanity[llama_70b] # 91.93s
|
||||
test_e2e.py::test_benchmark_sanity[roberta_base]
|
||||
test_e2e.py::test_benchmark_sanity[t5_base]
|
||||
test_e2e.py::test_benchmark_sanity_enable_fp8[gpt_350m]
|
||||
test_e2e.py::test_benchmark_sanity_enable_fp8[llama_7b]
|
||||
test_e2e.py::test_llama_e2e[use_cpp_session-remove_input_padding-]
|
||||
test_e2e.py::test_llama_e2e[use_py_session-remove_input_padding-]
|
||||
test_e2e.py::test_llama_e2e[use_py_session--]
|
||||
|
||||
@ -146,8 +146,6 @@ l0_a10:
|
||||
- examples/test_redrafter.py::test_llm_redrafter_1gpu[use_cpp_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb8-bs8]
|
||||
- examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb5-bs8]
|
||||
- examples/test_redrafter.py::test_llm_redrafter_1gpu[use_py_session-redrafter-vicuna-7b-v1.3-bfloat16-dl5-nb8-bs8]
|
||||
- test_e2e.py::test_benchmark_sanity[bert_base]
|
||||
- test_e2e.py::test_benchmark_sanity[roberta_base]
|
||||
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba-130m-float16-disable_gemm_plugin] # 3 mins
|
||||
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba2-130m-float16-disable_gemm_plugin]
|
||||
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba-codestral-7B-v0.1-float16-disable_gemm_plugin] # 4 mins
|
||||
@ -162,7 +160,6 @@ l0_a10:
|
||||
- examples/test_mamba.py::test_llm_mamba_1gpu[mamba-130m-float16-enable_gemm_plugin] # 2 mins
|
||||
- llmapi/test_llm_e2e.py::test_llmapi_load_engine_from_build_command[llama-codellama/CodeLlama-7b-Instruct-hf] # 5min
|
||||
- llmapi/test_llm_e2e.py::test_llmapi_load_ckpt_from_convert_command # 5min
|
||||
- test_e2e.py::test_benchmark_sanity[t5_base]
|
||||
- examples/test_openai.py::test_llm_openai_triton_1gpu
|
||||
- examples/test_openai.py::test_llm_openai_triton_plugingen_1gpu
|
||||
- test_e2e.py::test_build_time_benchmark_sanity
|
||||
|
||||
@ -277,7 +277,5 @@ l0_h100:
|
||||
- examples/test_gpt.py::test_llm_minitron_fp8_with_pseudo_loras[4b]
|
||||
- examples/test_chatglm.py::test_llm_glm_4_9b_single_gpu_summary[glm-4-9b-disable_weight_only]
|
||||
- unittest/trt/model_api/test_model_quantization.py # 20 mins on H100
|
||||
- test_e2e.py::test_benchmark_sanity_enable_fp8[llama_7b] # 55.77s H100 only
|
||||
- test_e2e.py::test_benchmark_sanity_enable_fp8[gpt_350m] # 34.07s H100 only
|
||||
- unittest/bindings # 8 mins on H100
|
||||
- test_e2e.py::test_build_time_benchmark_sanity
|
||||
|
||||
@ -14,19 +14,11 @@ l0_perf:
|
||||
stage: pre_merge
|
||||
backend: tensorrt
|
||||
tests:
|
||||
- perf/test_perf.py::test_perf[bert_base-plugin-float16-bs:32-input_len:32]
|
||||
- perf/test_perf.py::test_perf[bert_base-cpp-plugin-float16-bs:32-input_len:32]
|
||||
- perf/test_perf.py::test_perf[bert_base-ootb-float16-bs:32-input_len:32]
|
||||
- perf/test_perf.py::test_perf[bert_base-cpp-ootb-float16-bs:32-input_len:32]
|
||||
- perf/test_perf.py::test_perf[roberta_base-cpp-plugin-float16-bs:32-input_len:128+512]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-plugin-float16-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-ootb-float16-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-ootb-float16-gwp:0.5-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin_ifb-float16-gwp:0.0-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-static_batching-plugin_ifb-float16-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-plugin-float16-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[gpt_350m-cppmanager-static_batching-plugin-float16-bs:32-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[t5_base-plugin-float16-bs:8-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[flan_t5_base-plugin-float16-bs:8-input_output_len:60,20]
|
||||
- perf/test_perf.py::test_perf[bart_large_cnn-plugin-float16-bs:8-input_output_len:60,20]
|
||||
|
||||
@ -85,8 +85,6 @@ full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-mini-
|
||||
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3-small-8k-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-instruct-bfloat16-enable_gemm_plugin-enable_attention_plugin-enable_fmha_with_fp32_acc-nb:1] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[roberta_base] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/functional SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/quantization SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
|
||||
@ -102,7 +100,6 @@ full:B200_PCIe/examples/test_medusa.py::test_llm_medusa_with_qaunt_base_model_1g
|
||||
full:B200_PCIe/unittest/bindings SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/attention/test_sage_attention.py unittest/llmapi/test_llm_download.py unittest/llmapi/test_llm_kv_cache_events.py unittest/llmapi/test_mpi_session.py unittest/trt/model/redrafter unittest/trt/model/test_phi.py unittest/trt/model/test_unet.py unittest/trt/python_plugin unittest/tools unittest/utils unittest/others SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/test_e2e.py::test_bert_e2e SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/quantization/test_weight_only_quant_matmul.py SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/unittest/trt/quantization/test_weight_only_groupwise_quant_matmul.py SKIP (Disable for Blackwell)
|
||||
full:B200_PCIe/examples/test_gpt.py::test_llm_gpt2_starcoder_weight_only[starcoder2-int8-float16] SKIP (Disable for Blackwell)
|
||||
@ -137,7 +134,6 @@ full:B200_PCIe/examples/test_nemotron.py::test_llm_nemotron_3_8b_1gpu[bfloat16-f
|
||||
full:B200_PCIe/accuracy/test_cli_flow.py::TestMixtral8x7B::test_fp4_plugin SKIP (Disable for Blackwell OOM)
|
||||
full:B200_PCIe/examples/test_commandr.py::test_llm_commandr_v01_single_gpu_summary[disable_weight_only] SKIP (Disable for Blackwell OOM)
|
||||
full:B200_PCIe/unittest/llmapi/test_llm_models.py -m "not (part0 or part1)" SKIP (Disable for Blackwell OOM)
|
||||
full:B200_PCIe/test_e2e.py::test_benchmark_sanity[t5_base] SKIP (Disable for Blackwell for custom mask input)
|
||||
|
||||
full:B200/examples/test_llama.py::test_llm_llama_v2_1gpu_auto_parallel[llama-v2-7b-hf] SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_mamba.py::test_llm_mamba_1gpu[mamba2-130m-float16-enable_gemm_plugin] SKIP (Disable for Blackwell)
|
||||
@ -180,8 +176,6 @@ full:B200/examples/test_phi.py::test_llm_phi_single_gpu_summary[Phi-3.5-mini-ins
|
||||
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3-mini-128k-instruct-fp8-float16] SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_phi.py::test_llm_phi_quantization_1gpu[Phi-3.5-mini-instruct-fp8-float16] SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_qwen.py::test_llm_qwen_moe_single_gpu_summary[qwen1.5_moe_a2.7b_chat-enable_paged_kv_cache-enable_remove_input_padding-enable_weight_only-enable_fmha] SKIP (Disable for Blackwell)
|
||||
full:B200/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
|
||||
full:B200/test_e2e.py::test_benchmark_sanity[roberta_base] SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/functional SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/quantization SKIP (Disable for Blackwell)
|
||||
full:B200/accuracy/test_cli_flow.py::TestVicuna7B::test_medusa[cuda_graph=False] SKIP (Disable for Blackwell)
|
||||
@ -197,7 +191,6 @@ full:B200/examples/test_medusa.py::test_llm_medusa_with_qaunt_base_model_1gpu[fp
|
||||
full:B200/unittest/bindings SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/attention/test_sage_attention.py unittest/llmapi/test_llm_download.py unittest/llmapi/test_llm_kv_cache_events.py unittest/llmapi/test_mpi_session.py unittest/trt/model/redrafter unittest/trt/model/test_phi.py unittest/trt/model/test_unet.py unittest/trt/python_plugin unittest/tools unittest/utils unittest/others SKIP (Disable for Blackwell)
|
||||
full:B200/test_e2e.py::test_bert_e2e SKIP (Disable for Blackwell)
|
||||
full:B200/test_e2e.py::test_benchmark_sanity[bert_base] SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/quantization/test_weight_only_quant_matmul.py SKIP (Disable for Blackwell)
|
||||
full:B200/unittest/trt/quantization/test_weight_only_groupwise_quant_matmul.py SKIP (Disable for Blackwell)
|
||||
full:B200/examples/test_gpt.py::test_llm_gpt2_starcoder_weight_only[starcoder2-int8-float16] SKIP (Disable for Blackwell)
|
||||
@ -233,7 +226,6 @@ full:B200/accuracy/test_cli_flow.py::TestMixtral8x7B::test_fp4_plugin SKIP (Disa
|
||||
full:B200/accuracy/test_cli_flow.py::TestMixtral8x7B::test_int8_plugin_tp8 SKIP (INT8/INT4 quantization is not supported on SM>=100.)
|
||||
full:B200/examples/test_commandr.py::test_llm_commandr_v01_single_gpu_summary[disable_weight_only] SKIP (Disable for Blackwell OOM)
|
||||
full:B200/unittest/llmapi/test_llm_models.py -m "not (part0 or part1)" SKIP (Disable for Blackwell OOM)
|
||||
full:B200/test_e2e.py::test_benchmark_sanity[t5_base] SKIP (Disable for Blackwell for custom mask input)
|
||||
full:B200/examples/test_llama.py::test_llm_llama_code_llama_quantization_4gpus_summary[CodeLlama-34b-Instruct-tp2pp2-int4_awq-nb:4] SKIP (not support on B200)
|
||||
full:B200/examples/test_llama.py::test_llm_llama_code_llama_quantization_4gpus_summary[CodeLlama-70b-hf-tp2pp2-int4_awq-nb:1] SKIP (not support on B200)
|
||||
full:B200/examples/test_enc_dec.py::test_llm_enc_dec_general[compare_hf-t5-small-float16-enable_gemm_plugin-enable_attention_plugin-enable_paged_kv_cache-tp:1-pp:1-nb:1] SKIP (not support on B200)
|
||||
|
||||