test(perf): Add Llama-3.1-Nemotron-8B-v1 to perf tests (#3822)
*   **Model:** Llama-3.1-Nemotron-Nano-8B-v1
*   **Precision:** float16
*   **Environment:**
    *   GPUs: 1 H100 PCIe
    *   Driver: 570.86.15

*   **Test String:** `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:128,128`
*   **Request Throughput:** 81.86 req/sec
*   **Total Token Throughput:** 20956.44 tokens/sec
*   **Average Request Latency:** 5895.24 ms

*   **Test String:** `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:2000,2000`
*   **Request Throughput:** 1.45 req/sec
*   **Total Token Throughput:** 5783.92 tokens/sec
*   **Average Request Latency:** 211541.08 ms

*   **Test String:** `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:128,128`
*   **Request Throughput:** 52.75 req/sec
*   **Total Token Throughput:** 13505.00 tokens/sec
*   **Average Request Latency:** 5705.50 ms

*   **Test String:** `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:2000,2000`
*   **Request Throughput:** 1.41 req/sec
*   **Total Token Throughput:** 5630.76 tokens/sec
*   **Average Request Latency:** 217139.59 ms

Signed-off-by: Venky Ganesh <gvenkatarama@nvidia.com>

Sanity Perf Check Introduction

Background

The sanity perf check mechanism is how perf regressions are detected in L0 testing. We maintain base_perf.csv, which holds the perf baselines for several models, and run sanity_perf_check.py to compare new results against those baselines and flag any regression.
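
A minimal sketch of the kind of baseline comparison this implies is shown below. The column names (`test_name`, `metric`, `baseline_value`) and the 10% tolerance are illustrative assumptions, not the actual schema or thresholds used by `sanity_perf_check.py`.

```python
import csv

# Hypothetical tolerance: flag anything more than 10% worse than baseline.
TOLERANCE = 0.10

def load_baseline(path="base_perf.csv"):
    """Read the baseline CSV into {(test_name, metric): baseline_value}."""
    baseline = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names here are illustrative, not the real file layout.
            baseline[(row["test_name"], row["metric"])] = float(row["baseline_value"])
    return baseline

def check_regressions(current_results, baseline):
    """Return the (test_name, metric) pairs whose current value regressed."""
    regressions = []
    for (test_name, metric), value in current_results.items():
        base = baseline.get((test_name, metric))
        if base is None:
            # No baseline entry yet: the MR author must add one (scenario 2 below).
            raise KeyError(f"No baseline for {test_name}/{metric}; update base_perf.csv")
        # For throughput-style metrics, a value below baseline means regression.
        if value < base * (1.0 - TOLERANCE):
            regressions.append((test_name, metric))
    return regressions
```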

Usage

There are four typical scenarios for the sanity perf check feature.

  1. The newly added MR does not impact the models' perf: the perf check passes without raising an exception.
  2. The newly added MR introduces a new model into the perf model list: the sanity check raises an exception, and the MR author needs to add the new model's perf numbers to base_perf.csv.
  3. The newly added MR improves an existing model's perf: the MR author needs to refresh the base_perf.csv data with the new baseline (see the sketch after this list for one way to update it).
  4. The newly added MR introduces a perf regression: the MR author needs to fix the issue and rerun the pipeline.
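
For scenarios 2 and 3, the baseline file has to be updated by hand. The snippet below is one possible way to overwrite or append a row in base_perf.csv; it reuses the illustrative column names from the sketch above and is not the real file layout or update procedure.

```python
import csv

def update_baseline(path, test_name, metric, new_value):
    """Overwrite (or append) the baseline entry for a single test/metric pair."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    updated = False
    for row in rows:
        if row["test_name"] == test_name and row["metric"] == metric:
            row["baseline_value"] = str(new_value)  # scenario 3: refresh the baseline
            updated = True
    if not updated:
        # Scenario 2: new model/metric, so append a fresh baseline row.
        rows.append({"test_name": test_name, "metric": metric,
                     "baseline_value": str(new_value)})

    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```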