* **Model:** Llama-3.1-Nemotron-Nano-8B-v1
* **Precision:** float16
* **Environment:**
  * GPUs: 1x H100 PCIe
  * Driver: 570.86.15
| Test String | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Average Request Latency (ms) |
|---|---|---|---|
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:128,128` | 81.86 | 20956.44 | 5895.24 |
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:2000,2000` | 1.45 | 5783.92 | 211541.08 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:128,128` | 52.75 | 13505.00 | 5705.50 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:2000,2000` | 1.41 | 5630.76 | 217139.59 |
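For fixed-length runs like these, the reported numbers can be cross-checked: total token throughput should be roughly request throughput times the per-request token count encoded in the test string (input length + output length). A minimal sketch of that sanity check, assuming this relation holds to within a small measurement tolerance (the helper names `parse_io_len` and `check_throughput` are hypothetical, not part of the test harness):

```python
import re

def parse_io_len(test_string):
    """Extract (input_len, output_len) from a perf test string
    such as '...-input_output_len:128,128'."""
    m = re.search(r"input_output_len:(\d+),(\d+)", test_string)
    if not m:
        raise ValueError(f"no input_output_len in {test_string!r}")
    return int(m.group(1)), int(m.group(2))

def check_throughput(test_string, req_per_sec, tok_per_sec, tol=0.02):
    """Sanity check: total token throughput should be close to
    request throughput * (input_len + output_len).  Warmup and
    measurement-window effects make this approximate, so compare
    with a relative tolerance."""
    isl, osl = parse_io_len(test_string)
    expected = req_per_sec * (isl + osl)
    rel_err = abs(tok_per_sec - expected) / expected
    return rel_err <= tol

# The four result rows above all pass this check, e.g.:
# 81.86 req/sec * (128 + 128) tokens/req ≈ 20956 tokens/sec
row = ("llama_v3.1_nemotron_nano_8b-bench-pytorch-float16"
       "-input_output_len:128,128", 81.86, 20956.44)
print(check_throughput(*row))
```

The 2000,2000 rows deviate slightly more (about 0.3% low), which is consistent with the much longer per-request latency leaving partially completed requests at the measurement-window edges.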
Signed-off-by: Venky Ganesh <gvenkatarama@nvidia.com>