doc: Minor fixes and clarification (#4975)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Kaiyu Xie 2025-06-09 14:06:09 +08:00 committed by GitHub
parent 5ee14657b4
commit d90fe3c69c


@@ -28,7 +28,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [Evaluation](#evaluation)
 - [Serving](#serving)
 - [Use trtllm-serve](#use-trtllm-serve)
-- [Use tensorrtllm_backend for triton inference server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
+- [Use tensorrtllm\_backend for triton inference server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
 - [Advanced Usages](#advanced-usages)
 - [Multi-node](#multi-node)
 - [mpirun](#mpirun)
@@ -38,6 +38,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [FlashMLA](#flashmla)
 - [FP8 KV Cache and MLA](#fp8-kv-cache-and-mla)
 - [W4AFP8](#w4afp8)
+- [Activation calibration](#activation-calibration)
+- [Weight quantization and assembling](#weight-quantization-and-assembling)
 - [KV Cache Reuse](#kv-cache-reuse)
 - [Notes and Troubleshooting](#notes-and-troubleshooting)
 - [Known Issues](#known-issues)
@@ -232,6 +234,8 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
 ## Serving
 ### Use trtllm-serve
+Take the max-throughput scenario on B200 as an example; the settings are extracted from the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-max-throughput). **For users' own models and use cases, the specific settings may need to be adjusted to get the best performance.**
 To serve the model using `trtllm-serve`:
 ```bash
@@ -259,12 +263,12 @@ trtllm-serve \
 --host localhost \
 --port 8000 \
 --backend pytorch \
---max_batch_size 161 \
---max_num_tokens 1160 \
+--max_batch_size 384 \
+--max_num_tokens 1536 \
 --tp_size 8 \
 --ep_size 8 \
 --pp_size 1 \
---kv_cache_free_gpu_memory_fraction 0.95 \
+--kv_cache_free_gpu_memory_fraction 0.85 \
 --extra_llm_api_options ./extra-llm-api-config.yml
 ```
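
Once the server is up, `trtllm-serve` exposes an OpenAI-compatible HTTP API. The snippet below is a minimal sanity check, assuming the server is reachable at `localhost:8000` as configured above; the model name in the request body is a placeholder and should match the model actually being served.

```bash
# Minimal request against the OpenAI-compatible chat completions endpoint.
# Assumes trtllm-serve was started with --host localhost --port 8000 as above;
# "deepseek-ai/DeepSeek-R1" is a placeholder for whichever model is being served.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 64
      }'
```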