doc: Minor fixes and clarification (#4975)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
commit d90fe3c69c (parent 5ee14657b4)
@@ -28,7 +28,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [Evaluation](#evaluation)
 - [Serving](#serving)
 - [Use trtllm-serve](#use-trtllm-serve)
-- [Use tensorrtllm_backend for triton inference server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
+- [Use tensorrtllm\_backend for triton inference server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
 - [Advanced Usages](#advanced-usages)
 - [Multi-node](#multi-node)
 - [mpirun](#mpirun)
@@ -38,6 +38,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [FlashMLA](#flashmla)
 - [FP8 KV Cache and MLA](#fp8-kv-cache-and-mla)
 - [W4AFP8](#w4afp8)
+- [Activation calibration](#activation-calibration)
+- [Weight quantization and assembling](#weight-quantization-and-assembling)
 - [KV Cache Reuse](#kv-cache-reuse)
 - [Notes and Troubleshooting](#notes-and-troubleshooting)
 - [Known Issues](#known-issues)
@@ -232,6 +234,8 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
 ## Serving
 ### Use trtllm-serve
 
+Take the max-throughput scenario on B200 as an example; the settings below are extracted from the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-max-throughput). **For your own models and use cases, the specific settings may need to differ to get the best performance.**
+
 To serve the model using `trtllm-serve`:
 
 ```bash
@@ -259,12 +263,12 @@ trtllm-serve \
     --host localhost \
     --port 8000 \
     --backend pytorch \
-    --max_batch_size 161 \
-    --max_num_tokens 1160 \
+    --max_batch_size 384 \
+    --max_num_tokens 1536 \
     --tp_size 8 \
     --ep_size 8 \
     --pp_size 1 \
-    --kv_cache_free_gpu_memory_fraction 0.95 \
-    --kv_cache_free_gpu_memory_fraction 0.85 \
+    --kv_cache_free_gpu_memory_fraction 0.85 \
     --extra_llm_api_options ./extra-llm-api-config.yml
 ```
 
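Once the server started with the command above is up, a quick way to verify it is to hit the OpenAI-compatible endpoint that `trtllm-serve` exposes. The snippet below is only a minimal sketch, not part of this commit: the model name `deepseek-ai/DeepSeek-R1` and the prompt are assumed placeholders, so substitute the model you actually passed to `trtllm-serve`.

```bash
# Minimal sanity check against the OpenAI-compatible endpoint started above.
# NOTE: the model name is an assumed placeholder; use the model you served.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 32
      }'
```

Any OpenAI-compatible client pointed at `http://localhost:8000/v1` (for example the `openai` Python package) should work the same way.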