Add docs about DeepSeek-R1 long context support. (#3910)

* Add docs about DeepSeek-R1 long context support

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

* update docs

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

* reformat

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>

---------

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Xianjie Qiao 2025-04-28 18:33:05 +08:00 committed by GitHub
parent ad15e45f07
commit 3617e948fd


@@ -21,7 +21,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
- [Quick Start](#quick-start)
- [Run a single inference](#run-a-single-inference)
- [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
- [Run evaluation on GPQA dataset](#run-evaluation-on-gpqa-dataset)
- [Long context support](#long-context-support)
- [Evaluation](#evaluation)
- [Serving](#serving)
- [Advanced Usages](#advanced-usages)
- [Multi-node](#multi-node)
@@ -88,6 +89,68 @@ python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MT
`N` is the number of MTP modules. When `N` is `0`, MTP is not used (default). When `N` is greater than `0`, `N` MTP modules are enabled. In the current implementation, all MTP modules share the same weights.
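For example, a run with `N = 2` might look like the sketch below; the `--spec_decode_nextn` flag name is an assumption here, so check `python quickstart_advanced.py --help` for the exact option that sets `N`.
```bash
# Sketch: enable MTP speculative decoding with N=2 MTP modules.
# The --spec_decode_nextn flag name is an assumption; verify with --help.
python quickstart_advanced.py \
    --model_dir <YOUR_MODEL_DIR> \
    --spec_decode_algo MTP \
    --spec_decode_nextn 2
```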
### Long context support
The DeepSeek-V3 model supports a context length of up to 128k tokens. The following shows how to benchmark 64k and 128k input sequence lengths (ISL) with `trtllm-bench` on B200.
To avoid out-of-memory (OOM) errors, adjust the values of `--max_batch_size`, `--max_num_tokens`, and `--kv_cache_free_gpu_mem_fraction`. In the examples below, `--max_num_tokens` is set to the input sequence length plus `--max_batch_size` (65536 + 12 = 65548 and 131072 + 2 = 131074).
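Before launching a long-context benchmark, it can be worth confirming the maximum context length the checkpoint advertises. This is a minimal sketch that assumes a standard Hugging Face `config.json` with a `max_position_embeddings` field:
```bash
# Print the maximum context length advertised by the checkpoint
# (assumes a standard Hugging Face config.json with a max_position_embeddings field).
DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
python -c "import json; print(json.load(open('${DS_R1_NVFP4_MODEL_PATH}/config.json'))['max_position_embeddings'])"
```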
#### ISL-64k-OSL-1024
```bash
DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
# Generate a synthetic dataset: 24 requests, each with a fixed 64k-token input and 1024-token output.
python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
--stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
token-norm-dist \
--input-mean 65536 --output-mean 1024 \
--input-stdev 0 --output-stdev 0 \
--num-requests 24 > /tmp/benchmarking_64k.txt
cat <<EOF > /tmp/extra-llm-api-config.yml
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes: [1, 4, 8, 12]
EOF
trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
--tp 8 --ep 8 \
--warmup 0 \
--dataset /tmp/benchmarking_64k.txt \
--backend pytorch \
--max_batch_size 12 \
--max_num_tokens 65548 \
--kv_cache_free_gpu_mem_fraction 0.6 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml
```
#### ISL-128k-OSL-1024
```bash
DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
# Generate a synthetic dataset: 4 requests, each with a fixed 128k-token input and 1024-token output.
python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
--stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
token-norm-dist \
--input-mean 131072 --output-mean 1024 \
--input-stdev 0 --output-stdev 0 \
--num-requests 4 > /tmp/benchmarking_128k.txt
cat <<EOF > /tmp/extra-llm-api-config.yml
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes: [1, 2]
  # Cap the tokens processed per MoE pass to limit activation memory at 128k ISL.
  moe_max_num_tokens: 16384
EOF
trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
--tp 8 --ep 8 \
--warmup 0 \
--dataset /tmp/benchmarking_128k.txt \
--backend pytorch \
--max_batch_size 2 \
--max_num_tokens 131074 \
--kv_cache_free_gpu_mem_fraction 0.3 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml
```
## Evaluation
Evaluate the model accuracy using `trtllm-eval`.