diff --git a/examples/models/core/deepseek_v3/README.md b/examples/models/core/deepseek_v3/README.md
index 8fc6f00537..948688c681 100644
--- a/examples/models/core/deepseek_v3/README.md
+++ b/examples/models/core/deepseek_v3/README.md
@@ -21,7 +21,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [Quick Start](#quick-start)
   - [Run a single inference](#run-a-single-inference)
   - [Multi-Token Prediction (MTP)](#multi-token-prediction-mtp)
-  - [Run evaluation on GPQA dataset](#run-evaluation-on-gpqa-dataset)
+  - [Long context support](#long-context-support)
+- [Evaluation](#evaluation)
 - [Serving](#serving)
 - [Advanced Usages](#advanced-usages)
   - [Multi-node](#multi-node)
@@ -88,6 +89,68 @@ python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MT
 
 `N` is the number of MTP modules. When `N` equals `0` (the default), MTP is not used; when `N` is greater than `0`, `N` MTP modules are enabled. In the current implementation, all MTP modules share one set of weights.
 
+### Long context support
+The DeepSeek-V3 model supports context lengths up to 128k tokens. The following shows how to benchmark 64k and 128k input sequence lengths with `trtllm-bench` on B200.
+To avoid out-of-memory (OOM) errors, adjust the values of `--max_batch_size`, `--max_num_tokens`, and `--kv_cache_free_gpu_mem_fraction`.
+#### ISL-64k-OSL-1024
+```bash
+DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
+python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
+    --stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
+    token-norm-dist \
+    --input-mean 65536 --output-mean 1024 \
+    --input-stdev 0 --output-stdev 0 \
+    --num-requests 24 > /tmp/benchmarking_64k.txt
+
+cat <<EOF > /tmp/extra-llm-api-config.yml
+pytorch_backend_config:
+  enable_overlap_scheduler: true
+  use_cuda_graph: true
+  cuda_graph_padding_enabled: true
+  cuda_graph_batch_sizes: [1, 4, 8, 12]
+EOF
+
+trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
+    --tp 8 --ep 8 \
+    --warmup 0 \
+    --dataset /tmp/benchmarking_64k.txt \
+    --backend pytorch \
+    --max_batch_size 12 \
+    --max_num_tokens 65548 \
+    --kv_cache_free_gpu_mem_fraction 0.6 \
+    --extra_llm_api_options /tmp/extra-llm-api-config.yml
+```
+
+#### ISL-128k-OSL-1024
+```bash
+DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
+python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
+    --stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
+    token-norm-dist \
+    --input-mean 131072 --output-mean 1024 \
+    --input-stdev 0 --output-stdev 0 \
+    --num-requests 4 > /tmp/benchmarking_128k.txt
+
+cat <<EOF > /tmp/extra-llm-api-config.yml
+pytorch_backend_config:
+  enable_overlap_scheduler: true
+  use_cuda_graph: true
+  cuda_graph_padding_enabled: true
+  cuda_graph_batch_sizes: [1, 2]
+  moe_max_num_tokens: 16384
+EOF
+
+trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
+    --tp 8 --ep 8 \
+    --warmup 0 \
+    --dataset /tmp/benchmarking_128k.txt \
+    --backend pytorch \
+    --max_batch_size 2 \
+    --max_num_tokens 131074 \
+    --kv_cache_free_gpu_mem_fraction 0.3 \
+    --extra_llm_api_options /tmp/extra-llm-api-config.yml
+```
+
 ## Evaluation
 
 Evaluate the model accuracy using `trtllm-eval`.
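
A note on the `--max_num_tokens` values in the two benchmarks above: in both cases they equal the input sequence length plus `--max_batch_size` (65536 + 12 = 65548 and 131072 + 2 = 131074), which appears to budget one decode token per in-flight request on top of a full-length prefill. A minimal sketch of that arithmetic, assuming the same pattern transfers to other ISL/batch-size combinations:

```bash
# Sketch: derive --max_num_tokens from the ISL and --max_batch_size, following
# the pattern in the examples above (65536 + 12 = 65548, 131072 + 2 = 131074).
# Treat the result as a starting point and tune further if you still hit OOM.
ISL=65536            # benchmark input sequence length
MAX_BATCH_SIZE=12    # value passed to trtllm-bench --max_batch_size
MAX_NUM_TOKENS=$((ISL + MAX_BATCH_SIZE))
echo "--max_num_tokens ${MAX_NUM_TOKENS}"    # prints: --max_num_tokens 65548
```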
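
Long-context benchmark runs are slow, so it can pay to sanity-check the generated dataset first. The sketch below assumes `prepare_dataset.py --stdout` writes one JSON request per line, so the line count should match `--num-requests`; file names follow the 64k example above.

```bash
# Sanity-check the synthetic dataset before launching the benchmark.
wc -l < /tmp/benchmarking_64k.txt        # expect 24, matching --num-requests
head -c 200 /tmp/benchmarking_64k.txt    # peek at the start of the first record
```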
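
If `trtllm-bench` fails while reading the file passed to `--extra_llm_api_options`, a quick parse of the generated YAML can rule out indentation or heredoc mistakes. This sketch assumes PyYAML is available in the environment; it only validates syntax, not whether the option names are recognized.

```bash
# Parse the config to catch YAML errors before starting a long run.
python3 -c "import yaml; print(yaml.safe_load(open('/tmp/extra-llm-api-config.yml')))"
```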