mirror of https://github.com/NVIDIA/TensorRT-LLM.git
[None][doc] Update DS-R1 example doc (#9231)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
commit 255e4ea9f0 (parent 67d3eb26af)
```diff
@@ -247,6 +247,7 @@ cuda_graph_config:
   max_batch_size: 1024
 enable_attention_dp: false
 kv_cache_config:
   enable_block_reuse: false
   dtype: fp8
+stream_interval: 10
 EOF
```
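The hunk above ends at the heredoc terminator, so only the tail of the command is visible. For readers applying the change by hand, here is a minimal sketch of the full command the updated section produces, assuming the context lines shown are the whole heredoc body and that this section uses the same `cat >./extra-llm-api-config.yml <<EOF` pattern visible in the next hunk's header:

```bash
# Sketch: write the updated low-latency config as a heredoc.
# Any lines above max_batch_size in the real doc are not visible
# in this hunk and are omitted here.
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
stream_interval: 10
EOF
```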
```diff
@@ -258,22 +259,34 @@ cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
   - 2048
   - 1024
+  - 896
   - 512
+  - 384
   - 256
+  - 192
+  - 160
   - 128
+  - 96
   - 64
+  - 48
   - 32
+  - 24
   - 16
   - 8
   - 4
   - 2
   - 1
 kv_cache_config:
   enable_block_reuse: false
   dtype: fp8
+stream_interval: 10
 enable_attention_dp: true
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
 EOF
```
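The new keys are ordinary LLM API options passed through `--extra_llm_api_options`. A short annotated excerpt follows; the comments are my gloss on these knobs, not text from the doc, so verify them against the TensorRT-LLM LLM API reference:

```yaml
stream_interval: 10        # emit streamed responses every 10 decoding iterations
                           # instead of every step, cutting streaming overhead
enable_attention_dp: true  # run attention data-parallel across ranks
attention_dp_config:       # only takes effect when enable_attention_dp is true
  enable_balance: true     # balance in-flight requests across attention-DP ranks
  batching_wait_iters: 0   # extra iterations to wait for more requests when balancing
  timeout_iters: 60        # stop waiting for balance after this many iterations
```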
```diff
@@ -285,6 +298,7 @@ cuda_graph_config:
   max_batch_size: 1024
 enable_attention_dp: false
 kv_cache_config:
   enable_block_reuse: false
   dtype: fp8
   free_gpu_memory_fraction: 0.8
+stream_interval: 10
```
```diff
@@ -301,7 +315,12 @@ cuda_graph_config:
   enable_padding: true
   max_batch_size: 512
 enable_attention_dp: true
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
 kv_cache_config:
   enable_block_reuse: false
   dtype: fp8
   free_gpu_memory_fraction: 0.8
+stream_interval: 10
```
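This variant also caps the KV-cache memory pool via `kv_cache_config`. A brief annotated excerpt, with comments that are my reading of the options rather than text from the doc:

```yaml
kv_cache_config:
  enable_block_reuse: false      # do not reuse KV blocks across requests with shared prefixes
  dtype: fp8                     # keep the KV cache in FP8 to shrink its memory footprint
  free_gpu_memory_fraction: 0.8  # reserve 80% of free GPU memory for the KV-cache pool
```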
```diff
@@ -316,12 +335,11 @@ trtllm-serve \
     --host localhost \
     --port 8000 \
-    --backend pytorch \
-    --max_batch_size 1024 \
+    --max_batch_size 2048 \
     --max_num_tokens 8192 \
     --tp_size 8 \
     --ep_size 8 \
     --pp_size 1 \
     --kv_cache_free_gpu_memory_fraction 0.9 \
     --extra_llm_api_options ./extra-llm-api-config.yml
```
You may see OOM issues with some configs. Consider reducing `kv_cache_free_gpu_memory_fraction` to a smaller value as a workaround; we are investigating and addressing the problem. If you are using the max-throughput config, reduce `max_num_tokens` to `3072` to avoid OOM issues.
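Once `trtllm-serve` is up, a quick sanity check against its OpenAI-compatible endpoints looks roughly like this (the model name below is a placeholder for whatever model was passed to `trtllm-serve`):

```bash
# Probe the health endpoint, then send a single chat completion.
curl -s http://localhost:8000/health

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```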