doc: Add tensorrtllm_backend serving documentation in the DeepSeek-V3 README (#4338)

Add the tensorrtllm_backend serving option to the DeepSeek-V3 README

Signed-off-by: Simeng Liu <simengl@nvidia.com>

@@ -24,6 +24,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
- [Long context support](#long-context-support)
- [Evaluation](#evaluation)
- [Serving](#serving)
- [Use trtllm-serve](#use-trtllm-serve)
- [Use tensorrtllm_backend for Triton Inference Server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
- [Advanced Usages](#advanced-usages)
- [Multi-node](#multi-node)
- [mpirun](#mpirun)
@@ -226,6 +228,7 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
```
## Serving
### Use trtllm-serve
To serve the model using `trtllm-serve`:
@@ -278,6 +281,27 @@ curl http://localhost:8000/v1/completions \
For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
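For example, a completion request for DeepSeek-R1 could look like the following sketch; the prompt and `max_tokens` value are placeholders, and the request otherwise mirrors the example above.
```bash
# Example completion request for DeepSeek-R1 (prompt and max_tokens are placeholders)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'
```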
### Use tensorrtllm_backend for Triton Inference Server (Experimental)
To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure you are using version v0.19 or later, which adds the PyTorch path as an experimental feature.
The model configuration file is located at [all_models/llmapi/tensorrt_llm/1/model.yaml](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml):
```yaml
model: <replace with the DeepSeek model name or path to the checkpoints>
backend: "pytorch"
```
Additional options, similar to those in `extra-llm-api-config.yml`, can be added to this YAML file and will be used to configure the LLM. At a minimum, `tensor_parallel_size` needs to be set to 8 on H200 and B200 machines and to 16 on H100.
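For illustration, a minimal `model.yaml` for a single 8-GPU H200 or B200 node might look like the sketch below; the model name is an example, and any further LLM API options would be appended in the same way.
```yaml
# Minimal sketch of model.yaml for one 8-GPU H200/B200 node.
# The model name is an example; use your own model name or checkpoint path.
model: deepseek-ai/DeepSeek-V3
backend: "pytorch"
tensor_parallel_size: 8   # set to 16 on H100
```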
The initial loading of the model can take around one hour; subsequent runs take advantage of weight caching.
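Before sending requests, launch Triton with the `all_models/llmapi` model repository; it is during this launch that the initial weight loading mentioned above takes place. The sketch below assumes a typical setup (clone location, release tag, and a plain `tritonserver` invocation); consult the tensorrtllm_backend documentation for the exact launch procedure in your environment.
```bash
# Sketch: launch Triton Inference Server with the llmapi model repository.
# The tag and paths below are assumptions; adjust them to your setup.
git clone -b v0.19.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
# Edit all_models/llmapi/tensorrt_llm/1/model.yaml as described above, then start the server:
tritonserver --model-repository=tensorrtllm_backend/all_models/llmapi
```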
To send requests to the server, try:
```bash
curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "Hello, my name is", "sampling_param_temperature":0.8, "sampling_param_top_p":0.95}' | sed 's/^data: //' | jq
```
The available request parameters are listed in [config.pbtxt](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/config.pbtxt).
## Advanced Usages
### Multi-node
TensorRT-LLM supports multi-node inference. You can use mpirun or Slurm to launch multi-node jobs. We will use two nodes for this example.