doc: Add tensorrtllm_backend serving documentation in the Deepseek-V3 README (#4338)

Add tensorrtllm_backend serving option in the Deepseek-V3 README

Signed-off-by: Simeng Liu <simengl@nvidia.com>
parent 7fb0af9320 · commit efe0972efb
@@ -24,6 +24,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
- [Long context support](#long-context-support)
- [Evaluation](#evaluation)
- [Serving](#serving)
  - [Use trtllm-serve](#use-trtllm-serve)
  - [Use tensorrtllm_backend for Triton Inference Server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
- [Advanced Usages](#advanced-usages)
  - [Multi-node](#multi-node)
    - [mpirun](#mpirun)
@@ -226,6 +228,7 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
```

## Serving

### Use trtllm-serve

To serve the model using `trtllm-serve`:
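The full launch command falls in a part of the README that this diff does not show. As a minimal, hedged sketch (the backend, parallelism, and config-file name below are illustrative assumptions, not the README's exact values), it could look like:

```bash
# Hedged sketch of a trtllm-serve launch: serve DeepSeek-V3 with the PyTorch
# backend, 8-way tensor parallelism, and extra LLM-API options from a local
# YAML file. Flags and values are assumptions for illustration.
trtllm-serve deepseek-ai/DeepSeek-V3 \
    --host localhost \
    --port 8000 \
    --backend pytorch \
    --tp_size 8 \
    --extra_llm_api_options ./extra-llm-api-config.yml
```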
@@ -278,6 +281,27 @@ curl http://localhost:8000/v1/completions \

For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
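For reference, a completions request against the server could look like the following sketch (the prompt and sampling values are illustrative assumptions):

```bash
# Hedged sketch of an OpenAI-compatible completions request; swap the model
# name for deepseek-ai/DeepSeek-R1 when serving the R1 checkpoint.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "prompt": "Where is New York?",
        "max_tokens": 32,
        "temperature": 0
    }'
```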
### Use tensorrtllm_backend for Triton Inference Server (Experimental)

To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure you are using version v0.19 or later, in which the PyTorch path was added as an experimental feature.

The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml:
```yaml
model: <replace with the deepseek model or path to the checkpoints>
backend: "pytorch"
```
Additional configs similar to those in `extra-llm-api-config.yml` can be added to the YAML file and will be used to configure the LLM model. At a minimum, `tensor_parallel_size` needs to be set to 8 on H200 and B200 machines and to 16 on H100.
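A hedged sketch of such a `model.yaml` on an H200 or B200 machine (the extra key follows the note above; treat the concrete model name as an illustrative assumption):

```yaml
# Hedged sketch: model.yaml for an 8-GPU H200/B200 machine. The model name is
# an illustrative assumption; tensor_parallel_size follows the note above.
model: deepseek-ai/DeepSeek-V3
backend: "pytorch"
tensor_parallel_size: 8
```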
The initial loading of the model can take around one hour; subsequent runs take advantage of the weight caching.

To send requests to the server, try:
```bash
curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "Hello, my name is", "sampling_param_temperature":0.8, "sampling_param_top_p":0.95}' | sed 's/^data: //' | jq
```

The available request parameters are listed in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/config.pbtxt.
## Advanced Usages

### Multi-node

TensorRT-LLM supports multi-node inference. You can use mpirun or Slurm to launch multi-node jobs. We will use two nodes for this example.
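As a rough sketch of the mpirun path (hostnames, slot counts, and the serve flags are assumptions for illustration, not the README's exact recipe):

```bash
# Hedged sketch: launch the server across two 8-GPU nodes with OpenMPI.
# node1/node2 and the trtllm-serve flags are illustrative assumptions.
mpirun -H node1:8,node2:8 -n 16 \
    trtllm-llmapi-launch \
    trtllm-serve deepseek-ai/DeepSeek-V3 \
        --backend pytorch \
        --tp_size 16 \
        --extra_llm_api_options ./extra-llm-api-config.yml
```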