doc: Add tensorrtllm_backend serving documentation in the Deepseek-V3 README (#4338)

Add tensorrtllm_backend serving option in the Deepseek-V3 README

Signed-off-by: Simeng Liu <simengl@nvidia.com>
parent 7fb0af9320 · commit efe0972efb
@@ -24,6 +24,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
- [Long context support](#long-context-support)
- [Evaluation](#evaluation)
- [Serving](#serving)
  - [Use trtllm-serve](#use-trtllm-serve)
  - [Use tensorrtllm_backend for Triton Inference Server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
- [Advanced Usages](#advanced-usages)
  - [Multi-node](#multi-node)
    - [mpirun](#mpirun)
@@ -226,6 +228,7 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
```

## Serving

### Use trtllm-serve

To serve the model using `trtllm-serve`:
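The full launch command falls in a part of the README that this diff does not show. As a minimal, hedged sketch (the backend, parallelism, and config-file name below are illustrative assumptions, not the README's exact values), it could look like:

```bash
# Hedged sketch of a trtllm-serve launch: serve DeepSeek-V3 with the PyTorch
# backend, 8-way tensor parallelism, and extra LLM-API options from a local
# YAML file. Flags and values are assumptions for illustration.
trtllm-serve deepseek-ai/DeepSeek-V3 \
    --host localhost \
    --port 8000 \
    --backend pytorch \
    --tp_size 8 \
    --extra_llm_api_options ./extra-llm-api-config.yml
```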
@@ -278,6 +281,27 @@ curl http://localhost:8000/v1/completions \

For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
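For reference, a completions request against the server could look like the following sketch (the prompt and sampling values are illustrative assumptions):

```bash
# Hedged sketch of an OpenAI-compatible completions request; swap the model
# name for deepseek-ai/DeepSeek-R1 when serving the R1 checkpoint.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "prompt": "Where is New York?",
        "max_tokens": 32,
        "temperature": 0
    }'
```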
### Use tensorrtllm_backend for Triton Inference Server (Experimental)

To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure you are using version v0.19 or later, in which the PyTorch path was added as an experimental feature.

The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml:
```yaml
model: <replace with the deepseek model or path to the checkpoints>
backend: "pytorch"
```
Additional configs similar to those in `extra-llm-api-config.yml` can be added to the YAML file and will be used to configure the LLM model. At a minimum, `tensor_parallel_size` needs to be set to 8 on H200 and B200 machines and to 16 on H100.
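A hedged sketch of such a `model.yaml` on an H200 or B200 machine (the extra key follows the note above; treat the concrete model name as an illustrative assumption):

```yaml
# Hedged sketch: model.yaml for an 8-GPU H200/B200 machine. The model name is
# an illustrative assumption; tensor_parallel_size follows the note above.
model: deepseek-ai/DeepSeek-V3
backend: "pytorch"
tensor_parallel_size: 8
```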
The initial loading of the model can take around one hour; subsequent runs take advantage of the weight caching.

To send requests to the server, try:
```bash
curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "Hello, my name is", "sampling_param_temperature":0.8, "sampling_param_top_p":0.95}' | sed 's/^data: //' | jq
```

The available request parameters are listed in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/config.pbtxt.
## Advanced Usages

### Multi-node

TensorRT-LLM supports multi-node inference. You can use mpirun or Slurm to launch multi-node jobs. We will use two nodes for this example.
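As a rough sketch of the mpirun path (hostnames, slot counts, and the serve flags are assumptions for illustration, not the README's exact recipe):

```bash
# Hedged sketch: launch the server across two 8-GPU nodes with OpenMPI.
# node1/node2 and the trtllm-serve flags are illustrative assumptions.
mpirun -H node1:8,node2:8 -n 16 \
    trtllm-llmapi-launch \
    trtllm-serve deepseek-ai/DeepSeek-V3 \
        --backend pytorch \
        --tp_size 16 \
        --extra_llm_api_options ./extra-llm-api-config.yml
```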