diff --git a/examples/models/core/deepseek_v3/README.md b/examples/models/core/deepseek_v3/README.md
index cbbcf00227..06738820fd 100644
--- a/examples/models/core/deepseek_v3/README.md
+++ b/examples/models/core/deepseek_v3/README.md
@@ -24,6 +24,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 - [Long context support](#long-context-support)
 - [Evaluation](#evaluation)
 - [Serving](#serving)
+  - [Use trtllm-serve](#use-trtllm-serve)
+  - [Use tensorrtllm_backend for Triton Inference Server (Experimental)](#use-tensorrtllm_backend-for-triton-inference-server-experimental)
 - [Advanced Usages](#advanced-usages)
   - [Multi-node](#multi-node)
     - [mpirun](#mpirun)
@@ -226,6 +228,7 @@ trtllm-eval --model \
 ```
 
 ## Serving
+### Use trtllm-serve
 
 To serve the model using `trtllm-serve`:
 
@@ -278,6 +281,27 @@ curl http://localhost:8000/v1/completions \
 ```
 
 For DeepSeek-R1, use the model name `deepseek-ai/DeepSeek-R1`.
+
+### Use tensorrtllm_backend for Triton Inference Server (Experimental)
+To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure you are on version v0.19 or later, in which the PyTorch path is available as an experimental feature.
+
+The model configuration file is located at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/1/model.yaml
+
+```yaml
+model:
+backend: "pytorch"
+```
+Additional options, similar to those in `extra-llm-api-config.yml`, can be added to the YAML file and are used to configure the LLM. At a minimum, `tensor_parallel_size` must be set to 8 on H200 and B200 machines and to 16 on H100.
+
+The initial model load can take around one hour; subsequent runs benefit from weight caching.
+
+To send requests to the server, try:
+```bash
+curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "Hello, my name is", "sampling_param_temperature":0.8, "sampling_param_top_p":0.95}' | sed 's/^data: //' | jq
+```
+The available request parameters are listed in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/llmapi/tensorrt_llm/config.pbtxt.
+
+
 ## Advanced Usages
 ### Multi-node
 TensorRT-LLM supports multi-node inference. You can use mpirun or Slurm to launch multi-node jobs. We will use two nodes for this example.
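
For reference, below is a sketch of what an expanded `model.yaml` with the minimum settings called out in this patch could look like. The model identifier and the placement of `tensor_parallel_size` as a top-level key are assumptions for illustration only; the linked `model.yaml` and `config.pbtxt` in tensorrtllm_backend remain the authoritative references for the supported keys.

```yaml
# Illustrative sketch, not taken from the repository:
# assumes `model` accepts a Hugging Face ID or local checkpoint path and that
# LLM API options such as tensor_parallel_size are given as top-level keys.
model: deepseek-ai/DeepSeek-V3
backend: "pytorch"
tensor_parallel_size: 8   # per the note above: 8 on H200/B200, 16 on H100
```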