# Quick Start Recipe for DeepSeek-R1 FP8 and NVFP4

## Introduction

This deployment guide provides step-by-step instructions for running the DeepSeek R1 model using SGLang, specifically configured for a single-node NVIDIA B200 system, with plans to extend support to NVL72 platforms. It delivers installation guides, setup steps, and custom installation procedures for both SGLang and the required FlashInfer components.

The guide covers the requirements for running DeepSeek R1, including blockscale FP8 and FP4 datatypes for Blackwell.

Additional highlights:

* The deployment starts with obtaining the model weights, then prepares the software, ensuring custom FP8/FP4 quantization and blockscale datatype support are included.
* Instructions detail how to configure the SGLang runtime for DeepSeek R1, including hardware (B200) tuning and optional extension to NVL72 multi-node infrastructures as support becomes available.
* Finally, it walks through server launch, inference validation, and best practices for production operations, ensuring you have a reliable, high-throughput setup for advanced language model inference.

## Access & Licensing

To use DeepSeek R1, you must first agree to DeepSeek’s Community License ([https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/LICENSE)). NVIDIA’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.

## Models

* FP4 (4-bit quantized): [nvidia/DeepSeek-R1-0528-FP4](https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4)
* FP8 (8-bit quantized): [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

## Prerequisites

* OS: Linux
* Drivers: CUDA Driver 575 or above
* GPU: Blackwell architecture
* [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html)

## Building Docker Image

Build a Docker image with SGLang and all dependencies using the official SGLang base image as a starting point. The provided example is for x86 (NVIDIA B200):

```dockerfile
# This is an x86 container. We will modify this to a multi-platform Blackwell image in a future iteration.
FROM lmsysorg/sglang:v0.4.10.post1-cu128-b200@sha256:1a9e19b409059075d47ca58159b370adc6d76b7eb3a85680c55f65c38b11e9db

WORKDIR /workspace

# Install latest pip and necessary development tools
RUN apt-get update && apt-get install -y git python3-dev build-essential ninja-build && \
    pip install --upgrade pip

# Install the lm-eval harness (pinned to a specific commit)
RUN pip install --no-build-isolation "lm-eval[api] @ git+https://github.com/EleutherAI/lm-evaluation-harness@4f8195f"

# Clone SGLang and FlashInfer sources
RUN git clone https://github.com/sgl-project/sglang.git /workspace/sglang/sglang-src -b #HASH && \
    git clone --recursive https://github.com/FlashInfer-ai/FlashInfer.git -b v0.2.9rc2 /workspace/flashinfer

# Build/install SGLang from source
RUN cd /workspace/sglang/sglang-src && \
    pip install --break-system-packages -e python

# Build/install FlashInfer from source, with AOT kernels for Blackwell
RUN cd /workspace/flashinfer && \
    pip install ninja && \
    export TORCH_CUDA_ARCH_LIST="9.0a 9.0 10.0 10.0a" && \
    python -m pip install --no-build-isolation -e . -v

# Install any additional dependencies for your workload
RUN pip install -U nvidia-cudnn-cu12
RUN pip install --break-system-packages httpx openai

ENV PYTHONPATH=/workspace/sglang/sglang-src:/workspace/flashinfer:${PYTHONPATH}
```

### Running the Docker Container

Once built, use this script to start a container with full GPU access and an expanded shared memory segment (important for multi-GPU workloads):

```shell
#!/bin/bash

IMAGE_NAME="sglang"

docker build --pull --no-cache -t "$IMAGE_NAME" .

docker run \
  --network=host \
  --gpus=all \
  --shm-size=256gb \
  -ti --rm "$IMAGE_NAME" bash
```
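
As an optional sanity check (not part of the original recipe), you can verify that the container sees all GPUs before starting any workload; this is a sketch that assumes `nvidia-smi` is exposed inside the container by the NVIDIA Container Toolkit:

```shell
# Optional: confirm GPU visibility inside the freshly built image
docker run --rm --gpus=all "$IMAGE_NAME" nvidia-smi
```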

## Downloading the Quantized Model Weights

### Downloading Inside Docker

To fetch the weights directly within your Docker container, use these commands:

For FP4:

```shell
MODEL_PATH=/workspace/model
mkdir -p $MODEL_PATH
cd $MODEL_PATH
git clone https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4
```

For FP8:

```shell
MODEL_PATH=/workspace/model
mkdir -p $MODEL_PATH
cd $MODEL_PATH
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
```
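
Note that cloning Hugging Face repositories with `git` requires `git-lfs` to fetch the actual weight shards. As an alternative sketch (assuming the `huggingface_hub` CLI is installed, e.g. via `pip install -U "huggingface_hub[cli]"`), you can download the FP4 checkpoint without git:

```shell
# Download the FP4 checkpoint into the same layout used by the git clone above
huggingface-cli download nvidia/DeepSeek-R1-0528-FP4 \
  --local-dir $MODEL_PATH/DeepSeek-R1-0528-FP4
```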

## Launch the SGLang Server and Client

SGLang follows a client–server architecture for efficient large language model (LLM) serving. To use it, you need to run two separate processes: one for the server and one for the client.

### Server Process

Below is an example command to launch the SGLang server with the FP4 DSR1 model. The explanation of each flag is shown in the “Server Flag Descriptions” section.

launch_server.sh

```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --tp 8 \
  --enable-flashinfer-cutlass-moe

# or if you have downloaded the model locally:

# python3 -m sglang.launch_server \
#   --model-path /workspace/model/DeepSeek-R1-0528-FP4 \
#   --trust-remote-code \
#   --quantization modelopt_fp4 \
#   --tp 8 \
#   --enable-flashinfer-cutlass-moe
```

After the server is set up, the client can send prompt requests to the server and receive results, for example with the request shown below.
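
A minimal sketch of such a request, using the OpenAI-compatible endpoint that SGLang exposes (port 30000 by default, as seen in the sample server log later in this guide):

```shell
# Send a simple chat completion request to the running SGLang server
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/DeepSeek-R1-0528-FP4",
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 64,
    "temperature": 0
  }'
```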

**Note on Quantization Choice:**
For Hopper, FP8 offers the best performance for most workloads. For Blackwell, NVFP4 provides additional memory savings and throughput gains, but may require tuning to maintain accuracy on certain tasks.

## Configuration Profiles: High Throughput vs. Low Latency

SGLang supports tuning for different deployment needs. For best results, select configuration parameters according to your workload, depending on whether you prioritize maximizing throughput (processing many concurrent requests) or minimizing response latency (faster responses for a small number of requests).

### FP8, High Throughput Configuration

Use these settings to maximize the number of concurrent requests and model utilization across multiple GPUs.

**Server:**

```shell
SGL_ENABLE_JIT_DEEPGEMM=0 SGLANG_CUTLASS_MOE=1 \
python3 -m sglang.launch_server \
  --tokenizer-path deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code \
  --enable-dp-attention \
  --disable-radix-cache \
  --max-running-requests 3072 \
  --chunked-prefill-size 32768 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 32768 \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tensor-parallel-size 8 \
  --data-parallel-size 8 \
  --attention-backend cutlass_mla
```

**Client (Benchmarking):**

```shell
python3 -m sglang.bench_serving \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --backend sglang-oai \
  --random-range-ratio 1 \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 3072 \
  --num-prompts 6148
```

*This setup is ideal for batch processing or when serving many users at once.*

### FP8, Low Latency Configuration (TEP8)

Choose these parameters to reduce the time it takes to get a response for a single request or very few requests, e.g., for demos or conversational agents that require snappy replies.

**Server:**

```shell
SGL_ENABLE_JIT_DEEPGEMM=0 \
python3 -m sglang.launch_server \
  --model-path /workspace/model/DeepSeek-R1-0528/ \
  --trust-remote-code \
  --tp 8 \
  --enable-ep-moe \
  --enable-flashinfer-trtllm-moe
```

**Client (Benchmarking):**

```shell
python3 -m sglang.bench_serving \
  --model /workspace/model/DeepSeek-R1-0528 \
  --dataset-name random \
  --backend sglang-oai \
  --random-range-ratio 1 \
  --random-input-len 1024 \
  --random-output-len 8192 \
  --max-concurrency 1 \
  --num-prompts 10
```

**Sample Output**

Below is an example of the output produced with this configuration:

```shell
# Server Initialization

[2025-08-01 21:05:23 TP0 EP0] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP2 EP2] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP5 EP5] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP4 EP4] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP6 EP6] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP7 EP7] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP3 EP3] Registering 6273 cuda graph addresses
[2025-08-01 21:05:23 TP1 EP1] Registering 6273 cuda graph addresses
[2025-08-01 21:05:24 TP0 EP0] Capture cuda graph end. Time elapsed: 26.07 s. mem usage=6.37 GB. avail mem=22.72 GB.
[2025-08-01 21:05:24 TP0 EP0] max_total_num_tokens=980344, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=3063, context_len=163840, available_gpu_mem=22.72 GB
[2025-08-01 21:05:24] INFO: Started server process [16140]
[2025-08-01 21:05:24] INFO: Waiting for application startup.
[2025-08-01 21:05:24] INFO: Application startup complete.
[2025-08-01 21:05:24] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-08-01 21:05:25] INFO: 127.0.0.1:48312 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-08-01 21:05:25 TP0 EP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-08-01 21:05:25 TP0 EP0] Using configuration from /workspace/sglang/sglang-src/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-08-01 21:05:28] INFO: 127.0.0.1:48316 - "POST /generate HTTP/1.1" 200 OK
[2025-08-01 21:05:28] The server is fired up and ready to roll!

# Client -- you would see results like the following
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 1287.25
Total input tokens: 10240
Total generated tokens: ---
Total generated tokens (retokenized): ---
Request throughput (req/s): ---
Input token throughput (tok/s): ---
Output token throughput (tok/s): ---
Total token throughput (tok/s): ---
Concurrency: ---
----------------End-to-End Latency----------------
Mean E2E Latency (ms): ---
Median E2E Latency (ms): ---
---------------Time to First Token----------------
Mean TTFT (ms): ---
Median TTFT (ms): ---
P99 TTFT (ms): ---
---------------Inter-Token Latency----------------
Mean ITL (ms): ---
Median ITL (ms): ---
P95 ITL (ms): ---
P99 ITL (ms): ---
Max ITL (ms): ---
==================================================
```

### Key Performance Metrics

| Metric | Description |
|:---|:---|
| **Median Time to First Token (ms)** | The typical time elapsed from when a request is sent until the first output token is generated, in milliseconds. |
| **Median ITL (ms)** | The typical time delay between the completion of one token and the completion of the next, in milliseconds. |
| **Median E2E Latency (ms)** | The typical total time from when a request is submitted until the final token of the response is received, in milliseconds. |
| **Total token throughput (tok/s)** | The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens, in tokens/second. |

### FP4 Serve Command

```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --disable-radix-cache \
  --max-running-requests 3072 \
  --chunked-prefill-size 32768 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 32768 \
  --quantization modelopt_fp4 \
  --tp 8 \
  --enable-flashinfer-cutlass-moe \
  --enable-ep-moe \
  --ep-size 8
```

### Testing Accuracy

While the server is still running, you can run the GSM8K accuracy benchmark from the SGLang source checkout (`/workspace/sglang/sglang-src`):

```shell
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1316 --parallel 1316
```
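
Since the Docker image also installs the lm-eval harness, an alternative accuracy check is to point `lm_eval` at the server’s OpenAI-compatible completions endpoint (default port 30000, as seen in the sample server log). This is a sketch only; adjust the model path and generation arguments to your setup:

```shell
# Hypothetical lm-eval run against the local SGLang endpoint
lm_eval --model local-completions --tasks gsm8k --num_fewshot 5 --batch_size 256 \
  --model_args model=nvidia/DeepSeek-R1-0528-FP4,base_url=http://localhost:30000/v1/completions,num_concurrent=32,tokenized_requests=False
```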

### Testing Performance

To benchmark performance, you can use the `sglang.bench_serving` command.

run_performance.sh

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --model nvidia/DeepSeek-R1-0528-FP4 \
  --num-prompts 512 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --random-range-ratio 1 \
  --max-concurrency 512 \
  --warmup-request 512 \
  --save-result --result-filename vllm_benchmark_serving_results.json
```

### Server Flag Descriptions

| Flag | Description |
| :---- | :---- |
| `--tokenizer-path` | The path of the tokenizer. |
| `--trust-remote-code` | Whether or not to allow custom models defined on the Hub in their own modeling files. |
| `--enable-dp-attention` | Whether to use a data-parallel scheme for the attention layers. |
| `--disable-radix-cache` | Disable RadixAttention for prefix caching. |
| `--max-running-requests` | The maximum number of running requests. |
| `--chunked-prefill-size` | The maximum number of tokens in a chunk for the chunked prefill. Setting this to -1 disables chunked prefill. |
| `--mem-fraction-static` | The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
| `--max-prefill-tokens` | The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model’s maximum context length. |
| `--model-path` | The path of the model weights. This can be a local folder or a Hugging Face repo ID. |
| `--tensor-parallel-size` | Tensor parallel size. |
| `--data-parallel-size` | Data parallel size, used for the attention layers of DeepSeek R1. |
| `--attention-backend` | Which backend to use for the attention layers. |

### Client Flag Descriptions

| Flag | Description |
| :---- | :---- |
| `--model` | The FP4 or FP8 DeepSeek R1 model. |
| `--num-prompts` | Total number of prompts to process. |
| `--dataset-name` | Which dataset to use for benchmarking. We use a “random” dataset here. |
| `--random-input-len` | Specifies the average input sequence length. |
| `--random-output-len` | Specifies the average output sequence length. |
| `--max-concurrency` | Maximum number of in-flight requests. We recommend matching this with the `--max-running-requests` flag used to launch the server. |
| `--save-result --result-filename` | Output location for the performance benchmarking result. |

# Deployment Guide for TensorRT-LLM Llama 3.3 70B FP8 and NVFP4

# Introduction

This deployment guide provides step-by-step instructions for running the Llama 3.3-70B Instruct model using TensorRT-LLM with FP8 and NVFP4 quantization, optimized for NVIDIA GPUs. It covers the complete setup required: from accessing model weights and preparing the software environment to configuring TensorRT-LLM parameters, launching the server, and validating inference output.

The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the PyTorch container from NGC, then installing TensorRT-LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.

# Access & Licensing

To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License ([https://ai.meta.com/resources/models-and-libraries/llama-downloads/](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)). NVIDIA’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.

# Prerequisites

* GPU: NVIDIA Blackwell or Hopper Architecture
* OS: Linux
* Drivers: CUDA Driver 575 or Later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (Optional, for accuracy evaluation only)

# Models

* FP8 model: [Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8)
* NVFP4 model: [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4)

Note that NVFP4 is only supported on NVIDIA Blackwell; a quick way to confirm your GPU generation is shown below.
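
As a sketch (assuming a recent NVIDIA driver that exposes the `compute_cap` query field), you can check the GPU name and compute capability from the host; Blackwell parts report a compute capability of 10.0 or higher:

```shell
# List GPU model and compute capability to confirm NVFP4 (Blackwell) support
nvidia-smi --query-gpu=name,compute_cap --format=csv
```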

# Deployment Steps

## Run Docker Container

Run the docker container using the TensorRT-LLM NVIDIA NGC image.

```shell
docker run --rm -it \
  --ipc=host \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache:/root/.cache:rw \
  --name tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4 \
  /bin/bash
```

Note:

* You can mount additional directories and paths using the `-v <local_path>:<path>` flag if needed, such as mounting the downloaded weight paths.
* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with `mkdir ~/.cache`.
* The command also maps port **8000** from the container to your host so you can access the LLM API endpoint from your host.
* See [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) for all the available containers. The containers published weekly from the main branch have an “rcN” suffix, while the monthly releases that pass QA tests have no “rcN” suffix. Use an rc release to get the latest model and feature support.

If you want to use the latest main branch, you can instead build TensorRT-LLM from source; the steps are described at [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html).

## Creating the TRT-LLM Server config

We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF
```

## Launch the TRT-LLM Server

Below is an example command to launch the TRT-LLM server with the Llama-3.3-70B-Instruct-FP8 model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

```shell
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 1024 \
  --max_num_tokens 2048 \
  --max_seq_len 2048 \
  --kv_cache_free_gpu_memory_fraction 0.9 \
  --tp_size 1 \
  --ep_size 1 \
  --trust_remote_code \
  --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

After the server is set up, the client can now send prompt requests to the server and receive results.

## Configs and Parameters

These options are used directly on the command line when you start the `trtllm-serve` process.

#### `--tp_size`

**Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

#### `--ep_size`

**Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

#### `--kv_cache_free_gpu_memory_fraction`

**Description:** A value between 0.0 and 1.0 that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.

**Recommendation:** If you experience OOM errors, try reducing this value to **0.8** or lower.

#### `--backend pytorch`

**Description:** Tells TensorRT-LLM to use the **pytorch** backend.

#### `--max_batch_size`

**Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

**Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `--max_seq_len`

**Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

#### `--trust_remote_code`

**Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.

#### Extra LLM API Options (YAML Configuration)

These options provide finer control over performance and are set within a YAML file passed to the trtllm-serve command via the `--extra_llm_api_options` argument. A combined example file is shown after the option descriptions below.

#### `kv_cache_config`

**Description**: A section for configuring the Key-Value (KV) cache.

**Options**:

* `dtype`: Sets the data type for the KV cache.
  * **Default**: auto (uses the data type specified in the model checkpoint).

#### `cuda_graph_config`

**Description**: A section for configuring CUDA graphs to optimize performance.

**Options**:

* `enable_padding`: If true, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.
  * **Default**: false
* `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.
  * **Default**: 0
  * **Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.
* `batch_sizes`: A specific list of batch sizes to create CUDA graphs for.
  * **Default**: None

#### `moe_config`

**Description**: Configuration for Mixture-of-Experts (MoE) models.

**Options**:

* `backend`: The backend to use for MoE operations.
  * **Default**: CUTLASS

#### `attention_backend`

**Description**: The backend to use for attention calculations.

**Default**: TRTLLM

See the `TorchLlmArgs` class ([https://github.com/nvidia/TensorRT-LLM/blob/main/tensorrt_llm/llmapi/llm_args.py#L1894](https://github.com/nvidia/TensorRT-LLM/blob/main/tensorrt_llm/llmapi/llm_args.py#L1894)) for the full list of options which can be used in `extra_llm_api_options`.
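
Putting these options together, a minimal sketch of an extended extra-options file, written in the same style as the earlier config block (values are illustrative, not tuned recommendations):

```shell
# Hypothetical extended extra-options file combining the settings described above
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
kv_cache_config:
  dtype: fp8              # or "auto" to inherit the checkpoint dtype
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024    # keep in sync with --max_batch_size
moe_config:
  backend: CUTLASS        # MoE settings have no effect on Llama 3.3, which is dense
attention_backend: TRTLLM
EOF
```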

# Testing API Endpoint

## Basic Test

Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:

```shell
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation. If you are scripting the deployment, you can poll the endpoint until it reports ready, as in the sketch below.
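
A minimal sketch of such a readiness wait loop (assuming `curl` is available on the host):

```shell
# Poll the health endpoint until the server returns HTTP 200
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health)" = "200" ]; do
  echo "Waiting for the TRT-LLM server to become ready..."
  sleep 10
done
echo "Server is ready."
```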

After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.

```shell
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
    "prompt": "Where is New York?",
    "max_tokens": 16,
    "temperature": 0
}'
```

Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence.

```json
{"id":"cmpl-bc1393d529ce485c961d9ffee5b25d72","object":"text_completion","created":1753843963,"model":"nvidia/Llama-3.3-70B-Instruct-FP8","choices":[{"index":0,"text":" New York is a state located in the northeastern United States. It is bordered by","token_ids":null,"logprobs":null,"context_logits":null,"finish_reason":"length","stop_reason":null,"disaggregated_params":null}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16},"prompt_token_ids":null}
```

## Troubleshooting Tips

* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
* Ensure your model checkpoints are compatible with the expected format.
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running (see the example after this list).
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
* For connection issues, make sure port 8000 is not being used by another application.
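
For example, a simple way to watch utilization and memory continuously from the host (a sketch; the query fields below require a reasonably recent driver):

```shell
# Refresh GPU utilization and memory usage every second while the server runs
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv -l 1
```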

## Running Evaluations to Verify Accuracy (Optional)

We use the lm-eval tool to test the model’s accuracy. For more information see [https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

To run the evaluation harness, exec into the running TensorRT-LLM container and install it with these commands:

```shell
docker exec -it tensorrt_llm /bin/bash

pip install lm_eval
```

FP8 command for GSM8K:

* Note: The tokenizer adds a BOS (beginning-of-sentence) token before the input prompt by default, which leads to an accuracy regression on the GSM8K task for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid it.

```shell
MODEL_PATH=nvidia/Llama-3.3-70B-Instruct-FP8

lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0,add_special_tokens=False --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp8.gsm8k
```

Sample result on Blackwell:

```
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9348|± |0.0068|
| | |strict-match | 5|exact_match|↑ |0.8870|± |0.0087|
```

FP4 command for GSM8K:

* Note: The tokenizer adds a BOS token before the input prompt by default, which leads to an accuracy regression on the GSM8K task for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid it.

```shell
MODEL_PATH=nvidia/Llama-3.3-70B-Instruct-FP4

lm_eval --model local-completions --tasks gsm8k --batch_size 256 --gen_kwargs temperature=0.0,add_special_tokens=False --num_fewshot 5 --model_args model=${MODEL_PATH},base_url=http://localhost:8000/v1/completions,num_concurrent=32,max_retries=20,tokenized_requests=False --log_samples --output_path trtllm.fp4.gsm8k
```

Sample result on Blackwell:

```
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9356|± |0.0068|
| | |strict-match | 5|exact_match|↑ |0.8393|± |0.0101|
```

# Benchmarking Performance

To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat << 'EOF' > bench.sh
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/llama3.3_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model nvidia/Llama-3.3-70B-Instruct-FP8 \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```

To benchmark the FP4 model, replace `--model nvidia/Llama-3.3-70B-Instruct-FP8` with `--model nvidia/Llama-3.3-70B-Instruct-FP4`.

If you want to save the results to a file, add the following options:

```shell
--save-result \
--result-dir "${result_dir}" \
--result-filename "concurrency_${concurrency}.json"
```

For more benchmarking options, see [https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py).

Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies listed in the script above.

```shell
./bench.sh
```

Sample TensorRT-LLM serving benchmark output. Your results may vary due to ongoing software optimizations.

```
============ Serving Benchmark Result ============
Successful requests: 16
Benchmark duration (s): 17.66
Total input tokens: 16384
Total generated tokens: 16384
Request throughput (req/s): [result]
Output token throughput (tok/s): [result]
Total Token throughput (tok/s): [result]
User throughput (tok/s): [result]
---------------Time to First Token----------------
Mean TTFT (ms): [result]
Median TTFT (ms): [result]
P99 TTFT (ms): [result]
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): [result]
Median TPOT (ms): [result]
P99 TPOT (ms): [result]
---------------Inter-token Latency----------------
Mean ITL (ms): [result]
Median ITL (ms): [result]
P99 ITL (ms): [result]
----------------End-to-end Latency----------------
Mean E2EL (ms): [result]
Median E2EL (ms): [result]
P99 E2EL (ms): [result]
==================================================
```

## Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.