(quick-start-guide)=
# Quick Start Guide
This is the starting point for trying out TensorRT-LLM. This Quick Start Guide shows you how to get set up, run offline inference with the LLM API, and send HTTP requests to an OpenAI-compatible server.
## Installation
There are multiple ways to install and run TensorRT-LLM. For most users, the options below are listed roughly in order from simplest to most involved; all of them support the same set of features.
Note: This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
- The pre-built Docker release container on NGC (used in the examples below)
- Pre-built release wheels on PyPI
The following examples are easiest to run with the pre-built Docker release container available on NGC (see also release.md on GitHub). Run these commands as a user with appropriate permissions, preferably root, to streamline the setup process.
Launch Docker on a node with NVIDIA GPUs deployed.
docker run --ipc host --gpus all -it nvcr.io/nvidia/tensorrt-llm/release
## Run Offline Inference with the LLM API
The LLM API is a Python API designed to facilitate setup and inference with TensorRT-LLM directly within Python. It enables model optimization by simply specifying a Hugging Face repository name or a model checkpoint. The LLM API streamlines the process by managing checkpoint conversion, engine building, engine loading, and model inference, all through a single Python object.
Here is a simple example to show how to use the LLM API with TinyLlama.
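A minimal sketch, assuming the tensorrt_llm LLM and SamplingParams interfaces (field names such as outputs[0].text may differ slightly between releases):

```python
from tensorrt_llm import LLM, SamplingParams

# Prompts for offline (batch) inference.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling settings; temperature=0.8 / top_p=0.95 is just a reasonable default here.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Pass a Hugging Face repo id; checkpoint download, conversion, and
# engine preparation are handled by the LLM object.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Generate completions for all prompts in one call and print them.
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```

Each result pairs the original prompt with its generated text, so the loop prints one completion per prompt.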
You can also directly load TensorRT Model Optimizer quantized checkpoints from Hugging Face in the LLM constructor, as in the sketch below. To learn more about the LLM API, check out the LLM API documentation and examples.
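For instance, a quantized checkpoint is passed by repo id just like an unquantized one; the repo id below is an illustrative placeholder rather than a verified model name:

```python
from tensorrt_llm import LLM

# Placeholder repo id for a Model Optimizer quantized checkpoint on Hugging Face;
# substitute the actual repository you want to use.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

outputs = llm.generate(["Summarize FP8 quantization in one sentence."])
print(outputs[0].outputs[0].text)
```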
(deploy-with-trtllm-serve)=
## Deploy online serving with trtllm-serve
You can use the trtllm-serve command to start an OpenAI-compatible server to interact with a model.
To start the server, you can run a command like the following example inside a Docker container:
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Note: If you are running trtllm-serve inside a Docker container, you have two options for sending API requests:
- Expose port 8000 to access the server from outside the container.
- Open a new terminal and use the following command to attach directly to the running container:
docker exec -it <container_id> bash
After the server has started, you can access well-known OpenAI endpoints such as v1/chat/completions. You can then run inference from a separate terminal with requests like the following example.
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages":[{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
"max_tokens": 32,
"temperature": 0
}'
Example Output
{
"id": "chatcmpl-ef648e7489c040679d87ed12db5d3214",
"object": "chat.completion",
"created": 1741966075,
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "New York is a city in the northeastern United States, located on the eastern coast of the state of New York.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 43,
"total_tokens": 69,
"completion_tokens": 26
}
}
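Because the endpoints are OpenAI-compatible, you can also use the openai Python client instead of curl. A minimal sketch, assuming the local server does not validate the API key (the placeholder key below is an assumption):

```python
from openai import OpenAI

# Point the client at the local trtllm-serve endpoint.
# The api_key is a placeholder; the local server is assumed not to check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)
```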
For detailed examples and command syntax, refer to the trtllm-serve section.
## Next Steps
In this Quick Start Guide, you:
- Saw an example of the LLM API
- Learned about deploying a model with trtllm-serve
For more examples, refer to:
- The examples, which showcase how to run a quick benchmark on the latest LLMs.