
(quick-start-guide)=

# Quick Start Guide

This Quick Start Guide is the starting point for trying out TensorRT-LLM. It shows you how to get set up quickly and send your first HTTP requests to a served model.

## Launch Docker on a node with NVIDIA GPUs deployed

```bash
docker run --ipc host --gpus all -p 8000:8000 -it nvcr.io/nvidia/tensorrt-llm/release
```

(deploy-with-trtllm-serve)=

## Deploy online serving with trtllm-serve

You can use the `trtllm-serve` command to start an OpenAI-compatible server to interact with a model. To start the server, run a command like the following example inside a Docker container:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
If you are running trtllm-server inside a Docker container, you have two options for sending API requests:
1. Expose a port (e.g., 8000) to allow external access to the server from outside the container.
2. Open a new terminal and use the following command to directly attach to the running container:
```bash
docker exec -it <container_id> bash

After the server has started, you can access well-known OpenAI endpoints such as `v1/chat/completions`. You can then run inference from a separate terminal with a request like the example below.

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 32,
        "temperature": 0
    }'
```

*Example Output*

```json
{
  "id": "chatcmpl-ef648e7489c040679d87ed12db5d3214",
  "object": "chat.completion",
  "created": 1741966075,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a city in the northeastern United States, located on the eastern coast of the state of New York.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 43,
    "total_tokens": 69,
    "completion_tokens": 26
  }
}
```
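
Because the server exposes an OpenAI-compatible API, you can also send the same request with the OpenAI Python client instead of `curl`. The snippet below is a minimal sketch; it assumes the `openai` Python package (v1 or later) is installed and the server is reachable at `localhost:8000`.

```python
from openai import OpenAI

# The OpenAI client requires an API key value; a placeholder is fine
# if the server does not enforce authentication.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)
```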

For detailed examples and command syntax, refer to the `trtllm-serve` section.

## Run Offline Inference with the LLM API

The LLM API is a Python API designed to facilitate setup and inference with TensorRT-LLM directly within Python. It enables model optimization by simply specifying a Hugging Face repository name or a local model checkpoint. The LLM API streamlines the process by managing model loading, optimization, and inference through a single `LLM` instance.

Here is a simple example to show how to use the LLM API with TinyLlama.

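The snippet below is a minimal sketch of such an example; it assumes the `tensorrt_llm` package is installed and that `LLM` and `SamplingParams` can be imported from the top-level `tensorrt_llm` module.

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Prompts to run offline inference on.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Specifying the Hugging Face repository name downloads and optimizes
    # the model automatically.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```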

You can also load pre-quantized checkpoints from Hugging Face directly in the LLM constructor, as sketched below. To learn more about the LLM API, check out the LLM API documentation and examples.
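
The checkpoint name in the following sketch is illustrative only; substitute a quantized Hugging Face repository you actually intend to deploy.

```python
from tensorrt_llm import LLM

# Illustrative pre-quantized (FP8) checkpoint name; replace it with the
# quantized Hugging Face repository you want to use.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
```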

## Next Steps

In this Quick Start Guide, you have:

- Learned how to deploy a model with `trtllm-serve` for online serving
- Explored the LLM API for offline inference with TensorRT-LLM

To continue your journey with TensorRT-LLM, explore these resources:

- Installation Guide - Detailed installation instructions for different platforms
- Deployment Guide - Comprehensive examples for deploying LLM inference in various scenarios
- Model Support - Check which models are supported and how to add new ones
- CLI Reference - Explore TensorRT-LLM command-line tools