(quick-start-guide)=
# Quick Start Guide
This is the starting point for trying out TensorRT-LLM. Specifically, this Quick Start Guide shows you how to get set up quickly and send your first HTTP requests with TensorRT-LLM.
## Launch Docker on a node with NVIDIA GPUs
```bash
docker run --ipc host --gpus all -p 8000:8000 -it nvcr.io/nvidia/tensorrt-llm/release
```
(deploy-with-trtllm-serve)=
## Deploy online serving with trtllm-serve
You can use the `trtllm-serve` command to start an OpenAI-compatible server for a model.
To start the server, run a command like the following example inside the Docker container:
```bash
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
````{note}
If you are running `trtllm-serve` inside a Docker container, you have two options for sending API requests:
1. Expose a port (e.g., 8000) to allow external access to the server from outside the container.
2. Open a new terminal and use the following command to attach directly to the running container:
```bash
docker exec -it <container_id> bash
```
````
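The first start can take a while because the model weights are downloaded and loaded. A minimal readiness check, sketched below, is to poll the standard OpenAI `v1/models` endpoint until the server answers; it assumes the default `localhost:8000` address is reachable from wherever you run it (either inside the container or through an exposed port).

```python
# Poll the OpenAI-compatible /v1/models endpoint until the server responds.
# Assumes the server is reachable at the default localhost:8000 address.
import time
import urllib.error
import urllib.request

URL = "http://localhost:8000/v1/models"

for _ in range(60):
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            if resp.status == 200:
                print("Server is ready:", resp.read().decode())
                break
    except (urllib.error.URLError, OSError):
        pass  # Server not up yet; retry after a short wait.
    time.sleep(5)
else:
    raise RuntimeError("Server did not become ready in time")
```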
Once the server is running, you can access standard OpenAI endpoints such as `v1/chat/completions`.
From a separate terminal, send a request similar to the following example:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Where is New York? Tell me in a single sentence."}
        ],
        "max_tokens": 32,
        "temperature": 0
    }'
```
_Example Output_
```json
{
  "id": "chatcmpl-ef648e7489c040679d87ed12db5d3214",
  "object": "chat.completion",
  "created": 1741966075,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a city in the northeastern United States, located on the eastern coast of the state of New York.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 43,
    "total_tokens": 69,
    "completion_tokens": 26
  }
}
```
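Because the endpoints are OpenAI-compatible, the same request can also be sent from Python with the `openai` client package. This is a minimal sketch, assuming `pip install openai` and a server listening on `localhost:8000`; the local server does not validate the API key, but the client requires a value.

```python
from openai import OpenAI

# Point the client at the local trtllm-serve instance instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)
```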
For detailed examples and command syntax, refer to the [trtllm-serve](commands/trtllm-serve/trtllm-serve.rst) section.
## Run offline inference with the LLM API
The LLM API is a Python API designed to facilitate setup and inference with TensorRT-LLM directly within Python. It enables model optimization by simply specifying a Hugging Face repository name or a model checkpoint. The LLM API streamlines the process by managing model loading, optimization, and inference through a single `LLM` instance.
Here is a simple example showing how to use the LLM API with TinyLlama.
```{literalinclude} ../../examples/llm-api/quickstart_example.py
:language: python
:linenos:
```
You can also pass pre-quantized models, such as the [quantized checkpoints on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4), directly to the `LLM` constructor.
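For example, here is a minimal sketch that points the `LLM` constructor at a pre-quantized FP8 checkpoint; the model name below is illustrative, so substitute any checkpoint from the collection that fits on your GPU.

```python
from tensorrt_llm import LLM, SamplingParams

# The checkpoint name is illustrative; pick any pre-quantized model from the
# Hugging Face collection linked above that fits in your GPU memory.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

prompts = ["Where is New York?"]
sampling_params = SamplingParams(max_tokens=32, temperature=0.0)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```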
To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples).
## Next Steps
In this Quick Start Guide, you have:
- Learned how to deploy a model with `trtllm-serve` for online serving
- Explored the LLM API for offline inference with TensorRT-LLM
To continue your journey with TensorRT-LLM, explore these resources:
- **[Installation Guide](installation/index.rst)** - Detailed installation instructions for different platforms
- **[Deployment Guide](examples/llm_api_examples)** - Comprehensive examples for deploying LLM inference in various scenarios
- **[Model Support](models/supported-models.md)** - Check which models are supported and how to add new ones
- **CLI Reference** - Explore TensorRT-LLM command-line tools:
- [`trtllm-serve`](commands/trtllm-serve/trtllm-serve.rst) - Deploy models for online serving
- [`trtllm-bench`](commands/trtllm-bench.rst) - Benchmark model performance
- [`trtllm-eval`](commands/trtllm-eval.rst) - Evaluate model accuracy