(quick-start-guide)=

# Quick Start Guide

This is the starting point for trying out TensorRT LLM. Specifically, this Quick Start Guide shows you how to get set up quickly and send HTTP requests to a model served with TensorRT LLM.

## Launch Docker Container

The [TensorRT LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) maintained by NVIDIA comes with all of the required dependencies pre-installed. You can start the container on a machine with NVIDIA GPUs by running:

```bash
docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:x.y.z
```

Replace `x.y.z` with a release tag listed on the NGC catalog page linked above.

(deploy-with-trtllm-serve)=
## Deploy Online Serving with trtllm-serve

You can use the `trtllm-serve` command to start an OpenAI-compatible server to interact with a model.
To start the server, run a command like the following example inside the Docker container:

```bash
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
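
The first start can take a while because the model weights are downloaded from Hugging Face and loaded onto the GPU. If you want to script against the endpoint, one way to confirm the server is ready is to poll the OpenAI-style `/v1/models` route. The sketch below is a minimal example that uses only the Python standard library and assumes the default host and port (`localhost:8000`):

```python
import json
import time
import urllib.request

# Poll the OpenAI-style model listing endpoint until the server answers.
# Assumes trtllm-serve is running with its default host and port.
URL = "http://localhost:8000/v1/models"

for _ in range(120):  # wait up to ~10 minutes
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            print(json.loads(resp.read().decode()))  # lists the served model(s)
            break
    except OSError:
        time.sleep(5)  # server not reachable yet; retry
else:
    raise RuntimeError("Server did not become ready in time")
```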

You may also deploy pre-quantized models to improve performance.
Ensure your GPU supports FP8 quantization before running the following:

```bash
trtllm-serve "nvidia/Qwen3-8B-FP8"
```
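
FP8 checkpoints need a GPU with hardware FP8 support, which NVIDIA introduced with the Ada Lovelace and Hopper architectures. If you are unsure about your GPU, a rough check is to query its compute capability with PyTorch, which is available inside the TensorRT LLM container; this is an illustrative sketch, not a full support matrix:

```python
import torch

# Hardware FP8 support starts at compute capability 8.9 (Ada Lovelace);
# Hopper reports 9.0 and newer architectures report higher values.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if (major, minor) >= (8, 9):
    print("This GPU should be able to run FP8 quantized models.")
else:
    print("This GPU predates hardware FP8 support; use a non-FP8 checkpoint instead.")
```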

````{note}
If you are running `trtllm-serve` inside a Docker container, you have two options for sending API requests:

1. Expose a port (e.g., 8000) to allow external access to the server from outside the container.
2. Open a new terminal and use the following command to directly attach to the running container:

```bash
docker exec -it <container_id> bash
```
````

After the server has started, you can access well-known OpenAI endpoints such as `v1/chat/completions`.
You can then run inference from a separate terminal, using a request similar to the following example:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 32,
        "temperature": 0
    }'
```

_Example Output_

```json
{
  "id": "chatcmpl-ef648e7489c040679d87ed12db5d3214",
  "object": "chat.completion",
  "created": 1741966075,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "New York is a city in the northeastern United States, located on the eastern coast of the state of New York.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 43,
    "total_tokens": 69,
    "completion_tokens": 26
  }
}
```
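
Because the server exposes the OpenAI API, you can issue the same request from Python. The sketch below assumes the `openai` client package is installed (`pip install openai`); the API key is a placeholder, since the client requires a value even though the local server does not check it by default:

```python
from openai import OpenAI

# Point the OpenAI client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)
```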

For detailed examples and command syntax, refer to the [trtllm-serve](commands/trtllm-serve/trtllm-serve.rst) section.

```{note}
Pre-configured settings for deploying popular models with `trtllm-serve` can be found in our [deployment guides](deployment-guide/index.rst).
```

## Run Offline Inference with LLM API

The LLM API is a Python API designed to facilitate setup and inference with TensorRT LLM directly within Python. It enables model optimization by simply specifying a Hugging Face repository name or a model checkpoint. The LLM API streamlines the process by managing model loading, optimization, and inference, all through a single `LLM` instance.

Here is a simple example that shows how to use the LLM API with TinyLlama.

```{literalinclude} ../../examples/llm-api/quickstart_example.py
:language: python
:linenos:
```

You can also directly load pre-quantized models ([quantized checkpoints on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)) in the `LLM` constructor.
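
For example, pointing the `LLM` constructor at the FP8 checkpoint used earlier in this guide looks roughly like the following sketch (the model name and sampling settings are illustrative):

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Any pre-quantized checkpoint from the collection linked above can be
    # passed directly; here we reuse the FP8 model served earlier in this guide.
    llm = LLM(model="nvidia/Qwen3-8B-FP8")

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    outputs = llm.generate(["Where is New York? Tell me in a single sentence."], sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```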

To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples).

## Next Steps

In this Quick Start Guide, you have:

- Learned how to deploy a model with `trtllm-serve` for online serving
- Explored the LLM API for offline inference with TensorRT LLM

To continue your journey with TensorRT LLM, explore these resources:

- **[Installation Guide](installation/index.rst)** - Detailed installation instructions for different platforms
- **[Model-Specific Deployment Guides](deployment-guide/index.rst)** - Instructions for serving specific models with TensorRT LLM
- **[Deployment Guide](examples/llm_api_examples)** - Comprehensive examples for deploying LLM inference in various scenarios
- **[Model Support](models/supported-models.md)** - Check which models are supported and how to add new ones
- **CLI Reference** - Explore TensorRT LLM command-line tools:
  - [`trtllm-serve`](commands/trtllm-serve/trtllm-serve.rst) - Deploy models for online serving
  - [`trtllm-bench`](commands/trtllm-bench.rst) - Benchmark model performance
  - [`trtllm-eval`](commands/trtllm-eval.rst) - Evaluate model accuracy