# LLM API Introduction
The LLM API is a high-level Python API designed to streamline LLM inference workflows.

It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo).

While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.
## Quick Start Example
A simple inference example with TinyLlama using the LLM API:
```{literalinclude} ../../../examples/llm-api/quickstart_example.py
:language: python
:linenos:
```
For more advanced usage, including distributed inference, multimodal models, and speculative decoding, refer to this [README](../../../examples/llm-api/README.md).
## Model Input
The `LLM()` constructor accepts either a Hugging Face model ID or a local model path as input.
### 1. Using a Model from the Hugging Face Hub
To load a model directly from the [Hugging Face Model Hub](https://huggingface.co/), simply pass its model ID (i.e., repository name) to the LLM constructor. The model will be automatically downloaded:
```python
from tensorrt_llm import LLM

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
```
You can also use [quantized checkpoints](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) (FP4, FP8, etc.) of popular models provided by NVIDIA in the same way.
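A prequantized checkpoint is loaded by its repository name just like any other model; the repository ID below is only an illustrative sketch, so check the NVIDIA collection for the checkpoints that are actually available:

```python
from tensorrt_llm import LLM

# Illustrative repository ID -- substitute any FP8/FP4 checkpoint from
# NVIDIA's Model Optimizer collection on the Hugging Face Hub.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")
```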
### 2. Using a Local Hugging Face Model
To use a model from local storage, first download it manually:
```console
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
```
Then, load the model by specifying a local directory path:
```python
llm = LLM(model=<local_path_to_model>)
```
> **Note:** Some models require accepting specific [license agreements](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.
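>
> For example, one common way to authenticate is with the Hugging Face CLI (this assumes the `huggingface_hub` package, which provides the `huggingface-cli` command, is installed):
>
> ```console
> huggingface-cli login
> ```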
## Tips and Troubleshooting
The following tips typically help new LLM API users who are already familiar with other TensorRT-LLM APIs:
### RuntimeError: only rank 0 can start multi-node session, got 1
There is no need to add an `mpirun` prefix when launching single-node multi-GPU inference with the LLM API.

For example, you can run `python llm_inference_distributed.py` to perform multi-GPU inference on a single node.
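The sketch below shows a minimal single-node multi-GPU setup launched with plain `python`; the model name, prompt, and `tensor_parallel_size` value are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # tensor_parallel_size=2 shards the model across 2 GPUs on this node;
    # launch the script with plain `python`, no `mpirun` prefix required.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", tensor_parallel_size=2)

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    for output in llm.generate(["Hello, my name is"], sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```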
### Hang issue on Slurm Node
If you experience a hang or another issue on a node managed by Slurm, add the prefix `mpirun -n 1 --oversubscribe --allow-run-as-root` to your launch command.

For example, try `mpirun -n 1 --oversubscribe --allow-run-as-root python llm_inference_distributed.py`.
### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
Because the LLM API relies on the `mpi4py` library, put the LLM class inside a function and protect the program's main entry point under the `__main__` namespace to avoid a [recursive spawn](https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html#mpipoolexecutor) process in `mpi4py`.

This limitation applies to multi-GPU inference only.
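The sketch below illustrates this pattern; the model name, prompt, and function name are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams


def run_inference():
    # Keep the LLM construction inside a function rather than at module scope,
    # so worker processes spawned by mpi4py do not re-execute it on import.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", tensor_parallel_size=2)
    for output in llm.generate(["The capital of France is"], SamplingParams(temperature=0.8)):
        print(output.outputs[0].text)


# Protect the entry point; without this guard, mpi4py may spawn processes recursively.
if __name__ == "__main__":
    run_inference()
```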
### Cannot quit after generation
The LLM instance manages threads and processes, which may prevent its reference count from reaching zero. To address this issue, there are two common solutions:

1. Wrap the LLM instance in a function, as demonstrated in the quickstart guide. This will reduce the reference count and trigger the shutdown process.
2. Use the LLM as a context manager: `with LLM(...) as llm: ...`. The shutdown method is invoked automatically when execution leaves the `with`-statement block.
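A minimal sketch of the context-manager pattern, with an illustrative model name and prompt:

```python
from tensorrt_llm import LLM

# Using the LLM as a context manager triggers its shutdown logic on exit,
# so the script can terminate cleanly once generation finishes.
with LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") as llm:
    for output in llm.generate(["Hello, my name is"]):
        print(output.outputs[0].text)
```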
### Single node hanging when using `docker run --net=host`
The root cause may be related to `mpi4py`. There is a [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) that suggests changing `--net=host` to `--ipc=host`, or setting the following environment variables:
```bash
export OMPI_MCA_btl_tcp_if_include=lo
export OMPI_MCA_oob_tcp_if_include=lo
```
Another option to improve compatibility with `mpi4py` is to launch the task using:
```bash
mpirun -n 1 --oversubscribe --allow-run-as-root python my_llm_task.py
```
This command can help avoid related runtime issues.