# Build TensorRT-LLM

- [Overview](#overview)
- [Fetch the Sources](#fetch-the-sources)
- [Build TensorRT-LLM in One Step](#build-tensorrt-llm-in-one-step)
- [Build Step-by-step](#build-step-by-step)
  - [Create the Container](#create-the-container)
    - [On Systems with GNU `make`](#on-systems-with-gnu-make)
    - [On Systems without GNU `make`](#on-systems-without-gnu-make)
  - [Build TensorRT-LLM](#build-tensorrt-llm)
  - [Build the Python Bindings for the C++ Runtime](#build-the-python-bindings-for-the-c++-runtime)
  - [Link with the TensorRT-LLM C++ Runtime](#link-with-the-tensorrt-llm-c++-runtime)
  - [Supported C++ Header Files](#supported-c++-header-files)

## Overview

This document contains instructions to build TensorRT-LLM from source. TensorRT-LLM depends on the latest versions of TensorRT and [Polygraphy](https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy), which are distributed separately and should be copied into this repository.

We recommend the use of [Docker](https://www.docker.com) to build and run TensorRT-LLM. Instructions to install an environment for running Docker containers on the NVIDIA platform can be found [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).

## Fetch the Sources

The first step to build TensorRT-LLM is to fetch the sources:

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull
```

## Build TensorRT-LLM in One Step

TensorRT-LLM contains a simple command to create a Docker image:

```bash
make -C docker release_build
```

The optional argument `CUDA_ARCHS="<list of architectures in CMake format>"` can be added to specify which architectures TensorRT-LLM should support. Restricting the set of supported GPU architectures helps reduce compilation time:

```bash
# Restrict the compilation to Ada and Hopper architectures.
make -C docker release_build CUDA_ARCHS="89-real;90-real"
```

Once the image is built, the Docker container can be executed using:

```bash
make -C docker release_run
```

The `make` command supports the `LOCAL_USER=1` argument to switch to the local user account instead of `root` inside the container. The TensorRT-LLM examples are installed in the `/app/tensorrt_llm/examples` directory.

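For example, to start the release container under your own user account (a minimal sketch, assuming `release_run` accepts `LOCAL_USER=1` in the same way as the development targets shown later):

```bash
make -C docker release_run LOCAL_USER=1
```
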
## Build Step-by-step

For users looking for more flexibility, TensorRT-LLM has commands to create and run a development container in which TensorRT-LLM can be built.

### Create the Container

#### On Systems with GNU `make`

The following command creates a Docker image for development:

```bash
make -C docker build
```

The image will be tagged locally with `tensorrt_llm/devel:latest`. To run the container, use the following command:

```bash
make -C docker run
```

For users who prefer to work with their own user account in that container instead of `root`, the option `LOCAL_USER=1` must be added to the command above:

```bash
make -C docker run LOCAL_USER=1
```

#### On Systems Without GNU `make`

On systems without GNU `make` or shell support, the Docker image for development can be built using:

```bash
docker build --pull \
             --target devel \
             --file docker/Dockerfile.multi \
             --tag tensorrt_llm/devel:latest \
             .
```

The container can then be run using:

```bash
docker run --rm -it \
           --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
           --volume ${PWD}:/code/tensorrt_llm \
           --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest
```

### Build TensorRT-LLM

Once in the container, TensorRT-LLM can be built from source using:

```bash
# Build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

# Deploy TensorRT-LLM in your environment.
pip install ./build/tensorrt_llm*.whl
```

By default, `build_wheel.py` enables incremental builds. To clean the build directory, add the `--clean` option:

```bash
python3 ./scripts/build_wheel.py --clean --trt_root /usr/local/tensorrt
```

It is possible to restrict the compilation of TensorRT-LLM to specific CUDA architectures. For that purpose, the `build_wheel.py` script accepts a semicolon-separated list of CUDA architectures as shown in the following example:

```bash
# Build TensorRT-LLM for Ampere.
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --trt_root /usr/local/tensorrt
```

The list of supported architectures can be found in the [`CMakeLists.txt`](source:cpp/CMakeLists.txt) file.

### Build the Python Bindings for the C++ Runtime

The C++ Runtime, in particular [`GptSession`](source:cpp/include/tensorrt_llm/runtime/gptSession.h), can be exposed to Python via [bindings](source:cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be explicitly activated at compile time. The corresponding option `--python_bindings` can be specified to `build_wheel.py` in the standard way:

```bash
python3 ./scripts/build_wheel.py --python_bindings --trt_root /usr/local/tensorrt
```

After installing the resulting wheel as described above, the C++ Runtime bindings will be available in the `tensorrt_llm.bindings` package. Running `help` on this package in a Python interpreter will provide an overview of the relevant classes. The [associated unit tests](source:tests/bindings) should also be consulted for understanding the API.

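For instance, a quick way to get that overview from the shell (a minimal sketch, assuming the wheel built with `--python_bindings` is installed in the current environment):

```bash
# Print the help text for the C++ runtime bindings.
python3 -c "import tensorrt_llm.bindings as bindings; help(bindings)"
```
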
### Link with the TensorRT-LLM C++ Runtime

The `build_wheel.py` script will also compile the library containing the C++ runtime of TensorRT-LLM. If Python support and `torch` modules are not required, the script provides the option `--cpp_only` which restricts the build to the C++ runtime only:

```bash
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --cpp_only --clean
```

This is particularly useful to avoid linking problems which may be introduced by particular versions of `torch` related to the [dual ABI support of GCC](https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html). The option `--clean` will remove the build directory before building. The default build directory is `cpp/build`, which may be overridden using the option `--build_dir`. Run `build_wheel.py --help` for an overview of all supported options.

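For example, a clean C++-only build into a custom build directory could look like this (the directory name here is only an illustration):

```bash
python3 ./scripts/build_wheel.py --cpp_only --clean --build_dir cpp/build_cpp_only
```
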
Clients may choose to link against the shared or the static version of the library. These libraries can be found in the following locations:

```bash
cpp/build/tensorrt_llm/libtensorrt_llm.so
cpp/build/tensorrt_llm/libtensorrt_llm_static.a
```

In addition, one needs to link against the library containing the LLM plugins for TensorRT available here:

```bash
cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
```

### Supported C++ Header Files

When using TensorRT-LLM, you need to add the `cpp` and `cpp/include` directories to the project's include paths. Only header files contained in `cpp/include` are part of the supported API and may be directly included. Other headers contained under `cpp` should not be included directly since they might change in future versions.

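As a rough illustration, a small client program could be compiled against the supported headers and linked with the shared runtime and plugin libraries along these lines (a sketch only: `my_app.cpp` is a placeholder, the library paths assume the default `cpp/build` layout, and a real build typically also needs the TensorRT and CUDA headers and libraries on the include and link lines):

```bash
# Compile and link a hypothetical client against the TensorRT-LLM C++ runtime.
g++ -std=c++17 my_app.cpp \
    -Icpp -Icpp/include \
    -Lcpp/build/tensorrt_llm -ltensorrt_llm \
    -Lcpp/build/tensorrt_llm/plugins -lnvinfer_plugin_tensorrt_llm \
    -o my_app
```
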
For examples of how to use the C++ runtime, see the unit tests in [gptSessionTest.cpp](source:cpp/tests/runtime/gptSessionTest.cpp) and the related [CMakeLists.txt](source:cpp/tests/CMakeLists.txt) file.