# TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Architecture | Results | Examples | Documentation

## Latest News
- [Weekly] Check out @NVIDIAAIDev & NVIDIA AI LinkedIn for the latest updates!
- [2024/02/06] 🚀 Speed up inference with SOTA quantization techniques in TRT-LLM
- [2024/01/30] New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- [2023/12/04] Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
- [2023/11/27] SageMaker LMI now supports TensorRT-LLM, improving throughput by 60% compared to the previous version
- [2023/11/13] H200 achieves nearly 12,000 tok/sec on Llama2-13B
- [2023/10/22] 🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙
- [2023/10/19] Getting Started Guide - Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available
- [2023/10/17] Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows
## TensorRT-LLM Overview
TensorRT-LLM is an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, ranging from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
The TensorRT-LLM Python API architecture looks similar to the PyTorch API. It provides a `functional` module containing functions like `einsum`, `softmax`, `matmul` or `view`. The `layers` module bundles useful building blocks to assemble LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, like `GPTAttention` or `BertAttention`, can be found in the `models` module.
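As an illustrative sketch of how these pieces compose in a PyTorch-like style (only the `functional` ops and the `layers`/`Module` concepts come from the description above; the exact constructor signatures below are assumptions, not verbatim TensorRT-LLM API):

```python
# Illustrative sketch only: the layer constructor arguments are assumptions,
# not verbatim TensorRT-LLM signatures.
from tensorrt_llm.functional import matmul, softmax  # functional ops named above
from tensorrt_llm.layers import MLP                   # a building block from `layers`
from tensorrt_llm.module import Module                # PyTorch-like module base class


class TinyBlock(Module):
    """A toy block assembled from `layers` and `functional`, PyTorch-style."""

    def __init__(self, hidden_size, ffn_hidden_size):
        super().__init__()
        self.mlp = MLP(hidden_size=hidden_size,
                       ffn_hidden_size=ffn_hidden_size,
                       hidden_act='gelu')

    def forward(self, hidden_states, scores):
        # `functional` ops compose like their PyTorch counterparts.
        weights = softmax(scores, dim=-1)
        mixed = matmul(weights, hidden_states)
        return self.mlp(mixed)
```

Unlike PyTorch, such modules are not executed eagerly: they are traced into a TensorRT network when the engine is built.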
TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. Refer to the Support Matrix for a list of supported models.
To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (refer to the Support Matrix). TensorRT-LLM supports INT4 or INT8 weights (with FP16 activations, a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
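As a minimal sketch, a quantization mode can be described with the `QuantMode` helpers (the keyword arguments below are assumptions about the `tensorrt_llm.quantization` API, not verbatim documentation):

```python
# Sketch of selecting a quantization mode; the keyword arguments are assumptions
# about the QuantMode helpers, not verbatim documentation.
from tensorrt_llm.quantization import QuantMode

# INT4 weight-only quantization (weights stored in INT4, activations kept in FP16).
weight_only_mode = QuantMode.use_weight_only(use_int4_weights=True)

# SmoothQuant: INT8 GEMMs with per-token / per-channel scaling factors.
smooth_quant_mode = QuantMode.use_smooth_quant(per_token=True, per_channel=True)
```

The chosen mode is then passed to the model definition and engine build step so that the corresponding quantized kernels are selected.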
## Getting Started
To get started with TensorRT-LLM, visit our documentation: