
TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Performance   |   Examples   |   Documentation   |   Roadmap


Latest News

  • [2025/02/25] 🌟 DeepSeek-R1 performance now optimized for Blackwell ➡️ link

HGX B200 (8 GPUs) vs HGX H200 (8 GPUs) vs 2 x HGX H100 (normalized to 8 GPUs for comparison). Input tokens not included in TPS calculations. TensorRT-LLM Version: 0.18.0.dev2025021800 (pre-release) used for Feb measurements, SGLang used for Jan measurements. Hopper numbers in FP8. B200 numbers in FP4. Max concurrency use case. ISL/OSL: 1K/1K.

  • [2025/01/07] 🌟 Getting Started with TensorRT-LLM ➡️ link

  • [2025/01/04] Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding ➡️ link

  • [2024/12/10] Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview ➡️ link

  • [2024/12/03] 🌟 Boost your AI inference throughput by up to 3.6x. We now support speculative decoding, which can triple token throughput with NVIDIA TensorRT-LLM. Perfect for your generative AI apps. Learn how in this technical deep dive ➡️ link

  • [2024/12/02] Working on deploying ONNX models for performance-critical applications? Try NVIDIA Nsight Deep Learning Designer, a user-friendly GUI with tight NVIDIA TensorRT integration that offers intuitive visualization of ONNX model graphs, quick tweaking of model architecture and parameters, detailed performance profiling with either ORT or TensorRT, and easy building of TensorRT engines ➡️ link

  • [2024/11/26] 📣 Introducing TensorRT-LLM for Jetson AGX Orin, with initial support in JetPack 6.1 via the v0.12.0-jetson branch of the TensorRT-LLM repo: pre-compiled TensorRT-LLM wheels and containers for easy integration, plus comprehensive guides and docs to get you started ➡️ link

  • [2024/11/21] NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200 ➡️ link

  • [2024/11/19] Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs ➡️ link

  • [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot ➡️ link

  • [2024/11/09] NVIDIA advances the AI ecosystem with the AI model of LG AI Research 🙌 ➡️ link

  • [2024/11/02] 🌟🌟🌟 NVIDIA and LlamaIndex Developer Contest 🙌 Enter for a chance to win prizes including an NVIDIA® GeForce RTX™ 4080 SUPER GPU, DLI credits, and more🙌 ➡️ link

Previous News

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, quantization (FP8, INT4 AWQ, INT8 SmoothQuant, and more), and much more, to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
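For a concrete picture of that workflow, here is a minimal sketch using the high-level LLM API from the Quick Start Guide; the TinyLlama checkpoint name is only an example, and any supported model can be substituted:

```python
from tensorrt_llm import LLM, SamplingParams

# Example prompts to complete.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Basic sampling configuration.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Builds (or loads) an optimized TensorRT engine for the model.
# The checkpoint name is only an example; see the Support Matrix for options.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Generation runs on the optimized runtime under the hood.
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```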

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.
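To give a flavor of that PyTorch-like style, the sketch below defines a toy block; the specific layer imports and constructor signatures are assumptions for illustration, and the pre-defined models under tensorrt_llm/models are the authoritative reference:

```python
import tensorrt_llm
from tensorrt_llm.functional import relu
from tensorrt_llm.layers import Linear


class TinyMLP(tensorrt_llm.Module):
    """Toy two-layer block written in the PyTorch-like, graph-building style."""

    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        # Layer choices here are illustrative; real models compose the blocks
        # found under tensorrt_llm/layers and tensorrt_llm/models.
        self.fc = Linear(hidden_size, intermediate_size)
        self.proj = Linear(intermediate_size, hidden_size)

    def forward(self, x):
        # These calls record operations into the TensorRT network definition
        # at build time rather than executing eagerly.
        return self.proj(relu(self.fc(x)))
```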

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines" which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.
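To make the ahead-of-time flow concrete, here is a hedged sketch of the build-once, deploy-later pattern with the LLM API; the save/load-from-directory calls and the path are assumptions for illustration, and the trtllm-build CLI plus the deployment docs describe the production route in detail:

```python
from tensorrt_llm import LLM

# Build the engine once, offline, on the target GPU architecture.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Serialize the compiled engine to a directory (placeholder path).
llm.save("./tinyllama_engine")

# At deployment time, point the runtime at the serialized engine directory
# instead of rebuilding from the original checkpoint.
llm = LLM(model="./tinyllama_engine")
```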

Getting Started

To get started with TensorRT-LLM, visit our documentation:

Community