
TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Performance   |   Examples   |   Documentation   |   Roadmap


Latest News

  • [2025/01/07] 🌟 Getting Started with TensorRT-LLM ➡️ link

  • [2025/01/04] Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding ➡️ link

  • [2024/12/10] Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview ➡️ link

  • [2024/12/03] 🌟 Boost your AI inference throughput by up to 3.6x: we now support speculative decoding, tripling token throughput with NVIDIA TensorRT-LLM. Perfect for your generative AI apps. Learn how in this technical deep dive ➡️ link

  • [2024/12/02] Working on deploying ONNX models for performance-critical applications? Try our NVIDIA Nsight Deep Learning Designer, a user-friendly GUI with tight NVIDIA TensorRT integration that offers intuitive visualization of ONNX model graphs, quick tweaking of model architecture and parameters, detailed performance profiling with either ORT or TensorRT, and easy building of TensorRT engines ➡️ link

  • [2024/11/26] 📣 Introducing TensorRT-LLM for Jetson AGX Orin, with initial support in JetPack 6.1 via the v0.12.0-jetson branch of the TensorRT-LLM repo: pre-compiled TensorRT-LLM wheels & containers for easy integration, plus comprehensive guides & docs to get you started ➡️ link

  • [2024/11/21] NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200 ➡️ link

  • [2024/11/19] Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs ➡️ link

  • [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot ➡️ link

  • [2024/11/09] NVIDIA advances the AI ecosystem with the AI model of LG AI Research 🙌 ➡️ link

  • [2024/11/02] 🌟🌟🌟 NVIDIA and LlamaIndex Developer Contest 🙌 Enter for a chance to win prizes including an NVIDIA® GeForce RTX™ 4080 SUPER GPU, DLI credits, and more🙌 ➡️ link

Previous News

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, INT4 AWQ, INT8 SmoothQuant, ++), and much more, to perform inference efficiently on NVIDIA GPUs.
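
As a concrete illustration, these quantization modes can be selected through the high-level LLM API. The snippet below is a minimal sketch: the QuantConfig/QuantAlgo names follow the tensorrt_llm.llmapi module, and the model name is just a placeholder, so check the documentation for the exact options available in your release.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# INT4 AWQ is one of the modes listed above; FP8 and INT8 SmoothQuant
# are selected the same way through QuantAlgo.
quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

# Placeholder checkpoint; quantization is applied when the engine is built.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          quant_config=quant_config)
```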

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
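
A minimal sketch of that Python API, following the quick-start pattern from the documentation (the model name and sampling values are placeholders; tensor_parallel_size enables the Tensor Parallelism mentioned above):

```python
from tensorrt_llm import LLM, SamplingParams

# Building the LLM compiles an optimized TensorRT engine under the hood.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Multi-GPU example: LLM(model=..., tensor_parallel_size=2)

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```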

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.
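
To give a flavor of that PyTorch-like API, here is a rough sketch of how a custom block could be composed from TensorRT-LLM building blocks. The Module and Linear names mirror tensorrt_llm.module and tensorrt_llm.layers in the repo; the block itself is illustrative, not one of the bundled models.

```python
from tensorrt_llm import Module
from tensorrt_llm.functional import relu
from tensorrt_llm.layers import Linear

class TinyFeedForward(Module):
    """Illustrative feed-forward block in the PyTorch-like style."""

    def __init__(self, hidden_size, ffn_size):
        super().__init__()
        self.up = Linear(hidden_size, ffn_size)
        self.down = Linear(ffn_size, hidden_size)

    def forward(self, x):
        # Composed like torch.nn, but traced into a TensorRT network
        # at engine-build time rather than executed eagerly.
        return self.down(relu(self.up(x)))
```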

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines" which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.
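
For example, with the high-level LLM API the build/serialize/deploy split might look like the following sketch (LLM.save and engine-directory loading follow the LLM API documentation; the paths and model name are placeholders):

```python
from tensorrt_llm import LLM

# Ahead-of-time compilation: constructing the LLM builds a TensorRT
# engine specialized for the GPU it runs on.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Serialize the engine for later deployment in production.
llm.save("./tinyllama_engine")

# Later, load the prebuilt engine instead of rebuilding it.
llm = LLM(model="./tinyllama_engine")
```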

Getting Started

To get started with TensorRT-LLM, visit our documentation.

Community