
TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Results   |   Examples   |   Documentation


Latest News

  • [2024/08/13] 🐍 DIY Code Completion with #Mamba #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere ➡️ link

  • [2024/08/06] 🗫 Multilingual Challenge Accepted 🗫 🤖 #TensorRT #LLM boosts low-resource languages like Hebrew, Indonesian and Vietnamese ➡️ link

  • [2024/07/30] Introducing🍊 @SliceXAI ELM Turbo 🤖 train ELM once #TensorRT #LLM optimize ☁️ deploy anywhere ➡️ link

  • [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized 🦙 400 tok/s - per node 🦙 37 tok/s - per user 🦙 1 node inference ➡️ link

  • [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference: MultiLingual NIM LoRA tuned adaptors ➡️ Tech blog

  • [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100. ➡️ Tech blog

  • [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai's solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ➡️ link

  • [2024/06/18] ICYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization ➡️ link

  • [2024/06/18] 🧰 Deploying ComfyUI with TensorRT? Here's your setup guide ➡️ link

  • [2024/06/11] #TensorRT Weight-Stripped Engines Technical Deep Dive for serious coders: +99% compression, 1 set of weights → ** GPUs, 0 performance loss, ** models…LLM, CNN, etc. ➡️ link

  • [2024/06/04] #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸 🎥 Demo: ➡️ link 📗 DIY notebook: ➡️ link

Previous News

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, quantization (FP8, INT4 AWQ, INT8 SmoothQuant, among others), and much more, to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
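
For a flavor of that API, here is a minimal sketch using the high-level LLM entry point (hedged: the model name is only a placeholder, and the exact SamplingParams fields can differ between releases):

    from tensorrt_llm import LLM, SamplingParams

    # Compiling the TensorRT engine happens under the hood on first use;
    # the checkpoint name below is a placeholder example.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(["What is TensorRT-LLM?"], sampling):
        print(output.outputs[0].text)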

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.
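
As a rough illustration of that PyTorch-like style, a custom block might look like the sketch below (assumptions: Module, Linear, and relu live in tensorrt_llm's top-level, layers, and functional namespaces, as in recent releases):

    from tensorrt_llm import Module
    from tensorrt_llm.layers import Linear
    from tensorrt_llm.functional import relu

    class TinyMLP(Module):
        # A small feed-forward block written against the graph-building API.
        def __init__(self, hidden_size):
            super().__init__()
            self.fc1 = Linear(hidden_size, 4 * hidden_size)
            self.fc2 = Linear(4 * hidden_size, hidden_size)

        def forward(self, x):
            # These calls add layers to the TensorRT network being built;
            # nothing executes eagerly here.
            return self.fc2(relu(self.fc1(x)))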

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines" which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.
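
A minimal sketch of that ahead-of-time flow, assuming the build/BuildConfig helpers and a pre-defined model class as exposed in recent releases (the checkpoint path and size limits are illustrative):

    from tensorrt_llm import BuildConfig, build
    from tensorrt_llm.models import LLaMAForCausalLM

    # Load weights from a Hugging Face checkpoint into the pre-defined model.
    model = LLaMAForCausalLM.from_hugging_face("meta-llama/Llama-2-7b-hf")

    # Compile an engine for the local GPU, with fixed upper bounds baked in.
    config = BuildConfig(max_batch_size=8, max_input_len=1024, max_seq_len=2048)
    engine = build(model, config)

    # Serialize the engine for benchmarking or later production deployment.
    engine.save("./llama-2-7b-engine")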

Getting Started

To get started with TensorRT-LLM, visit our documentation.

Community