
TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Results   |   Examples   |   Documentation   |   Roadmap


Latest News

  • [2024/11/02] 🌟🌟🌟 NVIDIA and LlamaIndex Developer Contest 🙌 Enter for a chance to win prizes including an NVIDIA® GeForce RTX™ 4080 SUPER GPU, DLI credits, and more 🙌 ➡️ link

  • [2024/10/28] 🏎️🏎️🏎️ NVIDIA GH200 Superchip Accelerates Inference by 2x in Multiturn Interactions with Llama Models ➡️ link

  • [2024/10/22] New 📝 Step-by-step instructions on how to optimize LLMs with NVIDIA TensorRT-LLM, deploy the optimized models with Triton Inference Server, and autoscale the deployment in a Kubernetes environment. 🙌 Technical Deep Dive: ➡️ link

  • [2024/10/07] 🚀🚀🚀Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries ➡️ link

  • [2024/09/29] 🌟 AI at Meta: PyTorch + TensorRT v2.4 🌟 TensorRT 10.1, PyTorch 2.4, CUDA 12.4, Python 3.12 ➡️ link

  • [2024/09/17] NVIDIA TensorRT-LLM Meetup ➡️ link

  • [2024/09/17] Accelerating LLM Inference at Databricks with TensorRT-LLM ➡️ link

  • [2024/09/17] TensorRT-LLM @ Baseten ➡️ link

  • [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML ➡️ link

Previous News

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, and quantization (FP8, INT4 AWQ, INT8 SmoothQuant, and more), to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
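
A minimal sketch of what that Python API looks like, loosely following the quick-start example for the high-level LLM API; the model name and sampling values below are illustrative placeholders:

```python
# Minimal sketch of the high-level LLM API (model name and sampling
# parameters are illustrative).
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Downloads the Hugging Face checkpoint and builds a TensorRT engine for it.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    # Generation runs on the in-flight batching runtime under the hood.
    for output in llm.generate(prompts, sampling_params):
        print(f"{output.prompt!r} -> {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```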

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.
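
As a hypothetical illustration of that PyTorch-like API, the sketch below defines a small feed-forward block; the class name and layer sizes are made up, but the pattern (declare layers in `__init__`, compose them in `forward`) mirrors the model definitions shipped under `tensorrt_llm/models`:

```python
# Hypothetical sketch of extending a model with the PyTorch-like API.
from tensorrt_llm import Module
from tensorrt_llm.functional import relu
from tensorrt_llm.layers import Linear


class MyFeedForward(Module):
    def __init__(self, hidden_size, intermediate_size, dtype=None):
        super().__init__()
        # Layers are declared in __init__, much like torch.nn.Module.
        self.fc = Linear(hidden_size, intermediate_size, dtype=dtype)
        self.proj = Linear(intermediate_size, hidden_size, dtype=dtype)

    def forward(self, hidden_states):
        # forward() traces symbolic tensors into the TensorRT network
        # instead of executing eagerly.
        return self.proj(relu(self.fc(hidden_states)))
```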

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines" which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.
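
The ahead-of-time flow can also be driven from the LLM API. The sketch below is an assumption based on the LLM API documentation rather than a definitive recipe: the `BuildConfig` fields and the `save()`/reload helpers may differ between releases.

```python
# Sketch of ahead-of-time engine building and serialization (assumed API;
# verify against the documentation for your installed release).
from tensorrt_llm import LLM, BuildConfig


def build_engine():
    # Compile the model into a TensorRT engine tuned for the local GPU.
    build_config = BuildConfig(max_batch_size=8, max_input_len=1024)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              build_config=build_config)

    # Serialize the engine so it can be benchmarked and deployed later.
    llm.save("./tinyllama_engine")


def load_engine():
    # Point the LLM API at the serialized engine directory to skip rebuilding.
    return LLM(model="./tinyllama_engine")
```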

Getting Started

To get started with TensorRT-LLM, visit our documentation.

Community