TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Results   |   Examples   |   Documentation


Latest News

  • [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized 🦙 400 tok/s - per node 🦙 37 tok/s - per user 🦙 1 node inference ➡️ link

  • [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference: MultiLingual, NIM, LoRA-tuned adapters ➡️ Tech blog

  • [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100. ➡️ Tech blog

  • [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai's solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ➡️ link

  • [2024/06/18] ICYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization ➡️ link

  • [2024/06/18] 🧰 Deploying ComfyUI with TensorRT? Here's your setup guide ➡️ link

  • [2024/06/11] #TensorRT Weight-Stripped Engines: a technical deep dive for serious coders. +99% compression, 1 set of weights → ** GPUs, 0 performance loss, ** models…LLM, CNN, etc. ➡️ link

  • [2024/06/04] #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸 🎥 Demo: ➡️ link 📗 DIY notebook: ➡️ link

  • [2024/05/28] #TensorRT weight stripping for ResNet-50: +99% compression, 1 set of weights → ** GPUs, 0 performance loss, ** models…LLM, CNN, etc. 👀 📚 DIY ➡️ link

  • [2024/05/21] @modal_labs has the code for serverless @AIatMeta Llama 3 on #TensorRT #LLM 👀 📚 Marvelous Modal Manual: Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs ➡️ link

  • [2024/05/08] NVIDIA TensorRT Model Optimizer, the newest member of the #TensorRT ecosystem, is a library of post-training and training-in-the-loop model optimization techniques: quantization, sparsity, QAT ➡️ blog

  • [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫 Meta Llama 3 takes off with #TensorRT #LLM 📚 ➡️ link

Previous News

TensorRT-LLM Overview

TensorRT-LLM is an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
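As an illustration, a multi-GPU deployment is typically described with a Mapping object; a minimal sketch, assuming a recent release (argument names may vary by version):

```python
# Sketch: an 8-GPU deployment combining 4-way tensor parallelism with
# 2-way pipeline parallelism (world_size = tp_size * pp_size).
from tensorrt_llm import Mapping

mapping = Mapping(world_size=8, tp_size=4, pp_size=2)
```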

The TensorRT-LLM Python API architecture looks similar to the PyTorch API. It provides a functional module containing functions like einsum, softmax, matmul, or view. The layers module bundles useful building blocks to assemble LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the models module.
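For a feel of that PyTorch-like style, here is a minimal sketch that composes functional ops inside a network-definition context. The shapes are illustrative, and exact signatures may differ between releases:

```python
import tensorrt as trt

import tensorrt_llm
from tensorrt_llm import Tensor
from tensorrt_llm.functional import matmul, softmax

builder = tensorrt_llm.Builder()
network = builder.create_network()

# Graph definition happens inside a network context, but the ops
# themselves read much like PyTorch.
with tensorrt_llm.net_guard(network):
    q = Tensor(name="q", dtype=trt.float16, shape=(1, 8, 64))
    k = Tensor(name="k", dtype=trt.float16, shape=(1, 8, 64))
    scores = softmax(matmul(q, k, transb=True), dim=-1)
    scores.mark_output("scores")
```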

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs. Refer to the Support Matrix for a list of supported models.
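A hedged sketch of that workflow, assuming a recent release where model classes expose from_hugging_face and a top-level build helper (the checkpoint path and build options are illustrative):

```python
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

# Load a Hugging Face checkpoint into the pre-defined LLaMA model
# definition (local path is illustrative).
model = LLaMAForCausalLM.from_hugging_face("./Meta-Llama-3-8B")

# Compile the model definition into an optimized TensorRT engine
# and serialize it to disk.
engine = tensorrt_llm.build(model, BuildConfig(max_batch_size=8))
engine.save("./llama3_engine")
```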

To maximize performance and reduce memory footprint, TensorRT-LLM allows models to be executed using different quantization modes (refer to the support matrix). TensorRT-LLM supports INT4 and INT8 weights with FP16 activations (a.k.a. INT4/INT8 weight-only quantization) as well as a complete implementation of the SmoothQuant technique.
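As an illustration, these modes can be expressed through the QuantMode flags in tensorrt_llm.quantization; which modes a given model supports varies, so treat this as a sketch:

```python
from tensorrt_llm.quantization import QuantMode

# INT4 weights with FP16 activations (weight-only quantization).
int4_weight_only = QuantMode.use_weight_only(use_int4_weights=True)

# SmoothQuant: INT8 weights and activations, here with per-token and
# per-channel scaling.
smooth_quant = QuantMode.use_smooth_quant(per_token=True, per_channel=True)
```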

Getting Started

To get started with TensorRT-LLM, visit our documentation.
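For a first taste, here is a minimal quick-start sketch using the high-level LLM API available in recent releases (the model name is illustrative):

```python
from tensorrt_llm import LLM, SamplingParams

# The LLM entry point builds an engine from a Hugging Face model on
# first use (model name is illustrative).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```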

Community