tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Graph Rewriting Module
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Run gpt-2b + LoRA using GptManager / C++ runtime
  • Expert Parallelism in TensorRT-LLM

Performance

  • Overview
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

C++ API

  • Runtime

Python API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime
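
As a quick orientation to how these modules fit together, here is a minimal usage sketch of the Python runtime API. The engine path and tokenizer name are placeholders, and the exact keyword arguments accepted by generate() vary somewhat across releases, so treat this as illustrative rather than canonical:

  import torch
  from transformers import AutoTokenizer

  from tensorrt_llm.runtime import ModelRunner

  # Placeholder paths: point engine_dir at a directory produced by the
  # TensorRT-LLM build workflow, and load the tokenizer matching the model.
  engine_dir = "/path/to/engine_dir"
  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # ModelRunner wraps a generation session restored from the serialized engine.
  runner = ModelRunner.from_dir(engine_dir=engine_dir)

  # generate() expects a list of 1-D int32 tensors, one per batch entry.
  input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()

  with torch.no_grad():
      output_ids = runner.generate(
          batch_input_ids=[input_ids[0]],
          max_new_tokens=32,
          end_id=tokenizer.eos_token_id,
          pad_id=tokenizer.eos_token_id,
      )

  # Output shape is [batch, num_beams, seq_len]; decode the first beam.
  print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))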

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

All modules for which code is available

  • tensorrt_llm.functional
  • tensorrt_llm.layers.activation
  • tensorrt_llm.layers.attention
  • tensorrt_llm.layers.cast
  • tensorrt_llm.layers.conv
  • tensorrt_llm.layers.embedding
  • tensorrt_llm.layers.linear
  • tensorrt_llm.layers.mlp
  • tensorrt_llm.layers.normalization
  • tensorrt_llm.layers.pooling
  • tensorrt_llm.models.baichuan.model
  • tensorrt_llm.models.bert.model
  • tensorrt_llm.models.bloom.model
  • tensorrt_llm.models.chatglm.model
  • tensorrt_llm.models.enc_dec.model
  • tensorrt_llm.models.falcon.model
  • tensorrt_llm.models.gemma.model
  • tensorrt_llm.models.gpt.model
  • tensorrt_llm.models.gptj.model
  • tensorrt_llm.models.gptneox.model
  • tensorrt_llm.models.llama.model
  • tensorrt_llm.models.mamba.model
  • tensorrt_llm.models.medusa.model
  • tensorrt_llm.models.modeling_utils
  • tensorrt_llm.models.mpt.model
  • tensorrt_llm.models.opt.model
  • tensorrt_llm.models.phi.model
  • tensorrt_llm.models.quantized.quant
  • tensorrt_llm.models.qwen.model
  • tensorrt_llm.plugin.plugin
  • tensorrt_llm.quantization.mode
  • tensorrt_llm.quantization.quantize_by_ammo
  • tensorrt_llm.runtime.generation
  • tensorrt_llm.runtime.kv_cache_manager
  • tensorrt_llm.runtime.model_runner
  • tensorrt_llm.runtime.model_runner_cpp
  • tensorrt_llm.runtime.session

© Copyright 2023, NVIDIA.
