tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Key Features
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows

LLM API Examples

  • LLM Examples Introduction
  • Common Customizations
  • Examples
    • LLM Generate
    • LLM Generate Async
    • LLM Generate Async Streaming
    • LLM Generate Distributed
    • LLM Quantization
    • LLM Auto Parallel

LLM API

  • API Reference

Model Definition API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

C++ API

  • Executor
  • Runtime

Command-Line Reference

  • trtllm-build

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Graph Rewriting Module
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Responses
  • Run gpt-2b + LoRA using GptManager / cpp runtime
  • Expert Parallelism in TensorRT-LLM

Performance

  • Overview
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

Examples

Scripts

  • LLM Generate
  • LLM Generate Async
  • LLM Generate Async Streaming
  • LLM Generate Distributed
  • LLM Quantization
  • LLM Auto Parallel
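
The scripts above exercise the high-level LLM API. As an illustrative sketch only (not a copy of the bundled scripts), the snippet below shows the typical shape of an "LLM Generate" example, assuming the tensorrt_llm.LLM and SamplingParams entry points from the LLM API; the model name is a placeholder.

    # Minimal sketch of an "LLM Generate"-style script (illustrative; the
    # shipped example may differ). Assumes the high-level LLM API exposes
    # LLM and SamplingParams at the top level of the tensorrt_llm package.
    from tensorrt_llm import LLM, SamplingParams

    # Placeholder model; any supported Hugging Face checkpoint or
    # pre-built TensorRT-LLM engine directory could be used here.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Synchronous batched generation; see the async and streaming
    # examples listed above for non-blocking variants.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")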