tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Key Features
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows
  • Installing on Grace Hopper

LLM API

  • API Introduction
  • API Reference

LLM API Examples

  • LLM Examples Introduction
  • Common Customizations
  • Examples
    • Generate Text with Guided Decoding
    • Generate Text
    • Generate Text Asynchronously
    • Generate Text in Streaming Mode
    • Distributed LLM Generation
    • Control Generated Text with a Logits Post-Processor
    • Generate Text Using Lookahead Decoding
    • Generate Text Using Medusa Decoding
    • Generate Text with Multiple LoRA Adapters
    • Generation with Quantization
    • Automatic Parallelism with LLM

Model Definition API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

C++ API

  • Executor
  • Runtime

Command-Line Reference

  • trtllm-build
  • trtllm-serve

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Executor API
  • Graph Rewriting Module
  • Inference Request
  • Responses
  • Run GPT-2B + LoRA Using GptManager / C++ Runtime
  • Expert Parallelism in TensorRT-LLM
  • KV Cache Reuse
  • Speculative Sampling

Performance

  • Overview
  • Benchmarking
  • Best Practices
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

Examples

Scripts

  • Generate Text with Guided Decoding
  • Generate Text
  • Generate Text Asynchronously
  • Generate Text in Streaming Mode
  • Distributed LLM Generation
  • Control Generated Text with a Logits Post-Processor
  • Generate Text Using Lookahead Decoding
  • Generate Text Using Medusa Decoding
  • Generate Text with Multiple LoRA Adapters
  • Generation with Quantization
  • Automatic Parallelism with LLM
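
These scripts all build on the same high-level LLM API workflow: construct an LLM, define SamplingParams, and call a generate method. The following is a minimal sketch of that shared pattern, assuming the documented LLM API quick-start usage; the model ID and prompts are illustrative placeholders, not values taken from this page.

    from tensorrt_llm import LLM, SamplingParams

    # Placeholder model: any Hugging Face checkpoint supported by
    # TensorRT-LLM (or a prebuilt engine directory) can be passed here.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # generate() runs the prompts as a batch and returns one result per prompt.
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")

The asynchronous and streaming scripts replace the blocking generate() call with its asynchronous counterpart, while most of the remaining scripts vary the LLM construction (quantization, LoRA adapters, parallelism) and the decoding configuration rather than this overall flow.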
