Welcome to TensorRT-LLM’s documentation!

Contents:

  • TensorRT-LLM Architecture
  • C++ GPT Runtime
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Multi-head, Multi-query and Group-query Attention
  • Numerical Precision
  • Build from Source
  • Performance of TensorRT-LLM
  • How to debug
  • How to add a new model
  • Graph Rewriting Module
  • Memory Usage of TensorRT-LLM
  • New Workflow
  • Run gpt-2b + LoRA using GptManager / cpp runtime
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis of TensorRT-LLM

Python API

  • tensorrt_llm.layers

  • tensorrt_llm.functional

  • tensorrt_llm.models

  • tensorrt_llm.plugin

  • tensorrt_llm.quantization

  • tensorrt_llm.runtime
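
The modules above cover the main stages of the Python workflow, roughly: model definition (tensorrt_llm.models, tensorrt_llm.layers, tensorrt_llm.functional), build-time configuration (tensorrt_llm.plugin, tensorrt_llm.quantization), and execution (tensorrt_llm.runtime). For orientation, here is a minimal text-generation sketch using the high-level LLM API shipped in recent tensorrt_llm releases; the model name and sampling values are illustrative placeholders, not taken from this documentation.

    # Minimal sketch, assuming a recent tensorrt_llm release that exposes
    # the high-level LLM API; the model name and sampling values below are
    # placeholders chosen for illustration.
    from tensorrt_llm import LLM, SamplingParams

    def main():
        # Builds (or loads) a TensorRT engine for the model, then generates.
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95)
        for output in llm.generate(["Hello, my name is"], params):
            print(output.outputs[0].text)

    if __name__ == "__main__":
        main()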

C++ API

  • cpp/runtime

Indices and tables

  • Index
  • Module Index
  • Search Page

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM

© Copyright 2023, NVIDIA.
