tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Graph Rewriting Module
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Run gpt-2b + LoRA using GptManager / C++ runtime
  • Expert Parallelism in TensorRT-LLM

Performance

  • Overview
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

C++ API

  • Runtime

Python API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime
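
As a quick orientation to how these modules fit together, here is a minimal usage sketch of the Python runtime API. The engine path and tokenizer name are placeholders, and the exact keyword arguments accepted by generate() vary somewhat across releases, so treat this as illustrative rather than canonical:

  import torch
  from transformers import AutoTokenizer

  from tensorrt_llm.runtime import ModelRunner

  # Placeholder paths: point engine_dir at a directory produced by the
  # TensorRT-LLM build workflow, and load the tokenizer matching the model.
  engine_dir = "/path/to/engine_dir"
  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # ModelRunner wraps a generation session restored from the serialized engine.
  runner = ModelRunner.from_dir(engine_dir=engine_dir)

  # generate() expects a list of 1-D int32 tensors, one per batch entry.
  input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()

  with torch.no_grad():
      output_ids = runner.generate(
          batch_input_ids=[input_ids[0]],
          max_new_tokens=32,
          end_id=tokenizer.eos_token_id,
          pad_id=tokenizer.eos_token_id,
      )

  # Output shape is [batch, num_beams, seq_len]; decode the first beam.
  print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))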

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

All modules for which code is available

  • tensorrt_llm.functional
  • tensorrt_llm.layers.activation
  • tensorrt_llm.layers.attention
  • tensorrt_llm.layers.cast
  • tensorrt_llm.layers.conv
  • tensorrt_llm.layers.embedding
  • tensorrt_llm.layers.linear
  • tensorrt_llm.layers.mlp
  • tensorrt_llm.layers.normalization
  • tensorrt_llm.layers.pooling
  • tensorrt_llm.models.baichuan.model
  • tensorrt_llm.models.bert.model
  • tensorrt_llm.models.bloom.model
  • tensorrt_llm.models.chatglm.model
  • tensorrt_llm.models.enc_dec.model
  • tensorrt_llm.models.falcon.model
  • tensorrt_llm.models.gemma.model
  • tensorrt_llm.models.gpt.model
  • tensorrt_llm.models.gptj.model
  • tensorrt_llm.models.gptneox.model
  • tensorrt_llm.models.llama.model
  • tensorrt_llm.models.mamba.model
  • tensorrt_llm.models.medusa.model
  • tensorrt_llm.models.modeling_utils
  • tensorrt_llm.models.mpt.model
  • tensorrt_llm.models.opt.model
  • tensorrt_llm.models.phi.model
  • tensorrt_llm.models.quantized.quant
  • tensorrt_llm.models.qwen.model
  • tensorrt_llm.plugin.plugin
  • tensorrt_llm.quantization.mode
  • tensorrt_llm.quantization.quantize_by_ammo
  • tensorrt_llm.runtime.generation
  • tensorrt_llm.runtime.kv_cache_manager
  • tensorrt_llm.runtime.model_runner
  • tensorrt_llm.runtime.model_runner_cpp
  • tensorrt_llm.runtime.session

© Copyright 2023, NVIDIA.
