tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Key Features
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows

LLM API Examples

  • LLM Examples Introduction
  • Common Customizations
  • Examples
    • LLM Generate
    • LLM Generate Async
    • LLM Generate Async Streaming
    • LLM Generate Distributed
    • LLM Quantization
    • LLM Auto Parallel

LLM API

  • API Reference

Model Definition API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

C++ API

  • Executor
  • Runtime

Command-Line Reference

  • trtllm-build

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Graph Rewriting Module
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Responses
  • Run gpt-2b + LoRA using GptManager / cpp runtime
  • Expert Parallelism in TensorRT-LLM

Performance

  • Overview
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

Examples

Scripts

  • LLM Generate
  • LLM Generate Async
  • LLM Generate Async Streaming
  • LLM Generate Distributed
  • LLM Quantization
  • LLM Auto Parallel
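
The scripts above exercise the high-level LLM API. As an illustrative sketch only (not a copy of the bundled scripts), the snippet below shows the typical shape of an "LLM Generate" example, assuming the tensorrt_llm.LLM and SamplingParams entry points from the LLM API; the model name is a placeholder.

    # Minimal sketch of an "LLM Generate"-style script (illustrative; the
    # shipped example may differ). Assumes the high-level LLM API exposes
    # LLM and SamplingParams at the top level of the tensorrt_llm package.
    from tensorrt_llm import LLM, SamplingParams

    # Placeholder model; any supported Hugging Face checkpoint or
    # pre-built TensorRT-LLM engine directory could be used here.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Synchronous batched generation; see the async and streaming
    # examples listed above for non-blocking variants.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")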