tensorrt_llm

Getting Started

  • Overview
  • Quick Start Guide
  • Key Features
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Windows
  • Building from Source Code on Windows
  • Installing on Grace Hopper

LLM API

  • API Introduction
  • API Reference

LLM API Examples

  • LLM Examples Introduction
  • Common Customizations
  • Examples
    • Generate Text with Guided Decoding
    • Generate Text
    • Generate Text Asynchronously
    • Generate Text in Streaming Mode
    • Distributed LLM Generation
    • Control Generated Text with a Logits Post-Processor
    • Generate Text Using Lookahead Decoding
    • Generate Text Using Medusa Decoding
    • Generate Text with Multiple LoRA Adapters
    • Generation with Quantization
    • Automatic Parallelism with LLM

Model Definition API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

C++ API

  • Executor
  • Runtime

Command-Line Reference

  • trtllm-build
  • trtllm-serve

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • Compilation
  • Runtime
  • Multi-GPU and Multi-Node Support
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Executor API
  • Graph Rewriting Module
  • Inference Request
  • Responses
  • Run GPT-2B + LoRA Using GptManager / C++ Runtime
  • Expert Parallelism in TensorRT-LLM
  • KV Cache Reuse
  • Speculative Sampling

Performance

  • Overview
  • Benchmarking
  • Best Practices
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

Examples

Scripts

  • Generate Text with Guided Decoding
  • Generate Text
  • Generate Text Asynchronously
  • Generate Text in Streaming Mode
  • Distributed LLM Generation
  • Control Generated Text with a Logits Post-Processor
  • Generate Text Using Lookahead Decoding
  • Generate Text Using Medusa Decoding
  • Generate Text with Multiple LoRA Adapters
  • Generation with Quantization
  • Automatic Parallelism with LLM
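
These scripts all build on the same high-level LLM API workflow: construct an LLM, define SamplingParams, and call a generate method. The following is a minimal sketch of that shared pattern, assuming the documented LLM API quick-start usage; the model ID and prompts are illustrative placeholders, not values taken from this page.

    from tensorrt_llm import LLM, SamplingParams

    # Placeholder model: any Hugging Face checkpoint supported by
    # TensorRT-LLM (or a prebuilt engine directory) can be passed here.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # generate() runs the prompts as a batch and returns one result per prompt.
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")

The asynchronous and streaming scripts replace the blocking generate() call with its asynchronous counterpart, while most of the remaining scripts vary the LLM construction (quantization, LoRA adapters, parallelism) and the decoding configuration rather than this overall flow.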
