Welcome to TensorRT-LLM’s documentation!

Contents:

  • TensorRT-LLM Architecture
  • C++ GPT Runtime
  • The Batch Manager in TensorRT-LLM
  • Inference Request
  • Multi-head, Multi-query and Group-query Attention
  • Numerical Precision
  • Build from Source
  • Performance of TensorRT-LLM
  • How to debug
  • How to add a new model
  • Graph Rewriting Module
  • Memory Usage of TensorRT-LLM
  • New Workflow
  • Run gpt-2b + LoRA using GptManager / cpp runtime
  • Best Practices for Tuning the Performance of TensorRT-LLM
  • Performance Analysis of TensorRT-LLM

Python API

  • tensorrt_llm.layers

  • tensorrt_llm.functional

  • tensorrt_llm.models

  • tensorrt_llm.plugin

  • tensorrt_llm.quantization

  • tensorrt_llm.runtime
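
The modules above cover the main stages of the Python workflow, roughly: model definition (tensorrt_llm.models, tensorrt_llm.layers, tensorrt_llm.functional), build-time configuration (tensorrt_llm.plugin, tensorrt_llm.quantization), and execution (tensorrt_llm.runtime). For orientation, here is a minimal text-generation sketch using the high-level LLM API shipped in recent tensorrt_llm releases; the model name and sampling values are illustrative placeholders, not taken from this documentation.

    # Minimal sketch, assuming a recent tensorrt_llm release that exposes
    # the high-level LLM API; the model name and sampling values below are
    # placeholders chosen for illustration.
    from tensorrt_llm import LLM, SamplingParams

    def main():
        # Builds (or loads) a TensorRT engine for the model, then generates.
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95)
        for output in llm.generate(["Hello, my name is"], params):
            print(output.outputs[0].text)

    if __name__ == "__main__":
        main()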

C++ API

  • cpp/runtime

Indices and tables

  • Index
  • Module Index
  • Search Page

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM

© Copyright 2023, NVIDIA.
