tensorrt_llm

Contents:

  • TensorRT-LLM Architecture
  • C++ GPT Runtime
  • The Batch Manager in TensorRT-LLM
  • Multi-head, Multi-query and Group-query Attention
  • Numerical Precision
  • Performance of TensorRT-LLM
  • Build From Sources
  • How to debug
  • How to add a new model
  • Graph Rewriting Module

Python API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime
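
As a rough orientation, the sketch below shows how the Python API subpackages listed above are typically imported. It is an illustration based only on the module names indexed on this page, not an excerpt from the documentation; exact class names and signatures may differ between TensorRT-LLM releases.

    # Hypothetical orientation sketch, assuming the tensorrt_llm package is installed.
    # Subpackage names are taken from the module index on this page.
    import tensorrt_llm
    from tensorrt_llm import functional   # tensor-level ops used when defining networks
    from tensorrt_llm import layers       # building blocks: activation, attention, linear, ...
    from tensorrt_llm import models       # model definitions: gpt, llama, falcon, ...
    from tensorrt_llm import runtime      # generation, kv_cache_manager, session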

C++ API

  • Runtime
Overview: module code

All modules for which code is available

  • tensorrt_llm.functional
  • tensorrt_llm.layers.activation
  • tensorrt_llm.layers.attention
  • tensorrt_llm.layers.cast
  • tensorrt_llm.layers.conv
  • tensorrt_llm.layers.embedding
  • tensorrt_llm.layers.linear
  • tensorrt_llm.layers.mlp
  • tensorrt_llm.layers.normalization
  • tensorrt_llm.layers.pooling
  • tensorrt_llm.models.baichuan.model
  • tensorrt_llm.models.bert.model
  • tensorrt_llm.models.bloom.model
  • tensorrt_llm.models.chatglm2_6b.model
  • tensorrt_llm.models.chatglm6b.model
  • tensorrt_llm.models.enc_dec.model
  • tensorrt_llm.models.falcon.model
  • tensorrt_llm.models.gpt.model
  • tensorrt_llm.models.gptj.model
  • tensorrt_llm.models.gptneox.model
  • tensorrt_llm.models.llama.model
  • tensorrt_llm.models.opt.model
  • tensorrt_llm.models.quantized.quant
  • tensorrt_llm.quantization.mode
  • tensorrt_llm.runtime.generation
  • tensorrt_llm.runtime.kv_cache_manager
  • tensorrt_llm.runtime.session

© Copyright 2023, NVIDIA.
