tensorrt_llm

Contents:

  • TensorRT-LLM Architecture
  • C++ GPT Runtime
  • The Batch Manager in TensorRT-LLM
  • Multi-head, Multi-query and Group-query Attention
  • Numerical Precision
  • Performance of TensorRT-LLM
  • Build From Sources
  • How to debug
  • How to add a new model
  • Graph Rewriting Module

Python API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime
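
As a rough orientation, the sketch below shows how the Python API subpackages listed above are typically imported. It is an illustration based only on the module names indexed on this page, not an excerpt from the documentation; exact class names and signatures may differ between TensorRT-LLM releases.

    # Hypothetical orientation sketch, assuming the tensorrt_llm package is installed.
    # Subpackage names are taken from the module index on this page.
    import tensorrt_llm
    from tensorrt_llm import functional   # tensor-level ops used when defining networks
    from tensorrt_llm import layers       # building blocks: activation, attention, linear, ...
    from tensorrt_llm import models       # model definitions: gpt, llama, falcon, ...
    from tensorrt_llm import runtime      # generation, kv_cache_manager, session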

C++ API

  • Runtime
Overview: module code

All modules for which code is available

  • tensorrt_llm.functional
  • tensorrt_llm.layers.activation
  • tensorrt_llm.layers.attention
  • tensorrt_llm.layers.cast
  • tensorrt_llm.layers.conv
  • tensorrt_llm.layers.embedding
  • tensorrt_llm.layers.linear
  • tensorrt_llm.layers.mlp
  • tensorrt_llm.layers.normalization
  • tensorrt_llm.layers.pooling
  • tensorrt_llm.models.baichuan.model
  • tensorrt_llm.models.bert.model
  • tensorrt_llm.models.bloom.model
  • tensorrt_llm.models.chatglm2_6b.model
  • tensorrt_llm.models.chatglm6b.model
  • tensorrt_llm.models.enc_dec.model
  • tensorrt_llm.models.falcon.model
  • tensorrt_llm.models.gpt.model
  • tensorrt_llm.models.gptj.model
  • tensorrt_llm.models.gptneox.model
  • tensorrt_llm.models.llama.model
  • tensorrt_llm.models.opt.model
  • tensorrt_llm.models.quantized.quant
  • tensorrt_llm.quantization.mode
  • tensorrt_llm.runtime.generation
  • tensorrt_llm.runtime.kv_cache_manager
  • tensorrt_llm.runtime.session

© Copyright 2023, NVIDIA.
