
Quantization

class tensorrt_llm.quantization.QuantMode(value)

Bases: IntFlag

An enumeration.
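
Because QuantMode derives from IntFlag, its members behave like integers that can be combined and tested with bitwise operators. This page does not list the members of QuantMode, so the sketch below uses a hypothetical stand-in class, DemoQuantMode, with made-up member names. It illustrates only the standard-library IntFlag behaviour and is not the actual QuantMode definition.

```python
from enum import IntFlag, auto


class DemoQuantMode(IntFlag):
    """Hypothetical stand-in for an IntFlag-based mode enum.

    The member names below are illustrative assumptions; they are not
    taken from tensorrt_llm.quantization.QuantMode.
    """
    INT4_WEIGHTS = auto()   # bit 0
    INT8_WEIGHTS = auto()   # bit 1
    ACTIVATIONS = auto()    # bit 2


def has_flag(mode: DemoQuantMode, flag: DemoQuantMode) -> bool:
    """Return True if every bit set in `flag` is also set in `mode`."""
    return (mode & flag) == flag


# Combine two flags with bitwise OR; the result is still a DemoQuantMode.
mode = DemoQuantMode.INT8_WEIGHTS | DemoQuantMode.ACTIVATIONS

# The class can also be called with a raw integer, mirroring the
# `(value)` constructor form shown in the class signature above.
same_mode = DemoQuantMode(int(mode))

print(has_flag(mode, DemoQuantMode.INT8_WEIGHTS))
print(has_flag(mode, DemoQuantMode.INT4_WEIGHTS))
print(same_mode == mode)
```

Storing several independent on/off options in one integer is the usual reason to choose IntFlag: the combined value can be passed around, serialized, or compared as a plain int while individual options remain testable by name.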


© Copyright 2023, NVidia.
