tensorrt_llm
Contents:
TensorRT-LLM Architecture
C++ GPT Runtime
The Batch Manager in TensorRT-LLM
Multi-head, Multi-query and Group-query Attention
Numerical Precision
TensorRT-LLM Installation
Performance of TensorRT-LLM
How to debug
How to add a new model
Graph Rewriting Module
Memory Usage of TensorRT-LLM
New Workflow
Python API
Layers
Functionals
Models
Plugin
Quantization
Runtime
C++ API
Runtime
Blogs
H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
Plugin