Welcome to TensorRT-LLM’s documentation!
Contents:
- TensorRT-LLM Architecture
- C++ GPT Runtime
- The Batch Manager in TensorRT-LLM
- Inference Request
- Multi-head, Multi-query, and Group-query Attention
- Numerical Precision
- Build from Source
- Performance of TensorRT-LLM
- How to Debug
- How to Add a New Model
- Graph Rewriting Module
- Memory Usage of TensorRT-LLM
- New Workflow
- Run gpt-2b + LoRA Using GptManager / C++ Runtime
- Best Practices for Tuning the Performance of TensorRT-LLM
- Performance Analysis of TensorRT-LLM