
(product-overview)=

# Overview

## About TensorRT-LLM

TensorRT-LLM is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.

## Key Capabilities

### 🔥 Architected on PyTorch

TensorRT-LLM provides a high-level Python LLM API that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo and the Triton Inference Server.
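As a quick illustration, here is a minimal sketch of generating text with the LLM API; the model name and sampling settings are placeholders, and exact arguments may vary between releases:

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal LLM API sketch (model name and sampling values are illustrative).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(max_tokens=64, temperature=0.8)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```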

TensorRT-LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using native PyTorch code, making it easy to adapt the system to specific needs.

### State-of-the-Art Performance

TensorRT-LLM delivers breakthrough performance on the latest NVIDIA GPUs.

### 🎯 Comprehensive Model Support

TensorRT-LLM supports the latest and most popular LLM architectures:

- Language Models: GPT-OSS, DeepSeek-R1/V3, Llama 3/4, Qwen2/3, Gemma 3, Phi 4...
- Multi-modal Models: LLaVA-NeXT, Qwen2-VL, VILA, Llama 3.2 Vision...

TensorRT-LLM strives to support the most popular models on Day 0.

### 🚀 Advanced Optimization & Production Features

- In-Flight Batching & Paged Attention: {ref}`inflight-batching` eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
- Multi-GPU Multi-Node Inference: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API (see the sketch after this list).
- Advanced Quantization:
  - FP4 Quantization: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
  - FP8 Quantization: Automatic conversion on NVIDIA H100 GPUs leveraging the Hopper architecture
- Speculative Decoding: Multiple algorithms, including EAGLE, MTP, and NGram
- KV Cache Management: Paged KV cache with intelligent block reuse and memory optimization
- Chunked Prefill: Efficient handling of long sequences by splitting the context into manageable chunks
- LoRA Support: Multi-adapter support with HuggingFace and NeMo formats for efficient fine-tuning and adaptation
- Checkpoint Loading: Flexible model loading from various formats (HuggingFace, NeMo, custom)
- Guided Decoding: Advanced sampling with stop words, bad words, and custom constraints
- Disaggregated Serving (Beta): Separate context and generation phases across different GPUs for optimal resource utilization
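As a rough sketch of how some of these features are enabled through the LLM API (the model name is illustrative, and parameter names follow the public API but may differ across versions):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Sketch: multi-GPU tensor parallelism plus a paged KV cache with block reuse.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative checkpoint
    tensor_parallel_size=4,                     # shard the model across 4 GPUs
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=True,                # reuse KV blocks across requests
        free_gpu_memory_fraction=0.9,           # memory budget for the paged KV cache
    ),
)
```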

### 🔧 Latest GPU Architecture Support

TensorRT-LLM supports the full spectrum of NVIDIA GPU architectures:

- NVIDIA Blackwell: B200, GB200, RTX Pro 6000 SE with FP4 optimization
- NVIDIA Hopper: H100, H200, GH200 with FP8 acceleration
- NVIDIA Ada Lovelace: L40/L40S, RTX 40 series with FP8 acceleration
- NVIDIA Ampere: A100, RTX 30 series for production workloads

## What Can You Do With TensorRT-LLM?

Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT-LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI. To get started, refer to the {ref}`quick-start-guide`.