From 65b793c77ea48de447bc99657a0cdb514b69fde8 Mon Sep 17 00:00:00 2001
From: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Date: Mon, 3 Nov 2025 18:06:04 +0800
Subject: [PATCH] [None][doc] Add the missing content for model support section and fix valid links for long_sequence.md (#8869)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
---
 docs/source/features/long-sequence.md | 4 ++--
 docs/source/overview.md               | 5 +++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/source/features/long-sequence.md b/docs/source/features/long-sequence.md
index 37d8b8478a..61ea031541 100644
--- a/docs/source/features/long-sequence.md
+++ b/docs/source/features/long-sequence.md
@@ -26,7 +26,7 @@ Note that if chunked context is enabled, please set the `max_num_tokens` to be a
- feat_long_seq_chunked_attention
+ feat_long_seq_chunked_attention

Figure 1. Illustration of chunked attention

@@ -43,7 +43,7 @@ Note that chunked attention can only be applied to context requests.
- feat_long_seq_sliding_win_attn
+ feat_long_seq_sliding_win_attn

Figure 2. Illustration of sliding window attention

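The long-sequence hunks above reference chunked context and note that `max_num_tokens` should be set accordingly. As a minimal sketch only (the model ID is illustrative and the keyword arguments are assumptions based on the PyTorch-backend `LLM` constructor, not part of this patch), chunked prefill might be enabled through the LLM API like this:

```python
from tensorrt_llm import LLM

# Sketch only: enable chunked context so long prompts are prefilled in chunks.
# `max_num_tokens` caps the tokens scheduled per iteration (the chunk budget).
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative checkpoint, not from this patch
    enable_chunked_prefill=True,
    max_num_tokens=8192,
)

print(llm.generate(["Summarize the following long document: ..."])[0].outputs[0].text)
```

With chunked context enabled, a prompt longer than the per-iteration token budget is processed over several scheduler iterations instead of a single context pass.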
diff --git a/docs/source/overview.md b/docs/source/overview.md
index dc77d1242d..fe44002b16 100644
--- a/docs/source/overview.md
+++ b/docs/source/overview.md
@@ -25,6 +25,11 @@ TensorRT LLM delivers breakthrough performance on the latest NVIDIA GPUs:
 
 TensorRT LLM supports the latest and most popular LLM architectures:
 
+- **Language Models**: GPT-OSS, Deepseek-R1/V3, Llama 3/4, Qwen2/3, Gemma 3, Phi 4...
+- **Multi-modal Models**: LLaVA-NeXT, Qwen2-VL, VILA, Llama 3.2 Vision...
+
+TensorRT LLM strives to support the most popular models on **Day 0**.
+
 ### FP4 Support
 
 [NVIDIA B200 GPUs](https://www.nvidia.com/en-us/data-center/dgx-b200/), when used with TensorRT LLM, enable seamless loading of model weights in the new [FP4 format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/#what_is_nvfp4), allowing you to automatically leverage optimized FP4 kernels for efficient and accurate low-precision inference.
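For the FP4 support paragraph added above, a minimal sketch of loading an NVFP4-quantized checkpoint and generating with the LLM API (the model ID is an illustrative assumption; any NVFP4 checkpoint would work the same way):

```python
from tensorrt_llm import LLM, SamplingParams

# Sketch only: point the LLM API at an NVFP4-quantized checkpoint; on B200-class
# GPUs the FP4 weights are loaded directly and served with optimized FP4 kernels.
llm = LLM(model="nvidia/Llama-3.3-70B-Instruct-FP4")  # illustrative NVFP4 checkpoint

sampling_params = SamplingParams(max_tokens=64, temperature=0.8)
for output in llm.generate(["The capital of France is"], sampling_params):
    print(output.outputs[0].text)
```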