[TRTLLM-10021][docs] Skip Softmax Attention blog and docs. (#10592)

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
This commit is contained in:
Bo Li 2026-02-06 12:11:21 +08:00 committed by GitHub
parent 2e6d9350fa
commit 639051e98b
8 changed files with 429 additions and 58 deletions


@ -23,6 +23,12 @@ This branch is a prototype and not stable for production use. PRs are not accept
## Tech Blogs
* [02/06] Accelerating Long-Context Inference with Skip Softmax Attention
✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.html)
* [01/09] Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs
✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs)
* [10/13] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog14_Scaling_Expert_Parallelism_in_TensorRT-LLM_part3.html)

(5 binary image files added, not shown: 84 KiB, 89 KiB, 168 KiB, 82 KiB, 82 KiB.)


@ -0,0 +1,276 @@
# Accelerating Long-Context Inference with Skip Softmax Attention
As context lengths grow from thousands to hundreds of thousands of tokens, attention computation becomes a major bottleneck in long-context LLM inference. TensorRT-LLM provides a [sparse attention framework](../../features/sparse-attention.md#framework-level-sparse-attention) that supports techniques like KV cache compression and sparse pattern prediction, featured in [RocketKV](https://arxiv.org/pdf/2502.14051) and [DSA](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf). However, these techniques require framework-level support: compared to the vanilla attention architecture, they need additional token selection steps, auxiliary data structures, and kernel modifications. This complexity introduces **runtime overhead** that can offset performance gains, particularly when context lengths are not long enough to amortize the extra work.
In this blog, we introduce **Skip Softmax Attention**, a drop-in sparse attention technique designed to accelerate existing pretrained models that use standard attention mechanisms such as MHA, GQA, or MLA. Skip Softmax Attention is built on top of the Flash Attention algorithm and only requires modifying the existing **attention kernels**. Thanks to this simplicity, the end-to-end performance gain is more predictable. In addition, because it only approximates the attention kernel computation, it is compatible with nearly all other features, such as FP8 attention, KV cache reuse, and chunked prefill.
## Table of Contents
- [Accelerating Long-Context Inference with Skip Softmax Attention](#accelerating-long-context-inference-with-skip-softmax-attention)
- [Table of Contents](#table-of-contents)
- [Method Overview](#method-overview)
- [Example Usage](#example-usage)
- [Accuracy Evaluation](#accuracy-evaluation)
- [Performance Benchmark](#performance-benchmark)
- [Kernel Performance](#kernel-performance)
- [End-to-end Performance](#end-to-end-performance)
- [Reproduction](#reproduction)
- [Accuracy evaluation (LongBench V1/V2)](#accuracy-evaluation-longbench-v1v2)
- [End-to-end performance (TTFT/TPOT)](#end-to-end-performance-ttfttpot)
- [Conclusion](#conclusion)
## Method Overview
The idea of Skip Softmax Attention is to compare the local maximum $\tilde{m}_i^{(j)}$ of $Q \cdot K^T$ with the running global maximum $m_i^{(j)}$, and skip the softmax (exp) and BMM2 calculation for blocks that are below a certain threshold $\lambda$:
$$\tilde{m}_i^{(j)} - m_i^{(j)} < \lambda$$
In this way, we can indirectly control the sparsity via the threshold. The threshold is set to be inversely proportional to the context length, i.e., the longer the context, the smaller the threshold needed to achieve the same sparsity.
The method is fully dynamic and can be applied to both prefilling and decoding. The algorithm is described in the paper [BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding](https://arxiv.org/pdf/2512.12087), and we have also published a [Developer Blog](https://developer.nvidia.com/blog/accelerating-long-context-inference-with-skip-softmax-in-nvidia-tensorrt-llm/) explaining it. Please refer to these resources for an in-depth dive into the algorithm; here we focus on applying Skip Softmax Attention in TensorRT-LLM to accelerate long-context inference.
<p align="center">
<img src="../media/tech_blog16_blasst.jpg" alt="BLASST Illustration" style="width: 50%; min-width: 300px; display: block; margin: auto;" />
</p>
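
To make the criterion concrete, below is a minimal NumPy sketch of how it plugs into a Flash-Attention-style online-softmax loop. This is illustrative only: it processes a single query vector and head and treats the threshold `lam` as a given margin, whereas the production kernels operate on tiles and derive the threshold from `threshold_scale_factor` and the context length (see [Example Usage](#example-usage)).

```python
import numpy as np

def skip_softmax_attention(q, k, v, lam, block_size=128):
    """Single-query, single-head reference: online softmax over KV blocks,
    skipping the softmax (exp) and BMM2 work for blocks whose local max score
    falls below the running global max by more than the threshold.
    q: [d], k/v: [T, d], lam: threshold (illustrative)."""
    m = -np.inf                       # running global max of Q.K^T
    denom = 0.0                       # running softmax denominator
    acc = np.zeros(v.shape[1])        # running (unnormalized) output
    skipped, total = 0, 0
    for s0 in range(0, k.shape[0], block_size):
        total += 1
        scores = q @ k[s0:s0 + block_size].T   # BMM1 is always computed
        m_blk = scores.max()                   # local block maximum
        if m_blk - m < lam:                    # skip criterion from the inequality above
            skipped += 1
            continue
        m_new = max(m, m_blk)
        scale = np.exp(m - m_new)              # rescale previously accumulated results
        p = np.exp(scores - m_new)             # softmax numerators for this block
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v[s0:s0 + block_size]
        m = m_new
    return acc / denom, skipped / total        # output and achieved block sparsity
```

Skipped blocks still pay for the BMM1 (score) computation, which is why the kernel-level speedup is capped even at very high sparsity (see the [Conclusion](#conclusion)).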
## Example Usage
Enabling Skip Softmax Attention is simple: configure a `SkipSoftmaxAttentionConfig` and pass it to the `LLM` API:
```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import SkipSoftmaxAttentionConfig
sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor=1000.0)

# Alternatively, the threshold_scale_factor can be configured separately for prefill and decode.
sparse_attention_config = SkipSoftmaxAttentionConfig(
    threshold_scale_factor={"prefill": 1000.0, "decode": 500.0}
)

llm = LLM(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    sparse_attention_config=sparse_attention_config,
    # Other LLM arguments...
)
```
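
The configured `LLM` object is then used exactly like one with dense attention. A minimal usage sketch, reusing the `llm` created above:

```python
from tensorrt_llm import SamplingParams

prompts = ["Summarize the following document: ..."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```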
The configuration can also be specified through the extra LLM API options YAML file. An example of launching an OpenAI-compatible endpoint is shown below:
```bash
cat >extra_llm_api_options.yaml <<EOF
sparse_attention_config:
  algorithm: skip_softmax
  threshold_scale_factor: 1000.0
EOF

# Alternatively, the threshold_scale_factor can be configured separately for prefill and decode.
cat >extra_llm_api_options.yaml <<EOF
sparse_attention_config:
  algorithm: skip_softmax
  threshold_scale_factor:
    prefill: 1000.0
    decode: 500.0
EOF
trtllm-serve Qwen/Qwen3-30B-A3B-Instruct-2507 --extra_llm_api_options extra_llm_api_options.yaml
```
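
Once the server is up, the endpoint can be queried with any OpenAI-compatible client. A minimal sketch using the `openai` Python package, assuming the server listens on the default local address `http://localhost:8000/v1`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[{"role": "user", "content": "Summarize this long report: ..."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```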
The actual threshold value equals the `threshold_scale_factor` divided by the context length. [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) will support calibration to automatically determine the required value for a given target sparsity. We use `Qwen3-30B-A3B-Instruct-2507` as the example model for testing; the calibrated threshold scale factors are listed below:
| Target Sparsity | Threshold Scale Factor (Prefill) | Threshold Scale Factor (Decode) |
|:---------------:|:----------------------------:|:----------------------------:|
| 0.0 | 0.0 | 0.0 |
| 0.1 | 18.76 | 0.32 |
| 0.2 | 44.37 | 0.86 |
| 0.3 | 104.97 | 2.30 |
| 0.4 | 248.40 | 6.17 |
| 0.5 | 587.18 | 16.52 |
| 0.6 | 1390.63 | 44.26 |
| 0.7 | 3293.04 | 118.62 |
| 0.8 | 7799.91 | 317.99 |
| 0.9 | 18471.56 | 852.20 |
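
As a concrete example of the mapping described above, the following snippet computes the effective threshold from a calibrated scale factor (here the prefill factor for 0.5 target sparsity) and the context length:

```python
# Effective threshold = threshold_scale_factor / context_length.
threshold_scale_factor = 587.18      # prefill factor calibrated for 0.5 target sparsity
context_length = 16 * 1024           # 16k-token prompt
effective_threshold = threshold_scale_factor / context_length
print(f"{effective_threshold:.4f}")  # ~0.0358; a 64k-token prompt would give ~0.0090
```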
## Accuracy Evaluation
We evaluate the accuracy of Skip Softmax Attention on LongBench V1 and V2. LongBench V1 is a comprehensive benchmark for medium-to-long context understanding, with an average sequence length of about 10k tokens. LongBench V2 is a harder benchmark with longer sequences; we use its `medium` subset and truncate prompts to 256k tokens due to the model's native context window limit. The average sequence length of LongBench V2 is about 130k tokens.
The evaluation results (on H200) are summarized in the table below:
| Target Sparsity | LongBench V1 Overall Accuracy | LongBench V2 Overall Accuracy |
|:---------------:|:----------------------------:|:----------------------------:|
| 0.0 | 47.79 | 35.81 |
| 0.1 | 47.92 | 40.00 |
| 0.2 | 47.80 | 38.14 |
| 0.3 | 47.84 | 38.60 |
| 0.4 | 47.84 | 39.53 |
| 0.5 | 46.87 | 37.67 |
| 0.6 | 46.45 | 36.28 |
| 0.7 | 45.56 | 34.88 |
| 0.8 | 43.38 | 36.74 |
| 0.9 | 39.84 | 39.07 |
Note that LongBench V2 contains very few samples (~200), so the results are subject to large variance and you may see a non-monotonic relationship between sparsity and accuracy. We recommend looking at LongBench V1 (~5000 samples) to inspect the accuracy-loss trend.
## Performance Benchmark
Skip Softmax Attention is supported on both Hopper and Blackwell GPUs and builds on top of the SoTA performance of the TensorRT-LLM's attention kernels. Hopper prefilling is implemented in [fmha_v2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/kernels/fmha_v2), Hopper decoding is implemented in [XQA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/kernels/xqa), and Blackwell is implemented in [trtllm-gen](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/kernels/trtllmGenKernels).
### Kernel Performance
We provide performance data for the attention kernels at different achieved sparsity levels, controlled via the threshold. The micro-benchmark uses the following configuration: q_heads=64, kv_heads=4, head_dim=128, seqlen=16k/64k. Both BF16 and FP8 attention are supported. For prefill, the batch size is 1; for decode, the batch size is 64.
As a reference, the baseline performance data **without** Skip Softmax Attention are listed below.
**Prefill Baseline (we report compute performance in TFLOP/s):**
| GPU | Seqlen | Precision | TFLOP/s | Duration (µs) |
|:---:|:-----:|:---------:|--------:|--------------:|
| H200 | 16k | BF16 | 594.05 | 7403 |
| H200 | 16k | FP8 | 852.81 | 5157 |
| H200 | 64k | BF16 | 610.30 | 115301 |
| H200 | 64k | FP8 | 873.60 | 80550 |
| B200 | 16k | BF16 | 1029.13 | 4273 |
| B200 | 16k | FP8 | 1523.57 | 2886 |
| B200 | 64k | BF16 | 1038.26 | 67775 |
| B200 | 64k | FP8 | 1621.41 | 43399 |
**Decode Baseline (we report memory bandwidth in TB/s):**
| GPU | Seqlen | Precision | TB/s | Duration (µs) |
|:---:|:-----:|:---------:|-----------------:|--------------:|
| H200 | 16k | BF16 | 4.31 | 498 |
| H200 | 16k | FP8 | 4.03 | 266 |
| H200 | 64k | BF16 | 4.37 | 1962 |
| H200 | 64k | FP8 | 4.10 | 1045 |
| B200 | 16k | BF16 | 7.08 | 303 |
| B200 | 16k | FP8 | 5.46 | 196 |
| B200 | 64k | BF16 | 7.10 | 1209 |
| B200 | 64k | FP8 | 5.68 | 755 |
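
As a rough cross-check, the durations above are consistent with simple FLOP and byte accounting for this benchmark configuration, assuming causal prefill performs about $2 \cdot S^2 \cdot d \cdot H_q$ FLOPs and decode traffic is dominated by reading K and V for every cached token:

```python
# Cross-check of the H200 BF16, seqlen=16k baseline rows.
S, H_Q, H_KV, D = 16 * 1024, 64, 4, 128

prefill_flops = 2 * S**2 * D * H_Q        # BMM1 + BMM2, halved for causal masking
print(prefill_flops / 594.05e12 * 1e6)    # ~7403 us, matches the prefill table

decode_bytes = 64 * S * H_KV * D * 2 * 2  # batch=64, K and V, 2 bytes per BF16 element
print(decode_bytes / 4.31e12 * 1e6)       # ~498 us, matches the decode table
```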
The following figures plot **speedup vs. achieved sparsity**, on top of the baseline performance:
<table style="width: 100%; border: 0;">
<tr>
<td style="width: 50%; padding: 0 8px; vertical-align: top;">
<p align="center"><b>Hopper (H200)</b></p>
<p align="center"><b>Prefill</b></p>
<img src="../media/tech_blog16_hopper_prefill.png" alt="Hopper prefill kernel" style="width: 100%; min-width: 280px; display: block; margin: auto;" />
<p align="center"><b>Decode</b></p>
<img src="../media/tech_blog16_hopper_decode.png" alt="Hopper decode kernel" style="width: 100%; min-width: 280px; display: block; margin: auto;" />
</td>
<td style="width: 50%; padding: 0 8px; vertical-align: top;">
<p align="center"><b>Blackwell (B200)</b></p>
<p align="center"><b>Prefill</b></p>
<img src="../media/tech_blog16_blackwell_prefill.png" alt="Blackwell prefill kernel" style="width: 100%; min-width: 280px; display: block; margin: auto;" />
<p align="center"><b>Decode</b></p>
<img src="../media/tech_blog16_blackwell_decode.png" alt="Blackwell decode kernel" style="width: 100%; min-width: 280px; display: block; margin: auto;" />
</td>
</tr>
</table>
Skip Softmax Attention can further boost the performance of FP8 attention, though the gain is less significant than for BF16.
### End-to-end Performance
We benchmark the end-to-end performance to demonstrate the benefit of Skip Softmax Attention. Due to the quadratic complexity of attention, the TTFT in long-context scenarios is often a severe blocker for real-world usage. Skip Softmax Attention can significantly reduce the TTFT by accelerating the prefill kernel, and the TPOT can also be reduced if the context length is long enough. The experiment is conducted on a single H200 or B200 GPU, using the exact same dataset as the accuracy evaluation.
**LongBench V1, avg ISL=10k, OSL=6:**
| Target Sparsity | TTFT/ms (H200) | TPOT/ms (H200) | TTFT/ms (B200) | TPOT/ms (B200) |
|:--------------:|------------------:|-----------------:|--------------------:|--------------------:|
| 0.0 | 9419.61 | 1731.80 | 4854.55 | 928.45 |
| 0.1 | 9519.40 | 1746.73 | 4758.06 | 909.08 |
| 0.2 | 9417.36 | 1729.74 | 4794.23 | 916.64 |
| 0.3 | 9304.48 | 1711.27 | 4770.26 | 913.51 |
| 0.4 | 9139.85 | 1684.78 | 4672.09 | 896.25 |
| 0.5 | 8847.43 | 1633.08 | 4548.07 | 873.80 |
| 0.6 | 8437.45 | 1560.64 | 4459.08 | 858.60 |
| 0.7 | 8134.72 | 1508.60 | 4385.12 | 846.64 |
| 0.8 | 8107.73 | 1507.82 | 4348.80 | 831.88 |
| 0.9 | 8130.39 | 1516.16 | 4150.44 | 798.93 |
LongBench V1 results are reported at concurrency 64. Due to the nature of in-flight batching, decoding requests may be piggybacked onto prefilling requests, so the TPOT is relatively high.
**LongBench V2, avg ISL=130k, OSL=200:**
| Target Sparsity | TTFT/ms (H200) | TPOT/ms (H200) | TTFT/ms (B200) | TPOT/ms (B200) |
|:--------------:|------------------:|-----------------:|--------------------:|--------------------:|
| 0.0 | 16486.70 | 9.34 | 6990.59 | 6.30 |
| 0.1 | 16487.54 | 8.61 | 7024.50 | 6.30 |
| 0.2 | 16169.69 | 8.61 | 6687.21 | 6.34 |
| 0.3 | 15750.17 | 8.46 | 6616.12 | 6.33 |
| 0.4 | 15288.68 | 8.61 | 6432.32 | 6.27 |
| 0.5 | 14554.04 | 8.45 | 6193.92 | 6.29 |
| 0.6 | 14323.08 | 8.44 | 5966.53 | 6.32 |
| 0.7 | 13871.32 | 8.42 | 5769.19 | 6.31 |
| 0.8 | 12922.99 | 8.58 | 5605.66 | 6.23 |
| 0.9 | 12507.95 | 8.58 | 5276.67 | 6.29 |
Due to the extremely long context length, we only run LongBench V2 at concurrency 1. In this scenario, prefilling and decoding are better separated, and we can observe how TTFT and TPOT are affected by sparsity. Note that the decoding speedup is less pronounced at small batch sizes. Small batch sizes and small head counts (with TP) are closer to real-world long-context serving due to SLO limits, and we are actively optimizing decoding performance for such scenarios.
## Reproduction
We provide the commands to reproduce the results above, as a showcase of how to evaluate accuracy and benchmark performance with Skip Softmax Attention.
### Accuracy evaluation (LongBench V1/V2)
Both LongBench V1 and V2 are integrated into the TensorRT-LLM accuracy test suite, `trtllm-eval`. Here are the example scripts to run the accuracy evaluation:
```bash
# Dump the extra LLM API options YAML file.
cat >extra_llm_api_options.yaml <<EOF
kv_cache_config:
  free_gpu_memory_fraction: 0.8
sparse_attention_config:
  algorithm: skip_softmax
  threshold_scale_factor:
    prefill: ${thr_prefill}
    decode: ${thr_decode}
EOF
```
```bash
# Evaluate LongBench V1.
trtllm-eval --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --max_batch_size 64 --max_num_tokens 100000 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    longbench_v1 \
    --output_dir ${OUTPUT_DIR}  # Dump the dataset for perf benchmarking
```
```bash
# Evaluate LongBench V2.
# --length medium: use the medium subset of LongBench V2.
# --max_input_length 256000: truncate the prompt length to 256k.
trtllm-eval --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --max_batch_size 1 --max_num_tokens 262144 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    longbench_v2 \
    --length medium \
    --max_input_length 256000 \
    --output_dir ${OUTPUT_DIR}  # Dump the dataset for perf benchmarking
```
### End-to-end performance (TTFT/TPOT)
The `--output_dir` option of `trtllm-eval` dumps the dataset as `dumped_ids.json` in the format required by `trtllm-bench`. With this data, we can run end-to-end benchmarks:
```bash
# Benchmark on LongBench V1.
trtllm-bench --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    throughput --dataset ${OUTPUT_DIR}/dumped_ids.json \
    --concurrency 64 --max_batch_size 64 --max_num_tokens 100000 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    --warmup 0 --streaming \
    --report_json ${OUTPUT_DIR}/report.json
```
```bash
# Benchmark on LongBench V2.
trtllm-bench --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    throughput --dataset ${OUTPUT_DIR}/dumped_ids.json \
    --concurrency 1 --max_batch_size 1 --max_num_tokens 262144 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    --warmup 0 --streaming \
    --report_json ${OUTPUT_DIR}/report.json
```
## Conclusion
Skip Softmax Attention is a kernel-level solution for accelerating attention. Because BMM1 ($Q \cdot K^T$) is not skipped, the kernel-level performance gain is capped at about 1.8x. Nevertheless, it excels at achieving high sparsity with minimal accuracy degradation, and it is especially effective in medium-to-long context scenarios that previous methods like MInference cannot handle well, because their runtime overhead may not pay off against the attention kernel speedup. The drop-in nature of Skip Softmax Attention makes it a flexible, easy-to-use method for accelerating long-context inference. The Skip Softmax Attention kernels will also be available in FlashInfer for adoption by the open-source community.

View File

@ -1,20 +1,23 @@
# Sparse Attention
- [Background and Motivation](#background-and-motivation)
- [Algorithm Overview](#algorithm-overview)
- [Quick Start](#quick-start)
- [Python API](#python-api)
- [Usage with trtllm-bench or trtllm-serve](#usage-with-trtllm-bench-or-trtllm-serve)
- [Developer Guide](#developer-guide)
- [Architecture Overview](#architecture-overview)
- [Framework Implementation](#framework-implementation)
- [Implementing a New Algorithm](#implementing-a-new-algorithm)
- [1. Configuration Class](#1-configuration-class)
- [2. Implement the prediction module in Attention Backend](#2-implement-the-prediction-module-in-attention-backend)
- [3. Manage Auxiliary Memory Pool](#3-manage-auxiliary-memory-pool)
- [4. Registration and Dispatch](#4-registration-and-dispatch)
- [Summary and Future Work](#summary-and-future-work)
- [Current Status](#current-status)
- [Future Work](#future-work)
- [Configure via YAML file](#configure-via-yaml-file)
- [Sparse Attention Implementation](#sparse-attention-implementation)
- [Framework-Level Sparse Attention](#framework-level-sparse-attention)
- [Overview](#overview)
- [Architecture](#architecture)
- [Framework Implementation](#framework-implementation)
- [Implementing a New Algorithm](#implementing-a-new-algorithm)
- [1. Configuration Class](#1-configuration-class)
- [2. Implement the prediction module in Attention Backend](#2-implement-the-prediction-module-in-attention-backend)
- [3. Manage Auxiliary Memory Pool](#3-manage-auxiliary-memory-pool)
- [4. Registration and Dispatch](#4-registration-and-dispatch)
- [Future Work](#future-work)
- [Kernel-Level Sparse Attention](#kernel-level-sparse-attention)
- [Summary](#summary)
## Background and Motivation
@ -23,44 +26,45 @@ As Large Language Models (LLMs) are applied to increasingly complex tasks such a
* **Context Phase**: Processing long prompts requires substantial memory bandwidth and computation, affecting time-to-first-token (TTFT). Since the context phase is typically compute-bound, reducing the computational load here is critical.
* **Generation Phase**: The Key-Value (KV) cache grows with every generated token, consuming vast amounts of GPU memory and bandwidth. Since the generation phase is usually memory-bound, reducing the memory footprint directly alleviates memory pressure, improves token-to-token latency (TPOT), and allows for larger batch sizes.
Fortunately, key observations indicate that attention scores naturally exhibit sparsity, meaning not all K/V tokens are necessary for attention computation. To enhance the efficiency of long-sequence LLMs, numerous methods have been proposed to optimize performance by leveraging approximate sparse attention. Among these methods, sparsity can be applied to different dimensions of the attention: the head dimension, the hidden dimension, and the sequence dimension. When sparsity is applied to the sequence dimension, the methods selectively compute only the most important query-key pairs; this approach is referred to as token sparsity. Token sparsity has been widely explored in recent academic work, and it is also a structured form of sparsity that is GPU-friendly. Currently, TensorRT LLM focuses on sparse attention methods that leverage token sparsity.
Sparse attention methods aim to exploit structured sparsity in attention. In particular, it is common to exploit token sparsity in the sequence dimension so that computation concentrates on the most important query-key pairs. The goal of sparse attention is to accelerate long-context inference while balancing performance gains against acceptable approximation error and system complexity.
Token sparsity can be applied to two distinct aspects of LLM inference:
* **Sparse Computation**: If a query token does not require the entire history, just skip the computation for irrelevant tokens, thereby reducing attention computational costs.
* **Sparse KV cache**: Evicts KV tokens from the cache that are not required for future generation steps. This reduces GPU memory usage and lowers computation overhead for subsequent steps.
## Algorithm Overview
The design space of sparse attention is quite large, so we cannot assume there is a single implementation strategy that covers all variants. TensorRT LLM uses `sparse_attention_config` in the `LLM` API as a unified interface for **describing and enabling** different sparse attention algorithms, while allowing each technique to choose the most suitable implementation path. Each *algorithm* has its own configuration class inheriting from `BaseSparseAttentionConfig`.
Both methods can be enabled simultaneously to achieve better performance.
TensorRT LLM currently exposes the following algorithms differentiated by `sparse_attention_config.algorithm`:
To support these emerging techniques, TensorRT LLM has designed a general, extensible and flexible **sparse attention framework** (which is continuously being optimized) to compatibly integrate advanced sparse algorithms. Currently we can support [RocketKV](https://arxiv.org/pdf/2502.14051) and [DSA](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf).
- **RocketKV** (`algorithm: rocket`, `RocketSparseAttentionConfig`, [ref](https://arxiv.org/pdf/2502.14051)): A two-stage algorithm, where the first stage performs permanent KV cache eviction and the second stage performs dynamic token selection.
- **DeepSeek Sparse Attention (DSA)** (`algorithm: dsa`, `DeepSeekSparseAttentionConfig`, [ref](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf)): DeepSeek's model-native sparse attention solution, introduced in DeepSeek V3.2.
- **Skip Softmax Attention (BLASST)** (`algorithm: skip_softmax`, `SkipSoftmaxAttentionConfig`, [ref](https://arxiv.org/pdf/2512.12087)): A drop-in method that dynamically skips Softmax and BMM2 work for unimportant KV blocks and can be implemented entirely inside the attention kernels.
## Quick Start
This section provides a brief guide on enabling sparse attention in TensorRT LLM, using RocketKV as an example. For more details, please refer to [RocketKV sparse attention](../../examples/sparse_attention/RocketKV.md).
This section shows how to enable sparse attention through the `LLM` API or YAML config.
### Python API
To use sparse attention, you need to configure a specific `SparseAttentionConfig` (for example, `RocketSparseAttentionConfig`) and pass it to the `LLM` constructor.
To use sparse attention, configure a `BaseSparseAttentionConfig` subclass and pass it to the `LLM` constructor. Each algorithm has its own configuration class inheriting from `BaseSparseAttentionConfig`. To learn about the meaning of specific parameters, please refer to the docstring of the corresponding configuration class.
#### RocketKV
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import RocketSparseAttentionConfig, KvCacheConfig
# 1. Configure Sparse Attention
# Example: RocketKV configuration
rocket_config = RocketSparseAttentionConfig(
# 1. Configure sparse attention (RocketKV)
sparse_attention_config = RocketSparseAttentionConfig(
prompt_budget=2048,
kt_cache_dtype='float8_e5m2'
)
# 2. Configure KV Cache
# Note: Some sparse algorithms (like RocketKV) may require disabling block reuse
# 2. Configure KV cache
# Note: some framework-based algorithms may require disabling block reuse.
kv_config = KvCacheConfig(enable_block_reuse=False)
# 3. Initialize LLM
llm = LLM(
model="<path_to_model>",
backend='pytorch', # Currently requires the PyTorch backend
sparse_attention_config=rocket_config,
model="<path_or_hf_id>",
sparse_attention_config=sparse_attention_config,
kv_cache_config=kv_config,
)
@ -69,15 +73,49 @@ prompts = ["To be or not to be..."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```
### Usage with `trtllm-bench` or `trtllm-serve`
#### DSA
You can enable sparse attention in benchmarking and serving tools by providing a `sparse_attention_config` in an `extra_config.yaml` file.
```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import DeepSeekSparseAttentionConfig
**extra_config.yaml:**
# Example: DSA configuration (exact values depend on model + use case)
sparse_attention_config = DeepSeekSparseAttentionConfig(
    index_topk=64,
)
llm = LLM(
    model="<path_or_hf_id>",
    sparse_attention_config=sparse_attention_config,
)
```
#### Skip Softmax Attention
```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import SkipSoftmaxAttentionConfig
# One value for both phases:
sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor=1000.0)
# Or configure prefill/decode separately:
sparse_attention_config = SkipSoftmaxAttentionConfig(
    threshold_scale_factor={"prefill": 1000.0, "decode": 500.0}
)
llm = LLM(
    model="<path_or_hf_id>",
    sparse_attention_config=sparse_attention_config,
)
```
### Configure via YAML file
Besides Python API, you can also configure sparse attention via YAML file. This is typically more convenient in bash commands, such as `trtllm-serve` and `trtllm-eval`.
**Rocket KV**
```yaml
backend: pytorch
attn_backend: TRTLLM
sparse_attention_config: # RocketKV as an example
sparse_attention_config:
  algorithm: rocket
  kt_cache_dtype: float8_e5m2
  prompt_budget: 2048
@ -86,6 +124,23 @@ kv_cache_config:
enable_chunked_prefill: false
```
**DSA**
```yaml
sparse_attention_config:
  algorithm: dsa
  index_topk: 64
```
**Skip Softmax Attention**
```yaml
attn_backend: TRTLLM
sparse_attention_config:
  algorithm: skip_softmax
  threshold_scale_factor:
    prefill: 1000.0
    decode: 500.0
```
Run the command with the config file:
```bash
trtllm-bench/trtllm-serve --model <model_path> --config extra_config.yaml ...
@ -97,26 +152,41 @@ For example, users can evaluate a model with trtllm-eval on LongBenchV2 task lik
trtllm-eval --model <path_to_model> --config extra_config.yaml longbench_v2 --max_output_length 1024 ...
```
## Developer Guide
## Sparse Attention Implementation
This section describes the sparse attention framework architecture and guides developers on how to implement new sparse attention algorithms in TensorRT LLM. Unless otherwise specified, this framework primarily targets **MQA/GQA/MLA-based** attention mechanisms.
This section provides deeper technical details on how each algorithm of sparse attention is implemented in TensorRT LLM. If you just want to enable sparse attention, see [Quick Start](#quick-start) above.
### Architecture Overview
Conceptually, the currently available sparse attention algorithms can be categorized into two types:
- **Framework-level sparse attention**: uses TensorRT LLM's sparse-attention framework (prediction hooks + metadata) to drive sparse computation and/or KV-cache behavior. Examples: **RocketKV**, **DSA**.
- **Kernel-level sparse attention**: implemented directly inside the attention kernels, with no extra modification on the runtime logic. Example: **Skip Softmax Attention**.
### Framework-Level Sparse Attention
Framework-level sparse attention refers to methods that use TensorRT LLM's extensible sparse-attention framework—a set of prediction hooks and metadata interfaces that drive sparse computation and/or KV-cache behavior. Currently, **RocketKV** and **DSA** are the supported framework-level sparse attention algorithms in TensorRT LLM.
#### Overview
Attention scores often exhibit strong structure and sparsity: for many queries, only a small fraction of the historical tokens meaningfully contribute to the output. To exploit this, a wide range of approximate sparse-attention methods have been proposed. These methods can introduce sparsity along different dimensions (e.g., sequence, head, hidden). TensorRT LLM's **framework-level** support for sparse attention primarily targets approaches that leverage **token/sequence sparsity** in a GPU-friendly, structured way.
#### Architecture
This section describes the framework architecture and guides developers on how to implement new framework-level sparse attention algorithms in TensorRT LLM.
<div align="center">
<figure>
<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/media/sparse_attention_framework.png" width="800">
</figure>
</div>
<p align="center"><sub><em>Figure 1: The sparse attention framework in TensorRT LLM.</em></sub></p>
<p align="center"><sub><em>Figure 1: The framework support for sparse attention in TensorRT LLM.</em></sub></p>
Our goal is to design a general, extensible, and flexible sparse attention framework. In this framework, the attention operator provides unified APIs to support both **sparse computation** and **sparse KV cache** leveraging token sparsity, so that users and developers can focus solely on the sparse attention algorithm itself, i.e., how to accurately identify important query-key pairs.
For generality, TensorRT LLM abstracts sparse attention into a prediction-based workflow: *a prediction module first identifies the sparse indices (tokens/blocks to keep or attend to), which are then used by the subsequent attention operator*. Currently, for standard attention (MQA/GQA), TensorRT LLM supports **sparse KV cache** in the context phase and **sparse computation** in the generation phase. Different KV heads are allowed to use different sparse indices, while Q heads that map to the same KV head share the same sparse pattern. It does **not** yet support sparse computation in the context phase or sparse KV cache in the generation phase.
For the scalability, figure 1 illustrates the overall design. The architecture is built by inheriting from the existing `AttentionBackend` to define algorithm-specific sparse attention backends. Within these backends, `prediction` methods are implemented to generate the corresponding sparse indices. These indices are then passed as arguments to the `AttentionOp` to perform the sparse attention computation. This approach balances system flexibility with extensibility, allowing new algorithms to be integrated by simply defining their prediction logic **without** modifying the core attention kernels.
For scalability, Figure 1 illustrates the overall design. The architecture is built by inheriting from the existing `AttentionBackend` to define algorithm-specific sparse attention backends. Within these backends, `prediction` methods are implemented to generate the corresponding sparse indices. These indices are then passed as arguments to the `AttentionOp` to perform the sparse attention computation. This approach balances system flexibility with extensibility, allowing new algorithms to be integrated by simply defining their prediction logic **without** modifying the core attention kernels.
TensorRT LLM currently supports the following features:
TensorRT LLM currently supports the following features in the framework:
1. **Context Phase**:
* **sparse computation**: MLA
@ -126,7 +196,7 @@ TensorRT LLM currently supports the following features:
* **sparse computation**: MLA/MQA/GQA
* **sparse KV cache**: no support yet
### Framework Implementation
#### Framework Implementation
To hide the complexity of sparse algorithms, the main prediction logic is encapsulated within the `tensorrt_llm._torch.attention_backend` module.
@ -159,11 +229,11 @@ The key files located in `tensorrt_llm/_torch/attention_backend/sparse/` are:
In `AttentionOp`, currently, the MQA/GQA sparse attention only supports sparse computation at block granularity in the generation phase, where the block size equals the page size of the KV cache. This means we can skip the attention computation for unimportant pages. In addition, we provide a sparse MLA kernel that supports token-level sparse computation in both the context and generation phases.
To support those features, as illustrated in figure 2, we have implemented two kernels for the MQA/GQA path, `updateSparseKvCacheAfterFmha` and `gatherKvPageOffsetsKernel`, applied in the context and generation phases respectively:
To support those features, as illustrated in Figure 2, we have implemented two kernels for the MQA/GQA path, `updateSparseKvCacheAfterFmha` and `gatherKvPageOffsetsKernel`, applied in the context and generation phases respectively:
* **`updateSparseKvCacheAfterFmha`**: Invoked in the post-processing stage after the context attention computation. It selects the important KV tokens and writes those K/V vectors to the KV cache to reduce the KV cache size.
* **`gatherKvPageOffsetsKernel`**: Executed before the attention computation in the generation phase. It converts the input sparse indices (which can be of arbitrary granularity) into page-aligned indices. This means that if a single token is selected, the entire page it is included in the attention computation. After this conversion, we will get a new `kv_page_offsets` and also an updated `kv_len` that is the number of those selected KV tokens. Then these new metadata are fed into the subsequent attention kernel for computation.
* **`gatherKvPageOffsetsKernel`**: Executed before the attention computation in the generation phase. It converts the input sparse indices (which can be of arbitrary granularity) into page-aligned indices. This means that if a single token is selected, the entire page is included in the attention computation. After this conversion, we will get a new `kv_page_offsets` and also an updated `kv_len` that is the number of those selected KV tokens. Then these new metadata are fed into the subsequent attention kernel for computation.
For sparse MLA, the kernel supports token sparsity directly, eliminating the need for `gatherKvPageOffsetsKernel`. However, please note that sparse KV cache support is not yet available.
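
As a rough illustration of the page-alignment performed by `gatherKvPageOffsetsKernel`, here is a hypothetical NumPy sketch (not the actual CUDA implementation; handling of a partially filled last page is omitted):

```python
import numpy as np

def gather_kv_page_offsets(sparse_token_indices, kv_page_offsets, page_size):
    """Convert token-level sparse indices into page-aligned metadata:
    any page containing at least one selected token is kept entirely."""
    pages = np.unique(np.asarray(sparse_token_indices) // page_size)
    new_kv_page_offsets = np.asarray(kv_page_offsets)[pages]  # gathered page offsets
    new_kv_len = len(pages) * page_size                       # KV tokens after page alignment
    return new_kv_page_offsets, new_kv_len
```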
@ -175,7 +245,7 @@ Many sparse attention algorithms also require additional auxiliary memory. In th
Each option has its own advantages and disadvantages, please refer to the [Manage Auxiliary Memory Pool](#3-manage-auxiliary-memory-pool) for more details.
### Implementing a New Algorithm
#### Implementing a New Algorithm Inside the Sparse Attention Framework
#### 1. Configuration Class
@ -237,19 +307,38 @@ For tighter integration, you can manage the auxiliary memory within the C++ `KVC
* Register your config and backend in `tensorrt_llm/_torch/attention_backend/sparse/utils.py` and `tensorrt_llm/_torch/pyexecutor/_util.py` to ensure the system routes the request to your new backend when the config is present.
* Add initialization logic in `cpp/tensorrt_llm/thop/attentionOp.cpp` and `cpp/tensorrt_llm/kernels/sparseAttentionKernels.h` if new C++ level parameters are required.
## Summary and Future Work
#### Future Work
### Current Status
Currently, the status of the sparse attention framework is as follows:
1. **Supported Operations**: The `AttentionOp` currently supports **sparse KV cache** in the context phase and **sparse computation** in the generation phase. Other combinations (for example, sparse computation in the context phase) are not yet supported for MQA/GQA. For MLA, sparse computation is supported in both the context and generation phases.
2. **Algorithm Support**: RocketKV is supported in both the vanilla (PyTorch) backend and the TRTLLM backend, while DSA is supported in the TRTLLM backend. These implementations validate the generality and scalability of the framework.
### Future Work
* **Sparse Computation in Context Phase**: We plan to introduce sparse computation support for the context phase for MQA/GQA, allowing the TensorRT LLM sparse attention framework to cover more scenarios.
* **Dynamic Eviction in Generation Phase**: Dynamically evicting KV cache blocks during the generation phase poses significant challenges to KV cache flexibility. While difficult to implement in the current framework, block-level eviction appears to be a promising compromise and is under further exploration.
* **Unified Auxiliary Memory Management**: We are exploring a unified mechanism to manage auxiliary memory pools. This would allow users to define custom auxiliary spaces more flexibly while automatically inheriting advanced features from the KV cache, such as reuse and offloading.
* **Sparse Computation in Context Phase**: We plan to introduce sparse computation support for the context phase for MQA/GQA, allowing the framework to cover more scenarios.
* **Dynamic Eviction in Generation Phase**: Dynamically evicting KV cache blocks during the generation phase poses significant challenges to KV cache flexibility. Block-level eviction appears to be a promising compromise and is under exploration.
* **Unified Auxiliary Memory Management**: We are exploring a unified mechanism to manage auxiliary memory pools, allowing custom auxiliary spaces to automatically inherit advanced features from the KV cache (e.g., reuse, offloading).
* **Code Refactoring**: As more sparse attention algorithms are integrated, the framework will undergo refactoring to unify code and improve maintainability.
* **Optimizations**: We are discussing further optimizations, such as improving DSA performance.
### Kernel-Level Sparse Attention
Unlike framework-level methods, **kernel-level sparse attention** is implemented directly inside the attention kernels. There is no external prediction/gather workflow—the kernel itself decides what to skip based on runtime criteria.
**Skip Softmax Attention (BLASST)** is TensorRT LLM's kernel-level sparse attention method, supported on both **Hopper** and **Blackwell** GPUs for MHA/GQA/MLA, in both prefill and decode phases. It dynamically skips Softmax and BMM2 computation for KV blocks whose contribution falls below a threshold. Because the logic lives entirely inside the kernel, it requires no auxiliary data structures or framework hooks; you only need to set `threshold_scale_factor` in the config. As a result, there is no runtime overhead, and the attention kernel speedup is directly reflected in the end-to-end speedup.
For algorithm details and end-to-end results, please refer to the following resources:
- **Paper**: [BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding](https://arxiv.org/pdf/2512.12087)
- **NVIDIA developer blog**: [Accelerating Long-Context Inference with Skip Softmax Attention](https://developer.nvidia.com/blog/accelerating-long-context-inference-with-skip-softmax-in-nvidia-tensorrt-llm/)
- **Tech blog**: [Accelerating Long-Context Inference with Skip Softmax Attention](../blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md)
Skip Softmax Attention is supported only with the **trtllm** attention backend, implemented inside TensorRT-LLM's high-performance attention kernels:
- **Hopper prefill**: [fmha_v2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/kernels/fmha_v2)
- **Hopper decode**: [XQA](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/kernels/xqa)
- **Blackwell**: [trtllm-gen](https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/kernels/trtllmGenKernels)
### Summary
The following table compares the three sparse attention algorithms available in TensorRT LLM:
| Aspect | RocketKV | DSA | Skip Softmax |
|--------|----------|-----|--------------|
| Prefill Acceleration | No | Yes | Yes |
| Decode Acceleration | Yes | Yes | Yes |
| KV Cache Reduction | Yes | No | No |
| Framework-Level Support Required | Yes | Yes | No |
| Model Native | No | Yes | No |