diff --git a/docs/source/blogs/media/tech_blog16_blasst.jpg b/docs/source/blogs/media/tech_blog16_blasst.jpg
new file mode 100644
index 0000000000..4b96efd03c
Binary files /dev/null and b/docs/source/blogs/media/tech_blog16_blasst.jpg differ
diff --git a/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax.md b/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax.md
new file mode 100644
index 0000000000..8a7fd0efde
--- /dev/null
+++ b/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax.md
@@ -0,0 +1,227 @@
+# Skip Softmax Attention: A Drop-in Sparse Attention Technique for Accelerating Long-Context Inference
+
+In the previous [tech blog](https://github.com/heyuhhh/TensorRT-LLM/blob/user/yuhangh/add_sprase_attention_tech_blog/docs/source/blogs/tech_blog/blog15_Sparse_Attention_in_TensorRT-LLM.md) (TODO: Update link), we introduced the framework that supports sparse attention in TensorRT-LLM. The methods covered there, whether KV cache compression after the context phase or sparse token prediction in the generation phase, all require **runtime** modifications, so they are relatively complex to implement and apply. More importantly, the additional operations they perform on top of full attention bring computational overhead, which can erode the performance gain of the core attention computation. Whether those methods are beneficial therefore depends on the scenario; for example, if the context is not long enough, enabling them may even hurt performance. Skip Softmax, in contrast, is purely an approximation inside the attention kernel computation, which makes it compatible with nearly all other features, such as FP8 attention, KV cache reuse, and chunked prefill.
+
+In this blog, we introduce **Skip Softmax Attention**, a drop-in sparse attention technique that is fully compatible with the Flash Attention algorithm and only requires modifying the existing **attention kernels**. Compared to full attention, the end-to-end performance gain is therefore more predictable.
+
+## Method Overview
+
+The idea of Skip Softmax Attention is to compare the local maximum ($\tilde{m}_i^{(j)}$) of $Q \cdot K^T$ with the running global maximum ($m_i^{(j)}$), and to skip the softmax (exp) and BMM2 computation for blocks that fall below a certain threshold $\lambda$:
+$$
+\tilde{m}_i^{(j)} - m_i^{(j)} < \lambda.
+$$
+In this way, the sparsity is controlled indirectly via the threshold. Note that the threshold needed for a given sparsity shrinks as the context grows: the longer the context, the smaller the threshold required to reach the same sparsity.
+
+The method is fully dynamic and applies to both prefill and decode. The algorithm is described in the paper [BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding](https://arxiv.org/pdf/2512.12087), and we have also published a [Developer Blog](https://developer.nvidia.com/blog/accelerating-long-context-inference-with-skip-softmax-in-nvidia-tensorrt-llm/) explaining it. Please refer to these resources for an in-depth dive into the algorithm details. Here we focus on applying Skip Softmax Attention in TensorRT-LLM to accelerate long-context inference.
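+
+To make the skipping criterion concrete, below is a minimal NumPy sketch of the online-softmax loop for one query block with the skip check added. It is an illustration only, not the TensorRT-LLM kernel: the function name, block size, and the (negative) value of $\lambda$ are placeholders, the per-block decision here uses a simple all-rows check, and the real implementation fuses this logic into the attention kernels.
+
+```python
+import numpy as np
+
+def skip_softmax_attention_block(q, k, v, lam=-10.0, block_size=64):
+    """Single-head attention for one query block with BLASST-style block skipping.
+
+    q: (Bq, d) query block; k, v: (S, d) keys / values.
+    lam: negative log-space threshold; a key block is skipped when
+    local_max - running_max < lam, i.e. its softmax contribution is
+    at most exp(lam) per element.
+    """
+    scale = 1.0 / np.sqrt(q.shape[-1])
+    m = np.full(q.shape[0], -np.inf)            # running row max, m_i^(j)
+    l = np.zeros(q.shape[0])                    # running softmax denominator
+    acc = np.zeros_like(q, dtype=np.float64)    # unnormalized output accumulator
+    skipped = 0
+
+    for start in range(0, k.shape[0], block_size):
+        kj = k[start:start + block_size]
+        vj = v[start:start + block_size]
+        s = (q @ kj.T) * scale                  # BMM1 is always computed
+        m_local = s.max(axis=-1)                # local max, \tilde{m}_i^(j)
+        m_new = np.maximum(m, m_local)
+
+        if np.all(m_local - m_new < lam):       # the criterion from above
+            skipped += 1                        # no exp, no BMM2, no rescaling
+            continue
+
+        p = np.exp(s - m_new[:, None])          # softmax numerator (exp)
+        corr = np.exp(m - m_new)                # rescale earlier partial results
+        acc = acc * corr[:, None] + p @ vj      # BMM2
+        l = l * corr + p.sum(axis=-1)
+        m = m_new
+
+    return acc / l[:, None], skipped
+```
+
+Note that the first key block can never be skipped (its local maximum equals the running maximum), and any skipped block contributes at most $e^{\lambda}$ per softmax element, which is why the approximation error stays controlled.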
+
+![tech_blog16_blasst.jpg](../media/tech_blog16_blasst.jpg)
+
+## Example Usage
+
+Enabling Skip Softmax Attention is simple: configure a `SkipSoftmaxAttentionConfig` and pass it to the `LLM` API:
+
+```python
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import SkipSoftmaxAttentionConfig
+
+sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor=1000.0)
+
+# Alternatively, threshold_scale_factor can be configured separately for prefill and decode.
+sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor={"prefill": 1000.0, "decode": 500.0})
+
+llm = LLM(
+    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
+    sparse_attention_config=sparse_attention_config,
+    # Other LLM arguments...
+)
+```
+
+The configuration can also be specified through the extra LLM API options YAML file. The sketch below launches an OpenAI-compatible endpoint with `trtllm-serve`; the YAML keys mirror the `SkipSoftmaxAttentionConfig` fields shown above, but please check the LLM API reference of your TensorRT-LLM version for the exact schema:
+
+```bash
+# The algorithm selector and field names below are indicative; consult the
+# sparse attention documentation of your release for the exact YAML schema.
+cat >extra_llm_api_options.yaml <<EOF
+sparse_attention_config:
+  algorithm: SKIP_SOFTMAX
+  threshold_scale_factor: 1000.0
+EOF
+
+trtllm-serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
+    --extra_llm_api_options extra_llm_api_options.yaml
+```
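+
+Once the endpoint is up, any OpenAI-compatible client can query it. The snippet below is a minimal sketch that assumes the default `trtllm-serve` host and port (`localhost:8000`) and the model name used above; adjust both to match your deployment.
+
+```python
+from openai import OpenAI
+
+# Base URL and API key are placeholders: trtllm-serve listens on port 8000 by
+# default and the key is not checked for a local deployment.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
+    messages=[
+        {"role": "user", "content": "Summarize the key idea of sparse attention in three sentences."},
+    ],
+    max_tokens=256,
+)
+print(response.choices[0].message.content)
+```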