From ab941afa2e15156a085fb092400daafec8e93d18 Mon Sep 17 00:00:00 2001
From: Bo Li <22713281+bobboli@users.noreply.github.com>
Date: Tue, 17 Feb 2026 22:37:18 +0800
Subject: [PATCH] [None][doc] Update media files path in Skip Softmax blog.
 (#11540)

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
---
 ...ng_Context_Inference_with_Skip_Softmax_Attention.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md b/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md
index 18fd3ff58f..57d6c9775a 100644
--- a/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md
+++ b/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md
@@ -29,7 +29,7 @@ In this way, we can indirectly control the sparsity via the threshold. The thres
 The method is fully dynamic, and can be applied to both the prefilling and decoding. The algorithm of Skip Softmax Attention is described in the paper [BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding](https://arxiv.org/pdf/2512.12087). We have also published a [Developer Blog](https://developer.nvidia.com/blog/accelerating-long-context-inference-with-skip-softmax-in-nvidia-tensorrt-llm/) for explanation. Please refer to these resources for in-depth dive into the algorithm details. We will focus on the application of Skip Softmax Attention in TensorRT-LLM to accelerate long-context inference.

-BLASST Illustration
+BLASST Illustration

 ## Example Usage
@@ -152,16 +152,16 @@ The following figures plot **speedup vs. achieved sparsity**, on top of the base

Hopper (H200)

Prefill

-Hopper prefill kernel
+Hopper prefill kernel

Decode

-Hopper decode kernel
+Hopper decode kernel

Blackwell (B200)

Prefill

-Blackwell prefill kernel
+Blackwell prefill kernel

Decode

-Blackwell decode kernel
+Blackwell decode kernel
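The hunk context above summarizes the idea behind Skip Softmax Attention: sparsity is controlled indirectly by skipping score blocks whose softmax contribution falls below a threshold. The following is a minimal NumPy sketch of that thresholding criterion for a single query, folded into an online-softmax loop. The function name `skip_softmax_attention`, the `threshold` and `block` defaults, and the single-query formulation are illustrative assumptions for this sketch; this is not the TensorRT-LLM kernel or the exact BLASST algorithm.

```python
# Illustrative sketch (NOT TensorRT-LLM code): skip KV blocks whose softmax
# weights are provably below `threshold` relative to the running row max.
import numpy as np

def skip_softmax_attention(q, k, v, threshold=1e-4, block=64):
    """q: (d,), k/v: (n, d). Returns (attention output, fraction of blocks skipped)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = -np.inf                                   # running max of scores seen so far
    acc = np.zeros_like(v[0], dtype=np.float64)   # unnormalized P @ V accumulator
    denom = 0.0                                   # softmax denominator
    skipped, n_blocks = 0, 0
    for start in range(0, k.shape[0], block):
        n_blocks += 1
        s = (k[start:start + block] @ q) * scale  # scores for this KV block
        s_max = s.max()
        # Even the largest score in this block would get softmax weight
        # < threshold relative to the running max, so skip the whole block.
        if np.exp(s_max - m) < threshold:
            skipped += 1
            continue
        m_new = max(m, s_max)
        correction = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        acc = acc * correction + p @ v[start:start + block]
        denom = denom * correction + p.sum()
        m = m_new
    return acc / denom, skipped / n_blocks
```

With `threshold=0.0` no block is ever skipped and the result reduces to dense softmax attention; raising the threshold trades a controlled amount of accuracy for more skipped blocks, which is the sparsity knob the blog describes.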