commit 02a6d177f3 (parent fb48dcc5c3)

Update.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
@ -13,8 +13,9 @@ $$
In this way, we can indirectly control the sparsity via the threshold. Note that the threshold is inversely proportional to the context length: the longer the context, the smaller the threshold needed to achieve the same sparsity.
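
To make the inverse relationship concrete, below is a minimal NumPy sketch (illustrative only, not TensorRT-LLM code: the Gaussian logits and the `0.1 / context_len` scaling are assumptions for the demo). With a fixed threshold, the pruned fraction grows with the context length, while a threshold scaled by `1/N` keeps it roughly constant:

```python
# Illustrative sketch: random Gaussian logits stand in for real attention
# scores, and the 0.1 / context_len scaling is an assumed choice for the
# demo, not the TensorRT-LLM heuristic.
import numpy as np

def sparsity_at_threshold(scores: np.ndarray, threshold: float) -> float:
    """Fraction of softmax probabilities that fall below `threshold`."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return float((probs < threshold).mean())

rng = np.random.default_rng(0)
for context_len in (1_024, 8_192, 65_536):
    scores = rng.normal(size=context_len)  # one query's attention logits
    fixed = sparsity_at_threshold(scores, threshold=1e-4)
    scaled = sparsity_at_threshold(scores, threshold=0.1 / context_len)
    print(f"N={context_len:>6}  fixed threshold: {fixed:.2f}  scaled threshold: {scaled:.2f}")
```
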
The method is fully dynamic and can be applied to both prefill and decode. The Skip Softmax Attention algorithm is described in the paper [BLASST: Dynamic Blocked Attention Sparsity via Softmax Thresholding](https://arxiv.org/pdf/2512.12087), and we have also published a [Developer Blog](https://developer.nvidia.com/blog/accelerating-long-context-inference-with-skip-softmax-in-nvidia-tensorrt-llm/) explaining it. Please refer to these resources for an in-depth dive into the algorithm details. Here, we focus on applying Skip Softmax Attention in TensorRT-LLM to accelerate long-context inference.
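
As a rough single-query illustration of the skipping criterion (a toy NumPy sketch under one plausible reading of the paper's thresholding rule, not the fused TensorRT-LLM kernel, which tracks the running maximum online), a K/V block can be skipped outright when the largest softmax weight it could contribute is already below the threshold:

```python
# Toy sketch of blocked attention with softmax thresholding for one query.
# Simplifications vs. a real kernel: the global max is computed up front
# rather than tracked online, and there is no masking or multi-head logic.
import numpy as np

def skip_softmax_attention(q, K, V, block_size=64, threshold=1e-4):
    scores = (K @ q) / np.sqrt(q.shape[0])  # attention logits, shape (seq_len,)
    score_max = scores.max()                # real kernels track this online
    out, denom, skipped = np.zeros(V.shape[1]), 0.0, 0
    for start in range(0, len(scores), block_size):
        blk = slice(start, start + block_size)
        # Upper bound on any softmax weight this block can contribute:
        if np.exp(scores[blk].max() - score_max) < threshold:
            skipped += 1                    # skip the exp() and P @ V work
            continue
        w = np.exp(scores[blk] - score_max)
        out += w @ V[blk]
        denom += w.sum()
    return out / denom, skipped

rng = np.random.default_rng(0)
seq_len, head_dim = 4096, 64
q = rng.normal(size=head_dim)
K = rng.normal(size=(seq_len, head_dim))
K[:256] += 2.0 * q                          # a few "important" keys align with q
V = rng.normal(size=(seq_len, head_dim))
out, skipped = skip_softmax_attention(q, K, V)

# Dense reference to confirm the skipped blocks were indeed negligible.
s = (K @ q) / np.sqrt(head_dim)
p = np.exp(s - s.max()); p /= p.sum()
print(f"skipped {skipped}/{seq_len // 64} blocks, "
      f"max abs error vs dense: {np.abs(out - p @ V).max():.1e}")
```
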
<p align="center">
-<img src="../media/tech_blog16_blasst.jpg" alt="BLASST Illustration" style="width: 50%; min-width: 300px;" />
+<img src="../media/tech_blog16_blasst.jpg" alt="BLASST Illustration" style="width: 50%; min-width: 300px; display: block; margin: auto;" />
</p>
## Example Usage