[None][doc] Update Skip Softmax attention blog. (#11443)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
parent 8ebd6056fa
commit 18c992efb1
@@ -22,7 +22,7 @@ In this blog, we introduce **Skip Softmax Attention**, a drop-in sparse attentio
The idea of Skip Softmax Attention is to compare the local maximum $\tilde{m}_i^{(j)}$ of $Q \cdot K^T$ with the running global maximum $m_i^{(j)}$, and skip the softmax (exp) and BMM2 calculation for blocks that are below a certain threshold $\lambda$:

-$$\tilde{m}_i^{(j)} - m_i^{(j)} < \lambda$$
+$$\exp(\tilde{m}_i^{(j)} - m_i^{(j)}) < \lambda$$

In this way, we can indirectly control the sparsity via the threshold. The threshold is set to be inversely proportional to the context length, i.e., the longer the context, the smaller the threshold needed to achieve the same sparsity.
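To make the updated criterion concrete, below is a minimal NumPy sketch of how such a check could sit inside an online-softmax loop over KV blocks: the block-local maximum of $Q \cdot K^T$ is compared against the running global maximum, and the exp and BMM2 work for that block is skipped when $\exp(\tilde{m}_i^{(j)} - m_i^{(j)}) < \lambda$. This is our illustration, not the TensorRT-LLM kernel code; the function name, shapes, block size, and threshold value are assumptions.

```python
import numpy as np

def skip_softmax_attention(q, k, v, threshold=1e-4, block=128):
    """Single query row q: (d,); keys/values k, v: (S, d). Illustrative only."""
    scale = 1.0 / np.sqrt(q.shape[0])
    m = -np.inf                                   # running global maximum m_i
    denom = 0.0                                   # running softmax denominator
    acc = np.zeros_like(v[0], dtype=np.float64)   # running BMM2 accumulator

    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (kb @ q) * scale          # BMM1 for this block (always computed)
        m_local = s.max()             # block-local maximum \tilde{m}_i

        # Skip Softmax criterion: exp(m_local - m) < lambda  =>  skip exp + BMM2.
        if np.exp(m_local - m) < threshold:
            continue

        m_new = max(m, m_local)
        correction = np.exp(m - m_new)       # rescale earlier partial results
        p = np.exp(s - m_new)                # softmax numerators for this block
        denom = denom * correction + p.sum()
        acc = acc * correction + p @ vb      # BMM2 for this block
        m = m_new

    return acc / denom

# Toy usage; a longer context would pair with a smaller threshold for the same sparsity.
rng = np.random.default_rng(0)
S, d = 1024, 64
out = skip_softmax_attention(rng.standard_normal(d),
                             rng.standard_normal((S, d)),
                             rng.standard_normal((S, d)))
```

Note that the block scores are computed before the check, so BMM1 is always paid; only the exponentiation and the $P \cdot V$ accumulation are skipped, which is what bounds the achievable kernel-level speedup discussed in the conclusion below.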
@@ -273,4 +273,4 @@ throughput --dataset ${OUTPUT_DIR}/dumped_ids.json \

## Conclusion

-Skip Softmax Attention is a kernel-based solution for accelerating attention. Because BMM1 ($Q \cdot K^T$) in the attention kernel is not skipped by design, the performance gain is capped at 1.8x at the kernel level. Nevertheless, it excels at achieving high sparsity with minimal accuracy degradation, and it is especially effective in medium-to-long context scenarios that previous methods such as MInference cannot handle well, because the runtime overhead they introduce may not be repaid by the speedup of the attention kernel. The drop-in nature of Skip Softmax Attention makes it a flexible, easy-to-use method for accelerating long-context inference. The Skip Softmax Attention kernels will also be available in FlashInfer for adoption by the open-source community.
+Skip Softmax Attention is a kernel-based solution for accelerating attention. Because BMM1 ($Q \cdot K^T$) in the attention kernel is not skipped by design, the performance gain is capped at 1.8x at the kernel level. Nevertheless, it excels at achieving high sparsity with minimal accuracy degradation, and it is especially effective in medium-to-long context scenarios that previous methods such as MInference cannot handle well, because the runtime overhead they introduce may not be repaid by the speedup of the attention kernel. The drop-in nature of Skip Softmax Attention makes it a flexible, easy-to-use method for accelerating long-context inference. The Skip Softmax Attention kernels will also be [available in FlashInfer](https://github.com/flashinfer-ai/flashinfer/issues/2306) for adoption by the open-source community.
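As a back-of-the-envelope reading of the 1.8x cap (our illustration, not a figure stated by the kernel authors): if BMM1 takes an assumed fraction $f$ of the attention kernel time and sparsity $s$ applies only to the remaining exp/BMM2 work, an Amdahl-style bound gives

$$\text{speedup} = \frac{1}{f + (1 - f)(1 - s)} \le \frac{1}{f},$$

so a ceiling of roughly 1.8x would correspond to BMM1 accounting for about 55% of the unskipped kernel time.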