mirror of https://github.com/vllm-project/vllm.git synced 2026-06-06 00:16:14 +00:00

Chauncey faab189554 [Feature]: IndexCache support for DSA models (#37735)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

2026-04-29 15:15:35 -04:00

IndexCache

IndexCache reduces redundant top-k computation in DeepSeek-V3.2 (DSA) models by caching and reusing top-k indices across layers.

Background

DeepSeek-V3.2 uses DeepSeek Sparse Attention (DSA), in which every layer computes its own top-k token selection. In deep models this per-layer computation can be expensive. IndexCache skips redundant top-k computations by reusing the indices computed by an earlier layer.

See: IndexCache Paper

Usage

CLI

vllm serve deepseek-ai/DeepSeek-V3.2 \
    --hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...

Configuration Reference

Parameter	Type	Default	Description
`use_index_cache`	bool	false	Enables IndexCache. The feature stays off unless this is set to true
`index_topk_freq`	int	1	Frequency (in layers) at which top-k is computed. 1 = compute on every layer (no reuse, so caching is effectively disabled), 4 = compute on 1/4 of layers
`index_topk_pattern`	str	null	Per-layer F/S pattern. Overrides `index_topk_freq` if set. Each character maps to one DSA layer: F = Full (compute top-k), S = Shared (reuse cached indices)
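The table does not say which layers compute top-k for a given `index_topk_freq`. The sketch below assumes, purely for illustration, that layer `i` computes when `i % freq == 0`. `pattern_from_freq` is a hypothetical helper, not part of vLLM, and the actual layer selection may differ.

```python
def pattern_from_freq(num_layers: int, freq: int) -> str:
    """Expand a frequency into an F/S string (hypothetical helper).

    ASSUMPTION: layer i is a Full layer when i % freq == 0. This is an
    illustrative guess, not vLLM's documented behavior.
    """
    if num_layers < 0 or freq < 1:
        raise ValueError("num_layers must be >= 0 and freq must be >= 1")
    return "".join("F" if i % freq == 0 else "S" for i in range(num_layers))


# freq=1 marks every layer Full, i.e. nothing is reused.
print(pattern_from_freq(8, 1))
# freq=4 marks one layer in four Full under the assumption above.
print(pattern_from_freq(8, 4))
```

If the layers you want to recompute do not follow a regular spacing, `index_topk_pattern` gives explicit per-layer control instead.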

Configuration Examples

Using index_topk_freq (compute every N layers):

vllm serve deepseek-ai/DeepSeek-V3.2 \
    --hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...
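The value passed to `--hf-overrides` is JSON, so booleans are lowercase (`true`) and the whole string needs shell quoting. A minimal sketch of generating the command from Python using only the standard library follows; `build_serve_command` is a hypothetical helper, and any further `vllm serve` arguments (the `...` above) are left out.

```python
import json
import shlex


def build_serve_command(model: str, overrides: dict) -> str:
    """Return a shell-quoted `vllm serve` command line (hypothetical helper)."""
    # json.dumps turns Python True into JSON true; shlex.join quotes the
    # JSON so that the shell passes it through as a single argument.
    args = ["vllm", "serve", model, "--hf-overrides", json.dumps(overrides)]
    return shlex.join(args)


cmd = build_serve_command(
    "deepseek-ai/DeepSeek-V3.2",
    {"use_index_cache": True, "index_topk_freq": 4},
)
print(cmd)
```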

Using index_topk_pattern (explicit per-layer control):

# custom pattern for 61 layers: F = compute, S = reuse
vllm serve deepseek-ai/DeepSeek-V3.2 \
    --hf-overrides '{"use_index_cache": true, "index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSF"}'
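A long hand-written pattern is easy to get wrong, so it can help to sanity-check it before launching the server. The sketch below is a hypothetical stand-alone check, not a vLLM API. It assumes that the pattern length should equal the number of DSA layers and that the first layer must be F, since a Shared layer needs earlier indices to reuse.

```python
def summarize_pattern(pattern: str, num_dsa_layers: int) -> dict:
    """Validate an F/S pattern and count each layer kind (hypothetical helper)."""
    unknown = set(pattern) - {"F", "S"}
    if unknown:
        raise ValueError(f"unexpected characters in pattern: {sorted(unknown)}")
    # ASSUMPTION: one character per DSA layer, so the lengths must match.
    if len(pattern) != num_dsa_layers:
        raise ValueError(
            f"pattern has {len(pattern)} characters, expected {num_dsa_layers}"
        )
    # ASSUMPTION: the first layer has nothing cached, so it must be Full.
    if pattern and pattern[0] != "F":
        raise ValueError("first layer must be 'F'")
    full = pattern.count("F")
    return {"full": full, "shared": len(pattern) - full}


summary = summarize_pattern("FSSF", num_dsa_layers=4)
print(summary)
```

The `full` count is the number of layers that still run top-k, which gives a rough sense of how much computation the pattern saves.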

How It Works

When IndexCache is enabled, layers marked "F" (Full) compute the top-k indices and store them.
Subsequent layers marked "S" (Shared) receive the cached indices from the previous layer instead of recomputing them.
The cached indices are passed along the layer stack, so top-k runs only on the "F" layers.
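The steps above can be illustrated with a toy simulation in plain Python. This is a sketch of the caching idea only, not vLLM's implementation: `run_layers`, `topk_indices`, and the per-layer score lists are all made up for the example.

```python
import heapq


def topk_indices(scores, k):
    """Return the indices of the k largest scores, highest score first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def run_layers(pattern, per_layer_scores, k):
    """Walk the layer stack, computing top-k on F layers and reusing on S."""
    cached = None   # indices from the most recent Full layer
    computed = 0    # how many layers actually ran top-k
    selected = []   # indices used by each layer
    for kind, scores in zip(pattern, per_layer_scores):
        if kind == "F":
            cached = topk_indices(scores, k)
            computed += 1
        elif cached is None:
            raise ValueError("shared layer has no cached indices to reuse")
        # Any non-F layer is treated as Shared and reuses the cached indices.
        selected.append(cached)
    return selected, computed


scores = [[0.1, 0.9, 0.5], [0.9, 0.1, 0.2], [0.3, 0.2, 0.8]]
selected, computed = run_layers("FSF", scores, k=2)
print(selected, computed)
```

In this toy run the middle layer reuses the indices chosen by the first layer even though its own scores would rank the tokens differently, which is the trade-off IndexCache makes to skip the computation.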

Requirements

A DeepSeek-V3.2 or compatible DSA model
use_index_cache set to true via --hf-overrides
