mirror of
https://github.com/vllm-project/vllm.git
synced 2026-06-06 00:16:14 +00:00
faab189554
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
IndexCache
IndexCache reduces redundant top-k computation in DeepSeek-V3.2 (DSA) models by caching and reusing top-k indices across layers.
Background
DeepSeek-V3.2 uses DeepSeek Sparse Attention (DSA), a mechanism in which top-k token selection is computed per layer. In a deep model, repeating this selection on every layer can be expensive. IndexCache skips redundant top-k computations by reusing indices computed by earlier layers.
See: IndexCache Paper
Usage
CLI
vllm serve deepseek-ai/DeepSeek-V3.2 \
--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...
Configuration Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_index_cache | bool | false | Enable IndexCache. Must be set to true to use this feature |
| index_topk_freq | int | 1 | Frequency (in layers) at which top-k is computed. 1 = compute on every layer (disabled), 4 = compute on 1/4 of layers |
| index_topk_pattern | str | null | Per-layer F/S pattern. Overrides index_topk_freq if set. Each character maps to one DSA layer: F = Full, S = Shared |
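
As a rough illustration of how the two parameters relate, the sketch below expands a frequency into an F/S pattern and lets an explicit pattern take precedence. The helper names `expand_freq_to_pattern` and `resolve_pattern` are hypothetical and not part of vLLM. The choice that the first layer of each group of `freq` layers is the one that computes top-k is an assumption made for this example, not something confirmed by the implementation.

```python
from typing import Optional


def expand_freq_to_pattern(num_layers: int, freq: int) -> str:
    """Hypothetical helper: turn a frequency into an F/S pattern string."""
    if freq < 1:
        raise ValueError("freq must be >= 1")
    # Assumption: the first layer of every group of `freq` layers computes
    # top-k ("F"); the remaining layers in the group reuse it ("S").
    return "".join("F" if i % freq == 0 else "S" for i in range(num_layers))


def resolve_pattern(num_layers: int, freq: int = 1,
                    pattern: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit pattern overrides the frequency."""
    if pattern is not None:
        if len(pattern) != num_layers:
            raise ValueError("pattern needs one character per DSA layer")
        if set(pattern) - {"F", "S"}:
            raise ValueError("pattern may only contain 'F' and 'S'")
        return pattern
    return expand_freq_to_pattern(num_layers, freq)


# freq = 1 computes on every layer; freq = 4 computes on 1/4 of the layers.
every_layer = resolve_pattern(8, freq=1)
quarter = resolve_pattern(8, freq=4)
explicit = resolve_pattern(4, freq=2, pattern="FFSS")
```

Under these assumptions, a frequency is just shorthand for a regular pattern, which is why an explicit pattern can replace it entirely.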
Configuration Examples
Using index_topk_freq (compute every N layers):
vllm serve deepseek-ai/DeepSeek-V3.2 \
--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...
Using index_topk_pattern (explicit per-layer control):
# custom pattern for 61 layers: F = compute, S = reuse
vllm serve deepseek-ai/DeepSeek-V3.2 \
--hf-overrides '{"use_index_cache": true, "index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSF"}'
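
Long patterns are easy to mistype, so it may help to generate the `--hf-overrides` value rather than write it by hand. The following is a minimal sketch using only the Python standard library. `build_hf_overrides` is a hypothetical helper, and the rule that the pattern must start with `F` is an assumption of this sketch (an `S` layer with no earlier `F` layer would have nothing to reuse).

```python
import json
import shlex


def build_hf_overrides(pattern: str) -> str:
    """Hypothetical helper: validate a pattern and serialize the overrides."""
    unexpected = set(pattern) - {"F", "S"}
    if unexpected:
        raise ValueError(f"unexpected characters in pattern: {sorted(unexpected)}")
    # Assumption of this sketch: the first layer has no cached indices to reuse.
    if not pattern.startswith("F"):
        raise ValueError("pattern should start with 'F'")
    return json.dumps({"use_index_cache": True, "index_topk_pattern": pattern})


pattern = "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSF"
overrides = build_hf_overrides(pattern)

# Summarize the pattern, then print a shell-safe argument for the CLI.
print(f"layers={len(pattern)} full={pattern.count('F')} shared={pattern.count('S')}")
print("--hf-overrides", shlex.quote(overrides))
```

The printed argument can be pasted after `vllm serve deepseek-ai/DeepSeek-V3.2`. `shlex.quote` wraps the JSON in single quotes so the shell passes it through unchanged.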
How It Works
- When IndexCache is enabled, layers marked with "F" (Full) calculate and store top-k indices
- Subsequent layers marked with "S" (Shared) receive the cached indices from the previous layer instead of recomputing
- The cached indices are passed through the layer stack, reducing total computation
Requirements
- DeepSeek-V3.2 or compatible DSA model
- use_index_cache: true via --hf-overrides
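
To make the F/S flow described under How It Works concrete, here is a small pure-Python simulation. It is only a sketch of the caching idea, not vLLM's implementation: `run_layers` and `fake_topk` are hypothetical names, and the "indices" are placeholder lists rather than real tensors.

```python
from typing import Callable, List, Optional


def run_layers(pattern: str,
               compute_topk: Callable[[int], List[int]]) -> List[List[int]]:
    """Return the top-k indices each layer would use under the given pattern."""
    cached: Optional[List[int]] = None
    used: List[List[int]] = []
    for layer_idx, mode in enumerate(pattern):
        if mode == "F":
            # Full layer: compute top-k and store it for later layers.
            cached = compute_topk(layer_idx)
        elif mode == "S":
            # Shared layer: reuse whatever the last Full layer stored.
            if cached is None:
                raise ValueError("an 'S' layer needs an earlier 'F' layer")
        else:
            raise ValueError(f"unknown layer mode: {mode!r}")
        used.append(cached)
    return used


calls: List[int] = []


def fake_topk(layer_idx: int) -> List[int]:
    """Stand-in for the real top-k selection; records which layers ran it."""
    calls.append(layer_idx)
    return [layer_idx, layer_idx + 1]


indices = run_layers("FSSF", fake_topk)
```

With the pattern `FSSF`, the stand-in top-k function runs only for layers 0 and 3, while layers 1 and 2 reuse the indices stored by layer 0.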