TensorRT-LLMs/docs/source/advanced/kv-cache-management.md
amirkl94 fbec0c3552
Release 0.20 to main (#4577)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-28 16:25:33 +08:00

76 lines
4.3 KiB
Markdown

(kv-cache-management)=
# KV Cache Management: Pools, Blocks, and Events
This document provides an overview of the internal hierarchy and event system for paged KV cache management, as implemented in the TensorRT-LLM codebase.
For more information on KV cache reuse see [KV cache reuse](kv-cache-reuse.md).
---
## Hierarchy: Pool, Block, and Page
### **Block**
- **Definition:** The smallest unit of KV cache allocation. A `KVCacheBlock` holds metadata (not the actual data) for a chunk of KV cache.
- **Purpose:** Each block represents a fixed number of tokens' worth of KV data (can be specified by `tokens_per_block` parameter).
- **Usage:** Blocks are allocated, reused, or evicted as sequences are processed.
### **Page**
- **Definition:** In this codebase, "page" is often used interchangeably with "block" (as in "paged KV cache"), but technically, a page could refer to a memory page (hardware-level), while a block is a logical unit for the cache.
- **In Practice:** The code uses "block" as the main unit; "page" is not a distinct class or struct.
### **Pool**
- **Definition:** A pool is a contiguous memory buffer (or set of buffers) that holds the actual KV data for one or more layers.
- **Types:** There are primary pools (fast GPU memory) and secondary pools (slower, e.g., CPU or offload memory).
- **Organization:** Each pool can serve multiple layers that share the same KV head configuration. Pools are managed by `KVCacheBlockPool` and tracked in vectors in `WindowBlockManager`.
- **Block ↔ Pool:** Each block is an index into a pool; the pool provides the actual storage, while the block is the metadata handle.
### **WindowBlockManager/BlockManager**
TRT-LLM supports 2 complex features related to KV cache management:
1. **Variable Group-Query Attention (VGQA)** - i.e. a different `num_kv_heads` value for different layers.
2. **Variable Sliding Window Attention (VSWA)** - i.e. a different `attention_window_size` value for different layers.
In order to support both of these features, the pool management works as described below.
But in the simple, *most common case*, for most models, where
1. [MHA/MQA/Non-variable GQA](gpt-attention.md#multi-head-multi-query-and-group-query-attention), i.e., same `num_kv_heads` value for all layers,
2. Global attention/[SWA](gpt-attention.md#sliding-window-attention-cyclic-rolling-buffer-kv-cache), i.e., same `attention_window_size` value for all layers,
only a *single* pool will be created within the structure described below.
#### KV Cache Pool Management
- **WindowBlockManager:** Manages blocks and pools for a specific attention window size. Within a `WindowBlockManager`, there can be multiple pools - each corresponding a unique number of KV heads - i.e., to support VGQA.
- **BlockManager:** Manages all `WindowBlockManager` instances, one per unique window size.
**Hierarchy Summary:**
- **Pool** (memory buffer for KV data)
- Contains many blocks.
- **Blocks** (metadata for a chunk of the pool, each block = `tokens_per_block` tokens)
- (Optionally, blocks can be swapped between primary/secondary pools.)
- **BlockManager/WindowBlockManager**: Manage pools and blocks, handle allocation, reuse, and eviction.
---
## Events in `KVCacheEventManager`
The `KVCacheEventManager` is responsible for tracking and reporting significant changes in the state of the KV cache. Events are used for logging, debugging, or possibly for external monitoring.
### **Types of Events**
- **Created Event:** When pools or blocks are created/allocated.
- **Updated Event:** When a block's state changes (e.g., moved between primary/secondary, priority updated).
- **Removed Event:** When a block is removed from the cache (evicted or released).
- **Stored Event:** When blocks are stored for potential reuse (e.g., after a sequence finishes and its blocks are reusable).
### **What Triggers an Event?**
- **Allocation/Deallocation:** Creating or freeing memory pools or blocks.
- **Eviction/Reuse:** When a block is evicted, reused, or its priority changes.
- **Block Movement:** When a block is moved between memory levels (primary ↔ secondary).
- **Block Storage:** When blocks are stored for future reuse (e.g., after a sequence completes).
**In summary:**
An "event" is any significant change in the lifecycle or state of a KV cache block or pool, tracked for monitoring, debugging, or optimization purposes.
---