(kv-cache-management)=

# KV Cache Management: Pools, Blocks, and Events

This document provides an overview of the internal hierarchy and event system for paged KV cache management, as implemented in the TensorRT-LLM codebase.

For more information on KV cache reuse, see [KV cache reuse](kv-cache-reuse.md).

---
## Hierarchy: Pool, Block, and Page
### **Block**

- **Definition:** The smallest unit of KV cache allocation. A `KVCacheBlock` holds metadata (not the actual data) for a chunk of KV cache.
- **Purpose:** Each block holds a fixed number of tokens' worth of KV data, configurable via the `tokens_per_block` parameter.
- **Usage:** Blocks are allocated, reused, or evicted as sequences are processed.

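As a rough illustration of this granularity, the number of blocks a sequence occupies is a ceiling division of its token count by the block size. The function name and the value 64 below are illustrative, not taken from the codebase:

```python
import math

def blocks_needed(num_tokens: int, tokens_per_block: int = 64) -> int:
    """Number of KV cache blocks needed to hold a sequence's KV data.

    tokens_per_block is a configuration parameter; 64 is only an
    illustrative default, not the library's.
    """
    return math.ceil(num_tokens / tokens_per_block)

# A 1000-token sequence with 64 tokens per block occupies 16 blocks;
# the last block is only partially filled (1000 - 15 * 64 = 40 tokens).
print(blocks_needed(1000, 64))
```

Note that the final block of a sequence is usually partially filled, which is one reason reuse works at block rather than sequence granularity.
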
### **Page**

- **Definition:** In this codebase, "page" is often used interchangeably with "block" (as in "paged KV cache"). Strictly speaking, a page could refer to a hardware-level memory page, while a block is a logical unit of the cache.
- **In Practice:** The code uses "block" as the main unit; "page" is not a distinct class or struct.
### **Pool**

- **Definition:** A pool is a contiguous memory buffer (or set of buffers) that holds the actual KV data for one or more layers.
- **Types:** There are primary pools (fast GPU memory) and secondary pools (slower memory, e.g., CPU or offload memory).
- **Organization:** Each pool can serve multiple layers that share the same KV head configuration. Pools are managed by `KVCacheBlockPool` and tracked in vectors in `WindowBlockManager`.
- **Block ↔ Pool:** Each block is an index into a pool; the pool provides the actual storage, while the block is the metadata handle.

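The block-as-index-into-pool relationship can be sketched as follows. `PoolLayout` and its fields are hypothetical names for illustration (not the actual TensorRT-LLM structs); the factor of 2 accounts for the separate K and V tensors:

```python
from dataclasses import dataclass

@dataclass
class PoolLayout:
    """Hypothetical layout of one KV cache pool (illustrative only)."""
    num_kv_heads: int      # KV heads per layer for layers served by this pool
    head_dim: int          # size of each head's K or V vector
    tokens_per_block: int  # tokens' worth of KV data per block

    @property
    def block_size_elems(self) -> int:
        # 2x for the K and V tensors of each token.
        return 2 * self.num_kv_heads * self.head_dim * self.tokens_per_block

    def block_offset(self, block_id: int) -> int:
        # A block is just an index; the pool turns it into a storage offset.
        return block_id * self.block_size_elems

layout = PoolLayout(num_kv_heads=8, head_dim=128, tokens_per_block=64)
print(layout.block_offset(3))  # element offset of block 3 in the pool
```

Because the offset is a pure function of the block index, swapping a block between primary and secondary pools only requires updating the metadata handle, not relocating neighbors.
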
### **WindowBlockManager/BlockManager**

TRT-LLM supports two advanced features related to KV cache management:

1. **Variable Group-Query Attention (VGQA)** - i.e., a different `num_kv_heads` value for different layers.
2. **Variable Sliding Window Attention (VSWA)** - i.e., a different `attention_window_size` value for different layers.
To support both of these features, pool management works as described below.

However, in the simple, *most common* case, where

1. [MHA/MQA/non-variable GQA](gpt-attention.md#multi-head-multi-query-and-group-query-attention) is used, i.e., the same `num_kv_heads` value for all layers, and
2. global attention/[SWA](gpt-attention.md#sliding-window-attention-cyclic-rolling-buffer-kv-cache) is used, i.e., the same `attention_window_size` value for all layers,

only a *single* pool is created within the structure described below.
#### KV Cache Pool Management

- **WindowBlockManager:** Manages blocks and pools for a specific attention window size. Within a `WindowBlockManager`, there can be multiple pools, each corresponding to a unique number of KV heads, in order to support VGQA.
- **BlockManager:** Manages all `WindowBlockManager` instances, one per unique window size.

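This two-level grouping can be sketched as follows. `plan_pools` is a hypothetical helper, not a TensorRT-LLM function; it only illustrates how layers map to `WindowBlockManager` instances (outer keys, one per window size) and pools (inner keys, one per `num_kv_heads` value):

```python
from collections import defaultdict

def plan_pools(layer_window_sizes, layer_num_kv_heads):
    """Sketch of the BlockManager -> WindowBlockManager -> pool grouping.

    Layers are grouped first by attention window size, then within each
    group by num_kv_heads. Illustrative only.
    """
    managers = defaultdict(lambda: defaultdict(list))
    for layer, (window, heads) in enumerate(
            zip(layer_window_sizes, layer_num_kv_heads)):
        managers[window][heads].append(layer)
    return {w: dict(pools) for w, pools in managers.items()}

# Uniform model: one window size, one num_kv_heads -> a single pool.
print(plan_pools([4096] * 4, [8] * 4))
# {4096: {8: [0, 1, 2, 3]}}

# VSWA + VGQA: two window sizes, and two head counts within one window.
print(plan_pools([512, 4096, 512, 4096], [8, 8, 4, 8]))
# {512: {8: [0], 4: [2]}, 4096: {8: [1, 3]}}
```

The first call shows the common case from the previous section: with uniform `num_kv_heads` and `attention_window_size`, the whole hierarchy collapses to a single pool.
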
**Hierarchy Summary:**

- **Pool** (memory buffer for KV data)
  - Contains many blocks.
- **Blocks** (metadata for a chunk of the pool; each block = `tokens_per_block` tokens)
  - Blocks can optionally be swapped between primary and secondary pools.
- **BlockManager/WindowBlockManager:** Manage pools and blocks, handling allocation, reuse, and eviction.

---
## Events in `KVCacheEventManager`

The `KVCacheEventManager` is responsible for tracking and reporting significant changes in the state of the KV cache. Events can be consumed for logging, debugging, or external monitoring.
### **Types of Events**

- **Created Event:** When pools or blocks are created/allocated.
- **Updated Event:** When a block's state changes (e.g., it is moved between primary/secondary memory, or its priority is updated).
- **Removed Event:** When a block is removed from the cache (evicted or released).
- **Stored Event:** When blocks are stored for potential reuse (e.g., after a sequence finishes and its blocks become reusable).

### **What Triggers an Event?**

- **Allocation/Deallocation:** Creating or freeing memory pools or blocks.
- **Eviction/Reuse:** When a block is evicted, reused, or its priority changes.
- **Block Movement:** When a block is moved between memory levels (primary ↔ secondary).
- **Block Storage:** When blocks are stored for future reuse (e.g., after a sequence completes).

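The trigger points above can be sketched as a minimal event recorder; the method names are hypothetical and only mirror the bullets, not the real `KVCacheEventManager` API:

```python
class EventRecorder:
    """Minimal sketch of event triggering: each lifecycle transition
    appends an event record for later inspection. Illustrative only."""
    def __init__(self):
        self.events = []

    def _emit(self, kind, **info):
        self.events.append({"kind": kind, **info})

    # Trigger points corresponding to the bullets above:
    def on_allocate(self, block_id):
        self._emit("created", block=block_id)

    def on_swap(self, block_id, src, dst):
        self._emit("updated", block=block_id, move=f"{src}->{dst}")

    def on_evict(self, block_id):
        self._emit("removed", block=block_id)

    def on_store(self, block_id):
        self._emit("stored", block=block_id)

rec = EventRecorder()
rec.on_allocate(0)
rec.on_swap(0, "primary", "secondary")
rec.on_store(0)
print([e["kind"] for e in rec.events])  # ['created', 'updated', 'stored']
```
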
**In summary:**

An "event" is any significant change in the lifecycle or state of a KV cache block or pool, tracked for monitoring, debugging, or optimization purposes.

---