(kv-cache-management)=

# KV Cache Management: Pools, Blocks, and Events

This document provides an overview of the internal hierarchy and event system for paged KV cache management, as implemented in the TensorRT-LLM codebase.

For more information on KV cache reuse, see [KV cache reuse](kv-cache-reuse.md).

---
## Hierarchy: Pool, Block, and Page
### **Block**

- **Definition:** The smallest unit of KV cache allocation. A `KVCacheBlock` holds metadata (not the actual data) for a chunk of KV cache.
- **Purpose:** Each block holds a fixed number of tokens' worth of KV data, configurable via the `tokens_per_block` parameter.
- **Usage:** Blocks are allocated, reused, or evicted as sequences are processed.

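As a rough illustration of this granularity, the number of blocks a sequence occupies is a ceiling division of its token count by the block size. The function name and the value 64 below are illustrative, not taken from the codebase:

```python
import math

def blocks_needed(num_tokens: int, tokens_per_block: int = 64) -> int:
    """Number of KV cache blocks needed to hold a sequence's KV data.

    tokens_per_block is a configuration parameter; 64 is only an
    illustrative default, not the library's.
    """
    return math.ceil(num_tokens / tokens_per_block)

# A 1000-token sequence with 64 tokens per block occupies 16 blocks;
# the last block is only partially filled (1000 - 15 * 64 = 40 tokens).
print(blocks_needed(1000, 64))
```

Note that the final block of a sequence is usually partially filled, which is one reason reuse works at block rather than sequence granularity.
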
### **Page**

- **Definition:** In this codebase, "page" is often used interchangeably with "block" (as in "paged KV cache"). Strictly speaking, a page could refer to a hardware-level memory page, while a block is a logical unit of the cache.
- **In Practice:** The code uses "block" as the main unit; "page" is not a distinct class or struct.
### **Pool**

- **Definition:** A pool is a contiguous memory buffer (or set of buffers) that holds the actual KV data for one or more layers.
- **Types:** There are primary pools (fast GPU memory) and secondary pools (slower memory, e.g., CPU or offload memory).
- **Organization:** Each pool can serve multiple layers that share the same KV head configuration. Pools are managed by `KVCacheBlockPool` and tracked in vectors in `WindowBlockManager`.
- **Block ↔ Pool:** Each block is an index into a pool; the pool provides the actual storage, while the block is the metadata handle.

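The block-as-index-into-pool relationship can be sketched as follows. `PoolLayout` and its fields are hypothetical names for illustration (not the actual TensorRT-LLM structs); the factor of 2 accounts for the separate K and V tensors:

```python
from dataclasses import dataclass

@dataclass
class PoolLayout:
    """Hypothetical layout of one KV cache pool (illustrative only)."""
    num_kv_heads: int      # KV heads per layer for layers served by this pool
    head_dim: int          # size of each head's K or V vector
    tokens_per_block: int  # tokens' worth of KV data per block

    @property
    def block_size_elems(self) -> int:
        # 2x for the K and V tensors of each token.
        return 2 * self.num_kv_heads * self.head_dim * self.tokens_per_block

    def block_offset(self, block_id: int) -> int:
        # A block is just an index; the pool turns it into a storage offset.
        return block_id * self.block_size_elems

layout = PoolLayout(num_kv_heads=8, head_dim=128, tokens_per_block=64)
print(layout.block_offset(3))  # element offset of block 3 in the pool
```

Because the offset is a pure function of the block index, swapping a block between primary and secondary pools only requires updating the metadata handle, not relocating neighbors.
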
### **WindowBlockManager/BlockManager**

TRT-LLM supports two advanced features related to KV cache management:

1. **Variable Group-Query Attention (VGQA)** - i.e., a different `num_kv_heads` value for different layers.
2. **Variable Sliding Window Attention (VSWA)** - i.e., a different `attention_window_size` value for different layers.
To support both of these features, pool management works as described below.

However, in the simple, *most common* case, where

1. [MHA/MQA/non-variable GQA](gpt-attention.md#multi-head-multi-query-and-group-query-attention) is used, i.e., the same `num_kv_heads` value for all layers, and
2. global attention/[SWA](gpt-attention.md#sliding-window-attention-cyclic-rolling-buffer-kv-cache) is used, i.e., the same `attention_window_size` value for all layers,

only a *single* pool is created within the structure described below.
#### KV Cache Pool Management

- **WindowBlockManager:** Manages blocks and pools for a specific attention window size. Within a `WindowBlockManager`, there can be multiple pools, each corresponding to a unique number of KV heads, in order to support VGQA.
- **BlockManager:** Manages all `WindowBlockManager` instances, one per unique window size.

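This two-level grouping can be sketched as follows. `plan_pools` is a hypothetical helper, not a TensorRT-LLM function; it only illustrates how layers map to `WindowBlockManager` instances (outer keys, one per window size) and pools (inner keys, one per `num_kv_heads` value):

```python
from collections import defaultdict

def plan_pools(layer_window_sizes, layer_num_kv_heads):
    """Sketch of the BlockManager -> WindowBlockManager -> pool grouping.

    Layers are grouped first by attention window size, then within each
    group by num_kv_heads. Illustrative only.
    """
    managers = defaultdict(lambda: defaultdict(list))
    for layer, (window, heads) in enumerate(
            zip(layer_window_sizes, layer_num_kv_heads)):
        managers[window][heads].append(layer)
    return {w: dict(pools) for w, pools in managers.items()}

# Uniform model: one window size, one num_kv_heads -> a single pool.
print(plan_pools([4096] * 4, [8] * 4))
# {4096: {8: [0, 1, 2, 3]}}

# VSWA + VGQA: two window sizes, and two head counts within one window.
print(plan_pools([512, 4096, 512, 4096], [8, 8, 4, 8]))
# {512: {8: [0], 4: [2]}, 4096: {8: [1, 3]}}
```

The first call shows the common case from the previous section: with uniform `num_kv_heads` and `attention_window_size`, the whole hierarchy collapses to a single pool.
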
**Hierarchy Summary:**

- **Pool** (memory buffer for KV data)
  - Contains many blocks.
- **Blocks** (metadata for a chunk of the pool; each block = `tokens_per_block` tokens)
  - Blocks can optionally be swapped between primary and secondary pools.
- **BlockManager/WindowBlockManager:** Manage pools and blocks, handling allocation, reuse, and eviction.

---
## Events in `KVCacheEventManager`

The `KVCacheEventManager` is responsible for tracking and reporting significant changes in the state of the KV cache. Events can be consumed for logging, debugging, or external monitoring.
### **Types of Events**

- **Created Event:** When pools or blocks are created/allocated.
- **Updated Event:** When a block's state changes (e.g., it is moved between primary/secondary memory, or its priority is updated).
- **Removed Event:** When a block is removed from the cache (evicted or released).
- **Stored Event:** When blocks are stored for potential reuse (e.g., after a sequence finishes and its blocks become reusable).

### **What Triggers an Event?**

- **Allocation/Deallocation:** Creating or freeing memory pools or blocks.
- **Eviction/Reuse:** When a block is evicted, reused, or its priority changes.
- **Block Movement:** When a block is moved between memory levels (primary ↔ secondary).
- **Block Storage:** When blocks are stored for future reuse (e.g., after a sequence completes).

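The trigger points above can be sketched as a minimal event recorder; the method names are hypothetical and only mirror the bullets, not the real `KVCacheEventManager` API:

```python
class EventRecorder:
    """Minimal sketch of event triggering: each lifecycle transition
    appends an event record for later inspection. Illustrative only."""
    def __init__(self):
        self.events = []

    def _emit(self, kind, **info):
        self.events.append({"kind": kind, **info})

    # Trigger points corresponding to the bullets above:
    def on_allocate(self, block_id):
        self._emit("created", block=block_id)

    def on_swap(self, block_id, src, dst):
        self._emit("updated", block=block_id, move=f"{src}->{dst}")

    def on_evict(self, block_id):
        self._emit("removed", block=block_id)

    def on_store(self, block_id):
        self._emit("stored", block=block_id)

rec = EventRecorder()
rec.on_allocate(0)
rec.on_swap(0, "primary", "secondary")
rec.on_store(0)
print([e["kind"] for e in rec.events])  # ['created', 'updated', 'stored']
```
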
**In summary:**

An "event" is any significant change in the lifecycle or state of a KV cache block or pool, tracked for monitoring, debugging, or optimization purposes.

---