Mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-14 06:27:45 +08:00)
For the VSWA scheme, we do not want `kv_cache_config.max_tokens` to control and cap the maximum memory of a block pool, because block pool sizes are not identical across different window sizes. This MR omits the effect of `kv_cache_config.max_tokens` in `kvCacheManager.cpp`, so block pool sizes are determined by the window-size-to-share ratio and the total GPU memory analyzed and fed to the KV cache manager. The cap is skipped only for the VSWA scheme; no extra test coverage was added.

Signed-off-by: eopXD <yuehtingc@nvidia.com>
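To illustrate the intent, here is a minimal, self-contained C++ sketch, not the actual TensorRT-LLM implementation; the names `computePoolSizes`, `PoolSizing`, and the share-ratio map are hypothetical. It shows the assumed logic: when more than one attention window size is present (VSWA), the user-provided `max_tokens` cap is skipped and each window's pool is sized from its share of the analyzed free GPU memory, while the single-window case still honors the cap.

```cpp
// Illustrative sketch only (assumed behavior, not TensorRT-LLM code):
// size each window's block pool from its memory share; skip the
// kv_cache_config.max_tokens cap when multiple window sizes exist (VSWA).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>
#include <vector>

// Hypothetical per-window sizing result.
struct PoolSizing
{
    uint64_t windowSize;
    uint64_t maxTokens; // tokens this window's block pool may hold
};

std::vector<PoolSizing> computePoolSizes(
    std::map<uint64_t, double> const& windowShareRatios, // windowSize -> memory share
    uint64_t freeGpuMemBytes, uint64_t bytesPerToken,
    std::optional<uint64_t> configMaxTokens)
{
    bool const isVswa = windowShareRatios.size() > 1; // more than one window size
    std::vector<PoolSizing> result;
    for (auto const& [windowSize, share] : windowShareRatios)
    {
        uint64_t tokens = static_cast<uint64_t>(share * freeGpuMemBytes) / bytesPerToken;
        // Single-window case: the configured max_tokens still caps the pool.
        // VSWA case: the cap is skipped, because one global token cap cannot
        // be split meaningfully across pools of different sizes.
        if (!isVswa && configMaxTokens.has_value())
        {
            tokens = std::min(tokens, *configMaxTokens);
        }
        result.push_back({windowSize, tokens});
    }
    return result;
}

int main()
{
    // Example: two window sizes sharing 8 GiB of free memory at a 1:3 ratio.
    std::map<uint64_t, double> shares{{1024, 0.25}, {8192, 0.75}};
    auto sizes = computePoolSizes(shares, 8ull << 30,
        /*bytesPerToken=*/2048, /*configMaxTokens=*/500000);
    for (auto const& s : sizes)
    {
        std::cout << "window " << s.windowSize << " -> max tokens " << s.maxTokens << "\n";
    }
    return 0;
}
```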
- batch_manager
- common
- cutlass_extensions/include/cutlass_extensions
- deep_ep
- deep_gemm
- executor
- executor_worker
- kernels
- layers
- nanobind
- plugins
- pybind
- runtime
- testing
- thop
- CMakeLists.txt