TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Robin Kobus 8340657ae4 refactor: Introduce DecoderOutputBuffers per batch (#3506)	2025-04-22 12:25:53 +08:00

* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend
  - Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
  - Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
  - Refactored related methods to align with the new class structure, improving maintainability and readability of the code.
  These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.

* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend
  - Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
  - Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
  - Refactored related methods to align with the new class structure, improving maintainability and readability of the code.
  These changes enhance the structure and readability of the batch manager's decoding process.

* chore: Log DecodingMode

* refactor: Introduce DecoderOutputBuffers and update related classes
  - Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose.
  - Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic.
  - Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
  - Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.
  These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.

* refactor: Update MPI communicator handling
  - Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
  - Updated related methods and constructors to reflect the new parameter type, enhancing maintainability.
  - Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.
  These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.

* refactor: Replace shared_ptr with unique_ptr for buffer management
  - Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
  - This change enhances memory management and ownership semantics, reducing overhead and improving performance.
  These modifications contribute to a cleaner and more efficient architecture in the batch manager.

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
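The ownership pattern described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the TensorRT-LLM code: `Comm`, `OutputBuffers`, `AsyncSend`, and `Model` are hypothetical stand-ins for `mpi::MpiComm`, `DecoderOutputBuffers`, `DecoderStepAsyncSend`, and `TrtGptModelInflightBatching`, and their members are invented for the example. It assumes the owner holds one buffer object per batch through `std::unique_ptr`, and that the sender borrows both the buffers and the communicator by const reference.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for mpi::MpiComm (assumption, not the real class).
struct Comm
{
    int rank{0};
};

// Hypothetical stand-in for DecoderOutputBuffers: plain data, no communication logic.
struct OutputBuffers
{
    explicit OutputBuffers(std::size_t maxTokens)
        : tokens(maxTokens, 0)
    {
    }

    std::vector<int> tokens;
};

// Hypothetical stand-in for DecoderStepAsyncSend: the communication side.
// It borrows buffers and communicator by const reference instead of
// taking shared_ptr parameters.
class AsyncSend
{
public:
    AsyncSend(OutputBuffers const& buffers, Comm const& comm)
        : mNumTokens{buffers.tokens.size()}
        , mRank{comm.rank}
    {
    }

    std::size_t numTokens() const { return mNumTokens; }
    int rank() const { return mRank; }

private:
    std::size_t mNumTokens;
    int mRank;
};

// Hypothetical owner: holds exactly one OutputBuffers per batch via unique_ptr,
// so ownership is exclusive and no reference counting is involved.
class Model
{
public:
    Model(std::size_t numBatches, std::size_t maxTokens)
    {
        for (std::size_t i = 0; i < numBatches; ++i)
        {
            mOutputBuffers.push_back(std::make_unique<OutputBuffers>(maxTokens));
        }
    }

    // at() throws std::out_of_range for an invalid batch index.
    AsyncSend send(std::size_t batchIdx, Comm const& comm) const
    {
        return AsyncSend{*mOutputBuffers.at(batchIdx), comm};
    }

    std::size_t numBuffers() const { return mOutputBuffers.size(); }

private:
    std::vector<std::unique_ptr<OutputBuffers>> mOutputBuffers;
};

// Usage sketch: two batches, eight token slots each.
inline void example()
{
    Model model{2, 8};
    Comm comm{1};
    AsyncSend const sender = model.send(0, comm);
    assert(sender.numTokens() == 8);
    assert(sender.rank() == 1);
}
```

With `std::unique_ptr`, the owning class is move-only and the buffers' lifetime is tied to it, which is the usual reason to prefer it over `std::shared_ptr` when ownership is not actually shared.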
utils	feat: Allow individual gatherContext for each additional output (#3374)	2025-04-12 17:00:36 +08:00
allocateKvCache.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
assignReqSeqSlots.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
cacheFormatter.cpp	bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)	2025-04-21 15:16:55 +08:00
cacheFormatter.h	feat: Add BW measurement (#3070)	2025-03-28 10:53:00 +08:00
cacheTransceiver.cpp	bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)	2025-04-21 15:16:55 +08:00
capacityScheduler.cpp	feat: allocate minimal blocks per window size (#3028)	2025-04-17 16:04:57 +08:00
CMakeLists.txt	chore: Ucx ip port remove mpi depend (#3101)	2025-04-02 09:42:29 +08:00
contextProgress.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
createNewDecoderRequests.cpp	Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338)	2025-04-08 23:51:27 +08:00
dataTransceiver.cpp	chore: disable some env for disagg defaultly (#3415)	2025-04-14 10:08:10 +08:00
dataTransceiver.h	feat: Add BW measurement (#3070)	2025-03-28 10:53:00 +08:00
dataTransceiverImpl.cpp	feat: allocate minimal blocks per window size (#3028)	2025-04-17 16:04:57 +08:00
dataTransceiverImpl.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
decoderBuffers.cpp	refactor: Introduce DecoderOutputBuffers per batch (#3506)	2025-04-22 12:25:53 +08:00
encoderBuffers.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
encoderBuffers.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
evictionPolicy.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
generateRequestOptions.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
guidedDecoder.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
handleContextLogits.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
handleGenerationLogits.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
kvCacheEventManager.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
kvCacheManager.cpp	bind block key and hasher (#3712)	2025-04-21 18:50:57 +08:00
kvCacheTransferManager.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
llmRequest.cpp	feat: Allow individual gatherContext for each additional output (#3374)	2025-04-12 17:00:36 +08:00
logitsPostProcessor.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
loraBuffers.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
loraBuffers.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
makeDecodingBatchInputOutput.cpp	refactor: batch slot management in decoder classes (#3300)	2025-04-13 05:05:13 +08:00
medusaBuffers.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
microBatchScheduler.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
mlaCacheFormatter.cpp	bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)	2025-04-21 15:16:55 +08:00
mlaCacheFormatter.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
pauseRequests.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
peftCacheManager.cpp	feat: Support PeftCacheManager in Torch (#3186)	2025-04-04 12:38:08 +08:00
promptTuningBuffers.cpp	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)	2025-04-21 14:31:01 +08:00
rnnStateBuffers.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
rnnStateBuffers.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
rnnStateManager.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
runtimeBuffers.cpp	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)	2025-04-21 14:31:01 +08:00
sequenceSlotManager.cpp	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
transformerBuffers.cpp	Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338)	2025-04-08 23:51:27 +08:00
trtEncoderModel.cpp	feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582)	2025-04-18 15:59:21 +08:00
trtEncoderModel.h	chore: Clean up cpp runtime (#3537)	2025-04-15 16:06:14 +08:00
trtGptModel.h	fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size (#2983)	2025-03-24 22:49:52 +08:00
trtGptModelFactory.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00
trtGptModelInflightBatching.cpp	refactor: Introduce DecoderOutputBuffers per batch (#3506)	2025-04-22 12:25:53 +08:00
trtGptModelInflightBatching.h	refactor: Introduce DecoderOutputBuffers per batch (#3506)	2025-04-22 12:25:53 +08:00
trtGptModelV1.cpp	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)	2025-04-21 14:31:01 +08:00
trtGptModelV1.h	Update TensorRT-LLM (#2873)	2025-03-11 21:13:42 +08:00