TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Dom Brown	dbd9a83b0d	feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582 ) * feat: Integrate GPUDirect Storage (GDS) into Executor API Squash of several dev commits Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-18 15:59:21 +08:00
Netanel Haber	3c52ac098f	feat: allocate minimal blocks per window size (#3028 ) * implement variable window attention by breaking the block manager into window block managers per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * revert isCyclic to be true if the min attention window is reached, not per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add explanatory comment to mCyclicThreshold Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * load correct gemma config Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * if TYPE_CHECKING Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * pass dtype as well Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * test_gemma variable sliding window attention Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove \|\| mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * turn off request delaying for MaxUtil Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * make comments better Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * windowSizesTotalSum using std::accumulate Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comments Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove assert that kills disagg tests, since it isn't necessary Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. Main is correct Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add Gemma3 to SUPPORTED_HF_ARCHITECTURES Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * support Gemma3 Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix kvfactor field for deepseek Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix gemma-3 entries in testlist to include vswa Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * only quantize gemma2 VSWA Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> remove misleading comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix: disable KV cache reuse if using attention sink (#3021) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-17 16:04:57 +08:00
Robin Kobus	b7a38feb14	chore: Clean up cpp runtime (#3537 ) * add space in test output Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * perf: reduce executor lock scope Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move TokenRangeRetentionConfig implementation to cpp file Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Improve finished steps handling for external draft tokens - Fixed a bug where the whole finished steps tensor was being zeroes instead of the slices. - Replaced the creation of a temporary tensor for finished steps with a direct slice from the input tensor, improving efficiency and readability. - Updated the tensor management logic to streamline the process of setting zero values for finished steps during batch processing. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Clean up includes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-15 16:06:14 +08:00
Robin Kobus	f58d4698c8	chore: Clean up cpp runtime (#3505 ) * chore: Remove unused tensors from DecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Remove unused argument from readme Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: remove unused tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove unnecessary newOutputTokens Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove unnecessary event in getDecoderSlotHostOutputs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-14 18:00:03 +08:00
Robin Kobus	2ab71f9a80	refactor: decoder buffers (#3307 ) * refactor: remove cumLogProbs and logProbs from DecoderBuffers - Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management. - Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities. These modifications enhance code clarity and maintainability by reducing redundancy in buffer management. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched - Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping. - Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width. These changes enhance clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: integrate DecoderState for sequence length management in decoding process - Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose. - Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management. - Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow. refactor: update SlotDecoderBuffers to manage sequence lengths directly - Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths. - Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy. - Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process. These changes improve maintainability and streamline the decoding workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Delegate to asyncSend method in SlotDecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-12 11:41:24 +02:00
Robin Kobus	1bd84c6d8c	feat: Allow individual gatherContext for each additional output (#3374 ) * refactor: Update ExecutorConfig to use AdditionalModelOutput type - Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output. - Updated related serialization and deserialization methods to accommodate the new type. - Adjusted tests to reflect the changes in the output handling structure. This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove equality operator from TrtGptModelOptionalParams - Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class. - Updated the pybind11 bindings to remove the exposure of the equality operator to Python. This change streamlines the class definition and reduces unnecessary complexity in the bindings. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput - Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag. - Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism. - Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality. This refactor improves the flexibility and clarity of output management in the batch processing workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Introduce findOutputTensor utility function for output tensor retrieval - Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity. - Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity. - Enhanced error checking for additional context and generation output tensors. This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Check final indices of additional output tensors and update tests - Added checks to verify the final indices of additional output tensors for context and generation outputs. - Updated unit tests to verify the changes. - Add lastTokenIds input tensor to test engines. - Logits output depends on gatherContextLogits parameter. - Removed gatherContextOutputs parameter from the validate method in LlmRequest. - Context outputs do not depend on computeContextLogits parameter. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Check final indices of additional output tensors and update tests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Remove equality operator from TrtGptModelOptionalParams Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * docs: Update executor.md Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Clean up includes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-12 17:00:36 +08:00
wili	54ad95eaa8	Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338 ) * feat/Variable-Beam-Width-Search-Part3, v1.0 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat/Variable-Beam-Width-Search-Part3, v1.1 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat/Variable-Beam-Width-Search-Part3, v1.2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> --------- Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>	2025-04-08 23:51:27 +08:00
tburt-nv	7a659885e3	chore: remove usernames from comments (#3291 ) Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-05 13:44:28 +08:00
Robin Kobus	77724b0fcb	Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183 ) (#3195 ) * Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183) This reverts commit `75495730bc`. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-04 15:56:28 +02:00
Robin Kobus	d7386d14a8	refactor: Simplify disableLookahead and improve numDecodingEngineTokens handling (#3103 ) * refactor: Simplifiy disableLookahead method Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * Update DecoderBuffers comments Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move numDecodingEngineTokens to DecoderState This commit introduces new methods in the DecoderState class to manage the number of tokens for each request in a batch. The following changes were made: - Added `getNumDecodingEngineTokens()` to retrieve the number of tokens for all requests. - Added `getNumDecodingEngineTokens(SizeType32 batchIdx)` to get the token count for a specific request. - Added `setNumDecodingEngineTokens(SizeType32 batchIdx, SizeType32 numTokens)` to set the token count for a specific request. - Updated the setup method to initialize the token count vector based on the maximum batch size. - Refactored the `CreateNewDecoderRequests` class to utilize the new token management methods, improving clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Improve shape variables in DecoderState Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-01 18:47:31 +08:00
QI JUN	75495730bc	Revert "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183 ) This reverts commit `3ee4332fb1`. Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-01 12:49:27 +08:00
Robin Kobus	3ee4332fb1	refactor: Replace DecoderFinishedEvent with CudaEvent in decoder classes (#3078 ) - Updated the `forwardAsync` method in `GptDecoderBatched` and `iGptDecoderBatched` to return `CudaEvent` instead of `DecoderFinishedEventPtr`, simplifying event handling. - Removed the `DecoderFinishedEvent` class and its associated usage across various files, streamlining the codebase. - Adjusted related methods and Python bindings to accommodate the new event structure, ensuring compatibility and maintaining functionality. These changes enhance the clarity and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-28 14:50:52 +08:00
Robin Kobus	45134d7095	refactor: Improve decoder finalize function (#3077 ) * refactor: Update gatherTree function to accept CUDA stream parameter This commit modifies the gatherTree function signature to include a runtime::CudaStream parameter, enhancing flexibility in stream management. Additionally, it removes unnecessary buffer manager parameters and stream handling from the function, streamlining the code. The finalize method in GptDecoderBatched is also updated to reflect these changes, improving clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update GptDecoderBatched finalize This commit refactors the GptDecoderBatched class to improve method signatures and reduce code complexity: - Modified finalize method to accept DecoderState as a parameter - Updated method signatures to work with the new DecoderState approach - Improved code organization and readability The changes continue the ongoing refactoring to centralize decoder state management and simplify the decoder implementation. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-28 14:33:59 +08:00
Netanel Haber	da0b0e0ee3	fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size (#2983 ) * fix variable window size reuse - disable when min attention window starts sliding, not max * isPreCyclic -> isCyclic, and invert logic, for clarity * getDecoderState() Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-03-24 22:49:52 +08:00
Kaiyu Xie	2631f21089	Update (#2978 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-03-23 16:39:35 +08:00
Kaiyu Xie	9b931c0f63	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00

16 Commits