Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
* chore: Remove GptSession/V1 from TRT workflow
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove stateful decoders
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession buffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession utils
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession kernels
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove V1 GPT models from tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSessionBenchmark from scripts and docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSession IO classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from test lists
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove useless encoder test
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove mActualBatchSize from DecoderState
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove static batching from ExecutorTest
- Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter.
- Adjusted related test functions to reflect the changes in parameter lists.
- Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move executor recv functions into classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance MPI logging and error handling
- Updated MPI logging to include destination and tag information for better traceability during send and receive operations.
- Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests.
- Improved code structure for clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce MpiTag enumeration and update MPI function signatures
- Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability.
- Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags.
- Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity.
- Removed redundant MPI tag constants from several classes, streamlining the code.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Rename tags for consistency
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Move all casters to customCasters.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use customCasters in all bindings.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added customCasters to userbuffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use updateDecoderBuffers in python decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix synchronize in trtllm decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Enable by default.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use guided_decoder to setup seqslots and free them.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use always decode_async and update_requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update decoder buffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix speculative decoding tests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Send new_tensors_host instead of assuming dict.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make default False in enable_trtllm_decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Partially fix mtp, partially fix py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update request states before sending disagg ctx cache.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix disagg test for torch decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add disagg serving case to guided decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Get overlap scheduling to work.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update cutlass to main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update after rebasing.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update to use decode async and update requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Properly pass information to update_requests
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make disaggregated serving a step closer to working.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase and format.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Copy new device tokens more pythonic.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Restore MTP add dummy reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add ordereddict import to py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added seq slot manager. Add test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use transmission for single tensor except when list of tensors is received.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add TRTLLMDecoder allocation to estimate max kv cache tokens.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add stream synchronization
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Format
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add decoder creation to estimate max kv.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update submodule UCXX inline with main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix hang bug when KV cache is low
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Review comments
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Fix attentiondp typo
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add CI test for this case
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* fix: Fix the insertion order for responder futures
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* fix: Fix disagg CPP
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* feat: Integrate GPUDirect Storage (GDS) into Executor API
Squash of several dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* add space in test output
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* perf: reduce executor lock scope
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move TokenRangeRetentionConfig implementation to cpp file
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: Improve finished steps handling for external draft tokens
- Fixed a bug where the whole finished steps tensor was being zeroes instead of the slices.
- Replaced the creation of a temporary tensor for finished steps with a direct slice from the input tensor, improving efficiency and readability.
- Updated the tensor management logic to streamline the process of setting zero values for finished steps during batch processing.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Clean up includes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Add numNodes to ParallelConfig
If not provided, attempt to find the number of nodes by
adding the number of local ranks 0
Update device IDs check accordingly
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
* Add ParallelConfig pickle test
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
---------
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
* refactor: batch slot management in decoder classes
- Changed `forwardBatchSlots` from a single `TensorPtr` to a `std::vector<TensorPtr>` in `decoderBuffers.h` and updated its initialization in `decoderBuffers.cpp`.
- Updated `batchSlots` in `iGptDecoderBatched.h` to a `std::vector<TensorPtr>` for better handling of batch sizes.
- Modified `mBatchSlotsDecoder` in `statefulGptDecoderBatched.h` to use a `std::vector<TensorPtr>` and adjusted its initialization in `statefulGptDecoderBatched.cpp`.
- Ensured proper reshaping of tensors in the setup methods to accommodate the new vector structure.
These changes enhance flexibility in managing tensor buffers across different batch sizes.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Setup batch slots outside of the decoder
- Refactored batch slot management to utilize `makeBatchSlots`, enhancing clarity and functionality in batch processing.
- Introduced `DecoderState` to `MakeDecodingBatchInputOutput` for improved state handling during decoding.
- Updated the `operator()` method to include `decoderState` as a parameter, facilitating better integration with the decoding process.
- Modified related tests to accommodate changes in batch slot handling and ensure proper functionality.
These updates improve the overall structure and efficiency of the decoding process in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance decoder input structure with maxDecodingEngineTokens
- Updated the `Input` class in `iGptDecoderBatched.h` to include a new parameter `maxDecodingEngineTokens` for better control over decoding limits.
- Modified the `MakeDecodingBatchInputOutput` algorithm to compute the maximum number of decoding tokens based on active slots.
- Adjusted the `GptDecoderBatched` class to utilize the new `maxDecodingEngineTokens` parameter, improving clarity in token management during decoding.
- Updated Python bindings to reflect changes in the `Input` class constructor.
- Enhanced tests to ensure proper handling of the new parameter.
These changes improve the flexibility and efficiency of the decoding process in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Streamline decoder input creation and batch slot management
- Introduced a new function `createDecoderInputs` to encapsulate the logic for creating decoder inputs, improving code organization.
- Updated the `operator()` method to utilize the new `createDecoderInputs` function, simplifying the decoding input setup process.
- Removed the `maxOfActiveSlots` template function to streamline the logic for determining the maximum number of active decoding engine tokens.
- Introduced a direct calculation of `maxActiveDecodingEngineTokens` within the `createDecoderInputs` function, enhancing clarity and reducing complexity.
These changes enhance the maintainability and readability of the decoding process in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update logits handling in decoder batch
- Modified the `decoder_batch::Input` to accept a vector of vectors for logits, enhancing flexibility in tensor management.
- Adjusted the `createDecoderInputs` function to accommodate the new logits structure, ensuring proper batch processing.
- Updated Python bindings to reflect changes in the `Input` class constructor, maintaining compatibility with existing interfaces.
- Refactored the `GptDecoderBatched` and `StatefulGptDecoderBatched` classes to utilize the updated logits structure, improving clarity in tensor slicing and batch size management.
- Enhanced tests to validate the new input structure and ensure correct functionality across various decoding scenarios.
These changes streamline the decoding process and improve the overall maintainability of the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Rename maxDecodingEngineTokens to maxDecoderSteps
- Updated the `Input` class in `iGptDecoderBatched.h` to rename `maxDecodingEngineTokens` to `maxDecoderSteps` for improved clarity.
- Adjusted the `createDecoderInputs` function to reflect the new naming, ensuring consistency in the decoding process.
- Modified the `GptDecoderBatched` class to utilize `maxDecoderSteps` in its logic, enhancing readability and maintainability.
- Updated Python bindings to expose the renamed parameter, maintaining compatibility with existing interfaces.
These changes enhance the clarity of the decoding parameters and improve the overall structure of the codebase.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: remove usage of `active` vector from prepareForward
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Removed the `active` vector from `decoder_batch::Input`
- Removed the `active` vector from the `Input` class constructor in `iGptDecoderBatched.h`, streamlining the input handling for decoding.
- Updated the `createDecoderInputs` function and related tests to reflect the changes in the `Input` class, ensuring compatibility and maintaining functionality.
- Adjusted Python bindings to accommodate the new constructor signature, enhancing clarity in the interface.
These changes improve the maintainability and readability of the decoding process in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: remove usage of `active` vector from gptDecoderBatchedTest
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Unify the creation of decoder batch inputs in algorithm and tests
- Added a new static method `createDecoderBatchInputs` to streamline the creation of decoder batch inputs, enhancing clarity and maintainability.
- Updated the implementation to utilize active slots directly, simplifying the logic for managing batch slots and logits.
- Refactored the `operator()` method to leverage the new input creation function, ensuring compatibility with existing decoding processes.
- Enhanced tests to validate the new input handling approach, ensuring correct functionality across various scenarios.
These changes improve the overall structure and readability of the decoding process in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: remove usage of active vector from createDecoderBatchInputs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update maxDecoderSteps calculation
- Replaced integer division with `common::ceilDiv` for calculating `maxDecoderSteps` and `numDecoderSteps`, ensuring correct handling of token counts.
These changes enhance the robustness of the decoding batch input creation process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: remove cumLogProbs and logProbs from DecoderBuffers
- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.
These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched
- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.
These changes enhance clarity and maintainability in the decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: integrate DecoderState for sequence length management in decoding process
- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.
refactor: update SlotDecoderBuffers to manage sequence lengths directly
- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.
These changes improve maintainability and streamline the decoding workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Delegate to asyncSend method in SlotDecoderBuffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183)
This reverts commit 75495730bc.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Simplifiy disableLookahead method
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Update DecoderBuffers comments
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move numDecodingEngineTokens to DecoderState
This commit introduces new methods in the DecoderState class to manage the number of tokens for each request in a batch. The following changes were made:
- Added `getNumDecodingEngineTokens()` to retrieve the number of tokens for all requests.
- Added `getNumDecodingEngineTokens(SizeType32 batchIdx)` to get the token count for a specific request.
- Added `setNumDecodingEngineTokens(SizeType32 batchIdx, SizeType32 numTokens)` to set the token count for a specific request.
- Updated the setup method to initialize the token count vector based on the maximum batch size.
- Refactored the `CreateNewDecoderRequests` class to utilize the new token management methods, improving clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Improve shape variables in DecoderState
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
- Updated the `forwardAsync` method in `GptDecoderBatched` and `iGptDecoderBatched` to return `CudaEvent` instead of `DecoderFinishedEventPtr`, simplifying event handling.
- Removed the `DecoderFinishedEvent` class and its associated usage across various files, streamlining the codebase.
- Adjusted related methods and Python bindings to accommodate the new event structure, ensuring compatibility and maintaining functionality.
These changes enhance the clarity and efficiency of the decoding process in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update gatherTree function to accept CUDA stream parameter
This commit modifies the gatherTree function signature to include a runtime::CudaStream parameter, enhancing flexibility in stream management. Additionally, it removes unnecessary buffer manager parameters and stream handling from the function, streamlining the code. The finalize method in GptDecoderBatched is also updated to reflect these changes, improving clarity and maintainability in the decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update GptDecoderBatched finalize
This commit refactors the GptDecoderBatched class to improve method signatures and reduce code complexity:
- Modified finalize method to accept DecoderState as a parameter
- Updated method signatures to work with the new DecoderState approach
- Improved code organization and readability
The changes continue the ongoing refactoring to centralize decoder state management and simplify the decoder implementation.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove ForwardType enum from GptDecoderBatched
- Remove ForwardType enum from GptDecoderBatched
- Simplify forwardDispatch and forwardDecoder methods
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove forwardDecoder method from GptDecoderBatched
- Eliminate the forwardDecoder method to streamline the decoding process.
- Update forwardDispatch to directly call forwardAsync when input batch size is greater than zero.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move event handling from forwardDispatch to forwardAsync
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>