Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
* refactor: Copy sequence lengths once in decoder setup
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update DecoderInputBuffers to remove duplicated buffers
- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move getEmbeddingBias to anonymous namespace
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Filter context requests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: GenerateRequestOptions using more fine-grained functions
- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update TRTLLMDecoder
- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove const where we modify input buffers
- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Copy sequence lengths once in decoder setup
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unitTest and modify mm_embeding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for promt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* Use updateDecoderBuffers in python decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix synchronize in trtllm decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Enable by default.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use guided_decoder to setup seqslots and free them.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use always decode_async and update_requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update decoder buffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix speculative decoding tests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Send new_tensors_host instead of assuming dict.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make default False in enable_trtllm_decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Partially fix mtp, partially fix py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update request states before sending disagg ctx cache.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix disagg test for torch decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add disagg serving case to guided decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Get overlap scheduling to work.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update cutlass to main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update after rebasing.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update to use decode async and update requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Properly pass information to update_requests
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make disaggregated serving a step closer to working.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase and format.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Copy new device tokens more pythonic.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Restore MTP add dummy reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add ordereddict import to py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added seq slot manager. Add test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use transmission for single tensor except when list of tensors is received.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add TRTLLMDecoder allocation to estimate max kv cache tokens.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add stream synchronization
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Format
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add decoder creation to estimate max kv.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update submodule UCXX inline with main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend
- Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
- Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.
These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend
- Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
- Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.
These changes enhance the structure and readability of the batch manager's decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Log DecodingMode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce DecoderOutputBuffers and update related classes
- Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose.
- Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic.
- Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
- Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.
These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update MPI communicator handling
- Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
- Updated related methods and constructors to reflect the new parameter type, enhancing maintainability.
- Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.
These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Replace shared_ptr with unique_ptr for buffer management
- Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
- This change enhances memory management and ownership semantics, reducing overhead and improving performance.
These modifications contribute to a cleaner and more efficient architecture in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: Fixing issue with first gen token being returned twice with streaming
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing not_expectring_strings in test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* refactor: remove cumLogProbs and logProbs from DecoderBuffers
- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.
These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched
- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.
These changes enhance clarity and maintainability in the decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: integrate DecoderState for sequence length management in decoding process
- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.
refactor: update SlotDecoderBuffers to manage sequence lengths directly
- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.
These changes improve maintainability and streamline the decoding workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Delegate to asyncSend method in SlotDecoderBuffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update ExecutorConfig to use AdditionalModelOutput type
- Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output.
- Updated related serialization and deserialization methods to accommodate the new type.
- Adjusted tests to reflect the changes in the output handling structure.
This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove equality operator from TrtGptModelOptionalParams
- Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class.
- Updated the pybind11 bindings to remove the exposure of the equality operator to Python.
This change streamlines the class definition and reduces unnecessary complexity in the bindings.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput
- Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag.
- Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism.
- Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality.
This refactor improves the flexibility and clarity of output management in the batch processing workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce findOutputTensor utility function for output tensor retrieval
- Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity.
- Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity.
- Enhanced error checking for additional context and generation output tensors.
This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Check final indices of additional output tensors and update tests
- Added checks to verify the final indices of additional output tensors for context and generation outputs.
- Updated unit tests to verify the changes.
- Add lastTokenIds input tensor to test engines.
- Logits output depends on gatherContextLogits parameter.
- Removed gatherContextOutputs parameter from the validate method in LlmRequest.
- Context outputs do not depend on computeContextLogits parameter.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Check final indices of additional output tensors and update tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Remove equality operator from TrtGptModelOptionalParams
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Update executor.md
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Clean up includes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>