* refactor: Update ExecutorConfig to use AdditionalModelOutput type
- Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output.
- Updated related serialization and deserialization methods to accommodate the new type.
- Adjusted tests to reflect the changes in the output handling structure.
This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Remove equality operator from TrtGptModelOptionalParams
- Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class.
- Updated the pybind11 bindings to remove the exposure of the equality operator to Python.
This change streamlines the class definition and reduces unnecessary complexity in the bindings.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput
- Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag.
- Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism.
- Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality.
This refactor improves the flexibility and clarity of output management in the batch processing workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce findOutputTensor utility function for output tensor retrieval
- Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity.
- Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity.
- Enhanced error checking for additional context and generation output tensors.
This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Check final indices of additional output tensors and update tests
- Added checks to verify the final indices of additional output tensors for context and generation outputs.
- Updated unit tests to verify the changes.
- Add lastTokenIds input tensor to test engines.
- Logits output depends on gatherContextLogits parameter.
- Removed gatherContextOutputs parameter from the validate method in LlmRequest.
- Context outputs do not depend on computeContextLogits parameter.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Check final indices of additional output tensors and update tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! refactor: Remove equality operator from TrtGptModelOptionalParams
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* docs: Update executor.md
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Clean up includes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Instead of allocating UserBuffers at beginning of runtime, UB buffers
are now managed with global allocator. The allocator will dynamically
assign free UB buffer or allocate new buffer for torch tensor. It makes
userbuffers easier to use.
* In common usecase, the Userbuffers will be allocated correctly during
warm up stage. There is no dynamic allocation during inference.
* UB fusion pattern is rewroten using the new UB Allocator. It contains
following passes:
1. Fuse Quant with allreduce, replace with UB impl, and insert a
copy_to_userbuffers. Currently the normal allreduce still does not
support FP8 quant. So this need to be done in UB pass
2. Convert all supported allreduce with UB and insert copy_to_userbuffers.
3. Fuse op before ar with the copy_to_userbuffers. So the op directly
writes to the userbuffer
4. Remove userbuffers finalize if the output is connect to another UB
allreduce.
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
- Updated the first dimension of additional output tensors to match mMaxNewTokens.
- Copy output of last context token to generation outputs.
- Adjusted the expected output size calculations in unit tests to reflect the correct maximum output length.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update gatherTree function to accept CUDA stream parameter
This commit modifies the gatherTree function signature to include a runtime::CudaStream parameter, enhancing flexibility in stream management. Additionally, it removes unnecessary buffer manager parameters and stream handling from the function, streamlining the code. The finalize method in GptDecoderBatched is also updated to reflect these changes, improving clarity and maintainability in the decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update GptDecoderBatched finalize
This commit refactors the GptDecoderBatched class to improve method signatures and reduce code complexity:
- Modified finalize method to accept DecoderState as a parameter
- Updated method signatures to work with the new DecoderState approach
- Improved code organization and readability
The changes continue the ongoing refactoring to centralize decoder state management and simplify the decoder implementation.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>