TensorRT-LLMs/cpp/tensorrt_llm/pybind/batch_manager
Robin Kobus 7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Consolidate request generation in CreateNewDecoderRequests

- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Simplify request handling in CreateNewDecoderRequests

- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests

- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests

- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters

- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests

- Updated sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
algorithms.cpp refactor: CreateNewDecoderRequests (#4452) 2025-05-23 22:54:37 +08:00
algorithms.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
bindings.cpp [feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243) 2025-05-22 22:01:06 -04:00
bindings.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
buffers.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
buffers.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
cacheTransceiver.cpp Agent interface impl for NIXL (#4125) 2025-05-22 09:09:41 +08:00
cacheTransceiver.h Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
kvCacheManager.cpp cacheTransceiver buffer manager (#3798) 2025-04-27 11:48:15 +08:00
kvCacheManager.h Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
llmRequest.cpp feat: Add multimodal embedding field in LlmRequest (#3855) 2025-05-01 12:23:30 +08:00
llmRequest.h feat: Add multimodal embedding field in LlmRequest (#3855) 2025-05-01 12:23:30 +08:00