Commit Graph

114 Commits

Author SHA1 Message Date
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager (#3798)
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix args

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00
Julien Debache
0c6c8eaffd
fix: 5197419 and removed unused runtime kernels (#3631)
- Removed kernel under test call, as it was not needed
- Removed kernel itself
- Removed kernel tests
- Removed other unused kernels and their tests
- Some static analysis clean up
2025-04-23 18:04:50 +02:00
Daniel Cámpora
1299f27c74
fix: Fix C++ decoder synchronization in PyTorch (#3106)
* Use updateDecoderBuffers in python decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix synchronize in trtllm decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Enable by default.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use guided_decoder to setup seqslots and free them.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use always decode_async and update_requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update decoder buffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix speculative decoding tests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Send new_tensors_host instead of assuming dict.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make default False in enable_trtllm_decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Partially fix mtp, partially fix py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update request states before sending disagg ctx cache.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix disagg test for torch decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add disagg serving case to guided decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Get overlap scheduling to work.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update cutlass to main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update after rebasing.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update to use decode async and update requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Properly pass information to update_requests

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make disaggregated serving a step closer to working.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase and format.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Copy new device tokens more pythonic.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Restore MTP add dummy reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add ordereddict import to py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added seq slot manager. Add test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use transmission for single tensor except when list of tensors is received.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add TRTLLMDecoder allocation to estimate max kv cache tokens.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add stream synchronization

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Format

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add decoder creation to estimate max kv.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update submodule UCXX inline with main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-04-23 23:55:27 +08:00
Robin Kobus
8340657ae4
refactor: Introduce DecoderOutputBuffers per batch (#3506)
* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend

- Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
- Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend

- Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
- Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes enhance the structure and readability of the batch manager's decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Log DecodingMode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce DecoderOutputBuffers and update related classes

- Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose.
- Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic.
- Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
- Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.

These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update MPI communicator handling

- Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
- Updated related methods and constructors to reflect the new parameter type, enhancing maintainability.
- Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.

These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Replace shared_ptr with unique_ptr for buffer management

- Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
- This change enhances memory management and ownership semantics, reducing overhead and improving performance.

These modifications contribute to a cleaner and more efficient architecture in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-22 12:25:53 +08:00
Zheng Duan
ae48abefc1
bind block key and hasher (#3712) 2025-04-21 18:50:57 +08:00
Iman Tabrizian
af04b6f6aa
bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)
* Fix hang bug when KV cache is low

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Review comments

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Fix attentiondp typo

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add CI test for this case

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* fix: Fix the insertion order for responder futures

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* fix: Fix disagg CPP

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-21 15:16:55 +08:00
katec846
eeb605abd6
feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380)
* Feat: Offload ptable to cpu if enable_chunk_context

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Feat: offload ptable to cpu for chunk context mode

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix and add comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update Readme for multimodal and add a new param mm_embedding_offloading

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* fix: Correct prompt table offloading condition in PromptTuningBuffers

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up the code

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Add commits to explain copy from cpu <-> gpu using pinned memory

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix namings based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix format based on precommit

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify --mm_embedding_offloading flag

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-21 14:31:01 +08:00
Dom Brown
dbd9a83b0d
feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582)
* feat: Integrate GPUDirect Storage (GDS) into Executor API

Squash of several dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-18 15:59:21 +08:00
Netanel Haber
3c52ac098f
feat: allocate minimal blocks per window size (#3028)
* implement variable window attention by breaking the block manager into window block managers per window size

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* revert isCyclic to be true if the min attention window is reached, not per window size

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* add explanatory comment to mCyclicThreshold

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* load correct gemma config

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* if TYPE_CHECKING

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* set temp_attention_window_inputs to None explicitly

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* set temp_attention_window_inputs to None explicitly

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* pass dtype as well

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* test_gemma variable sliding window attention

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers)

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove || mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* turn off request delaying for MaxUtil

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* make comments better

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* windowSizesTotalSum using std::accumulate

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix comments

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove assert that kills disagg tests, since it isn't necessary

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. Main is correct

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* add Gemma3 to SUPPORTED_HF_ARCHITECTURES

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* support Gemma3

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix kvfactor field for deepseek

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix comment

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix gemma-3 entries in testlist to include vswa

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* only quantize gemma2 VSWA

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

remove misleading comment

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix: disable KV cache reuse if using attention sink (#3021)

* fix: disable KV cache reuse if using attention sink

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: disable KV cache reuse if sink bubble

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* add comment

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-17 16:04:57 +08:00
Robin Kobus
fffb403125
fix: disable KV cache reuse if using attention sink (#3021)
* fix: disable KV cache reuse if using attention sink

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: disable KV cache reuse if sink bubble

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* add comment

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-16 03:07:32 +08:00
Robin Kobus
b7a38feb14
chore: Clean up cpp runtime (#3537)
* add space in test output

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* perf: reduce executor lock scope

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move TokenRangeRetentionConfig implementation to cpp file

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Improve finished steps handling for external draft tokens

- Fixed a bug where the whole finished steps tensor was being zeroes instead of the slices.
- Replaced the creation of a temporary tensor for finished steps with a direct slice from the input tensor, improving efficiency and readability.
- Updated the tensor management logic to streamline the process of setting zero values for finished steps during batch processing.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Clean up includes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-15 16:06:14 +08:00
Robin Kobus
f58d4698c8
chore: Clean up cpp runtime (#3505)
* chore: Remove unused tensors from DecoderBuffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Remove unused argument from readme

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: remove unused tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove unnecessary newOutputTokens

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove unnecessary event in getDecoderSlotHostOutputs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-14 18:00:03 +08:00
Chuang Zhu
75e13f4f88
chore: disable some env for disagg defaultly (#3415)
* disable some env for disagg defaultly

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* doc

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* remove

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-14 10:08:10 +08:00
Robin Kobus
ceec4924d9
refactor: batch slot management in decoder classes (#3300)
* refactor: batch slot management in decoder classes

- Changed `forwardBatchSlots` from a single `TensorPtr` to a `std::vector<TensorPtr>` in `decoderBuffers.h` and updated its initialization in `decoderBuffers.cpp`.
- Updated `batchSlots` in `iGptDecoderBatched.h` to a `std::vector<TensorPtr>` for better handling of batch sizes.
- Modified `mBatchSlotsDecoder` in `statefulGptDecoderBatched.h` to use a `std::vector<TensorPtr>` and adjusted its initialization in `statefulGptDecoderBatched.cpp`.
- Ensured proper reshaping of tensors in the setup methods to accommodate the new vector structure.

These changes enhance flexibility in managing tensor buffers across different batch sizes.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Setup batch slots outside of the decoder

- Refactored batch slot management to utilize `makeBatchSlots`, enhancing clarity and functionality in batch processing.
- Introduced `DecoderState` to `MakeDecodingBatchInputOutput` for improved state handling during decoding.
- Updated the `operator()` method to include `decoderState` as a parameter, facilitating better integration with the decoding process.
- Modified related tests to accommodate changes in batch slot handling and ensure proper functionality.

These updates improve the overall structure and efficiency of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance decoder input structure with maxDecodingEngineTokens

- Updated the `Input` class in `iGptDecoderBatched.h` to include a new parameter `maxDecodingEngineTokens` for better control over decoding limits.
- Modified the `MakeDecodingBatchInputOutput` algorithm to compute the maximum number of decoding tokens based on active slots.
- Adjusted the `GptDecoderBatched` class to utilize the new `maxDecodingEngineTokens` parameter, improving clarity in token management during decoding.
- Updated Python bindings to reflect changes in the `Input` class constructor.
- Enhanced tests to ensure proper handling of the new parameter.

These changes improve the flexibility and efficiency of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Streamline decoder input creation and batch slot management

- Introduced a new function `createDecoderInputs` to encapsulate the logic for creating decoder inputs, improving code organization.
- Updated the `operator()` method to utilize the new `createDecoderInputs` function, simplifying the decoding input setup process.
- Removed the `maxOfActiveSlots` template function to streamline the logic for determining the maximum number of active decoding engine tokens.
- Introduced a direct calculation of `maxActiveDecodingEngineTokens` within the `createDecoderInputs` function, enhancing clarity and reducing complexity.

These changes enhance the maintainability and readability of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update logits handling in decoder batch

- Modified the `decoder_batch::Input` to accept a vector of vectors for logits, enhancing flexibility in tensor management.
- Adjusted the `createDecoderInputs` function to accommodate the new logits structure, ensuring proper batch processing.
- Updated Python bindings to reflect changes in the `Input` class constructor, maintaining compatibility with existing interfaces.
- Refactored the `GptDecoderBatched` and `StatefulGptDecoderBatched` classes to utilize the updated logits structure, improving clarity in tensor slicing and batch size management.
- Enhanced tests to validate the new input structure and ensure correct functionality across various decoding scenarios.

These changes streamline the decoding process and improve the overall maintainability of the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Rename maxDecodingEngineTokens to maxDecoderSteps

- Updated the `Input` class in `iGptDecoderBatched.h` to rename `maxDecodingEngineTokens` to `maxDecoderSteps` for improved clarity.
- Adjusted the `createDecoderInputs` function to reflect the new naming, ensuring consistency in the decoding process.
- Modified the `GptDecoderBatched` class to utilize `maxDecoderSteps` in its logic, enhancing readability and maintainability.
- Updated Python bindings to expose the renamed parameter, maintaining compatibility with existing interfaces.

These changes enhance the clarity of the decoding parameters and improve the overall structure of the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: remove usage of `active` vector from prepareForward

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Removed the `active` vector from `decoder_batch::Input`

- Removed the `active` vector from the `Input` class constructor in `iGptDecoderBatched.h`, streamlining the input handling for decoding.
- Updated the `createDecoderInputs` function and related tests to reflect the changes in the `Input` class, ensuring compatibility and maintaining functionality.
- Adjusted Python bindings to accommodate the new constructor signature, enhancing clarity in the interface.

These changes improve the maintainability and readability of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: remove usage of `active` vector from gptDecoderBatchedTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Unify the creation of decoder batch inputs in algorithm and tests

- Added a new static method `createDecoderBatchInputs` to streamline the creation of decoder batch inputs, enhancing clarity and maintainability.
- Updated the implementation to utilize active slots directly, simplifying the logic for managing batch slots and logits.
- Refactored the `operator()` method to leverage the new input creation function, ensuring compatibility with existing decoding processes.
- Enhanced tests to validate the new input handling approach, ensuring correct functionality across various scenarios.

These changes improve the overall structure and readability of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: remove usage of active vector from createDecoderBatchInputs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update maxDecoderSteps calculation

- Replaced integer division with `common::ceilDiv` for calculating `maxDecoderSteps` and `numDecoderSteps`, ensuring correct handling of token counts.

These changes enhance the robustness of the decoding batch input creation process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-13 05:05:13 +08:00
Robin Kobus
2ab71f9a80
refactor: decoder buffers (#3307)
* refactor: remove cumLogProbs and logProbs from DecoderBuffers

- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.

These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched

- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.

These changes enhance clarity and maintainability in the decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: integrate DecoderState for sequence length management in decoding process

- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.

refactor: update SlotDecoderBuffers to manage sequence lengths directly

- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.

These changes improve maintainability and streamline the decoding workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Delegate to asyncSend method in SlotDecoderBuffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-12 11:41:24 +02:00
Robin Kobus
1bd84c6d8c
feat: Allow individual gatherContext for each additional output (#3374)
* refactor: Update ExecutorConfig to use AdditionalModelOutput type

- Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output.
- Updated related serialization and deserialization methods to accommodate the new type.
- Adjusted tests to reflect the changes in the output handling structure.

This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove equality operator from TrtGptModelOptionalParams

- Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class.
- Updated the pybind11 bindings to remove the exposure of the equality operator to Python.

This change streamlines the class definition and reduces unnecessary complexity in the bindings.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput

- Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag.
- Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism.
- Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality.

This refactor improves the flexibility and clarity of output management in the batch processing workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce findOutputTensor utility function for output tensor retrieval

- Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity.
- Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity.
- Enhanced error checking for additional context and generation output tensors.

This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Check final indices of additional output tensors and update tests

- Added checks to verify the final indices of additional output tensors for context and generation outputs.
- Updated unit tests to verify the changes.
  - Add lastTokenIds input tensor to test engines.
  - Logits output depends on gatherContextLogits parameter.
- Removed gatherContextOutputs parameter from the validate method in LlmRequest.
  - Context outputs do not depend on computeContextLogits parameter.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Check final indices of additional output tensors and update tests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Remove equality operator from TrtGptModelOptionalParams

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* docs: Update executor.md

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Clean up includes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-12 17:00:36 +08:00
wili
54ad95eaa8
Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338)
* feat/Variable-Beam-Width-Search-Part3, v1.0

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.1

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>
2025-04-08 23:51:27 +08:00
Chuang Zhu
1c88af1378
feat: use cudaMalloc to allocate kvCache (#3303) 2025-04-08 10:59:14 +08:00
tburt-nv
7a659885e3
chore: remove usernames from comments (#3291)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-05 13:44:28 +08:00
Robin Kobus
77724b0fcb
Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183) (#3195)
* Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183)

This reverts commit 75495730bc.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183)

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-04 15:56:28 +02:00
shaharmor98
ee4aab72ec
feat: Support PeftCacheManager in Torch (#3186)
* Add PeftCacheManager implementation

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-04 12:38:08 +08:00
Robin Kobus
b5bc0a9fcd
chore: Add output of first token to additional generation outputs (#3205)
- Updated the first dimension of additional output tensors to match mMaxNewTokens.
- Copy output of last context token to generation outputs.
- Adjusted the expected output size calculations in unit tests to reflect the correct maximum output length.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-02 20:14:16 +08:00
Chuang Zhu
bc5811da65
chore: Ucx ip port remove mpi depend (#3101)
* initial ucx support

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fixes to support dynloading and ucx connection establishment - not stable yet

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* update

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* more connection bringup fixes - faillig on connection vector build

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* executor test pass

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* update

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* passed full benchmark

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* changing to TLLM_THROW and removing cout

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* stoping progress thread at ucxComm destructor

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fixing build with ENABLE_UCX=0 to not build ucx traget at all and removing includes for ucxConnection for cache transceiver, also delete commented cold code

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fix copyrights

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* adding ucx flavor to cache transceiver test and insertto the CI pipeline

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* allowing sending non ib interfaces IPs

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* setting UCX port reuse for the tests in pipeline

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* code review fixes

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* querying ep after GID message is sent to avoid UCX Errors

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fixing more CR issues

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* querying ep to not fail is ep_not_connected yet

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* remove mpi dependency and debug

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* debug to info

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* mpirun n 2

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* remove mpi comm split when disaggOrchestrator mode

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* waive disagg_mtp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use future instead of thread

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use future_promise instead of cv wait

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* connectionId type

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* improve test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* imporve test 2

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* gtest_skip

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>
2025-04-02 09:42:29 +08:00
Robin Kobus
d7386d14a8
refactor: Simplify disableLookahead and improve numDecodingEngineTokens handling (#3103)
* refactor: Simplifiy disableLookahead method

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* Update DecoderBuffers comments

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move numDecodingEngineTokens to DecoderState

This commit introduces new methods in the DecoderState class to manage the number of tokens for each request in a batch. The following changes were made:
- Added `getNumDecodingEngineTokens()` to retrieve the number of tokens for all requests.
- Added `getNumDecodingEngineTokens(SizeType32 batchIdx)` to get the token count for a specific request.
- Added `setNumDecodingEngineTokens(SizeType32 batchIdx, SizeType32 numTokens)` to set the token count for a specific request.
- Updated the setup method to initialize the token count vector based on the maximum batch size.
- Refactored the `CreateNewDecoderRequests` class to utilize the new token management methods, improving clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Improve shape variables in DecoderState

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-01 18:47:31 +08:00
QI JUN
75495730bc
Revert "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183)
This reverts commit 3ee4332fb1.

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-01 12:49:27 +08:00
Robin Kobus
3ee4332fb1
refactor: Replace DecoderFinishedEvent with CudaEvent in decoder classes (#3078)
- Updated the `forwardAsync` method in `GptDecoderBatched` and `iGptDecoderBatched` to return `CudaEvent` instead of `DecoderFinishedEventPtr`, simplifying event handling.
- Removed the `DecoderFinishedEvent` class and its associated usage across various files, streamlining the codebase.
- Adjusted related methods and Python bindings to accommodate the new event structure, ensuring compatibility and maintaining functionality.

These changes enhance the clarity and efficiency of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-03-28 14:50:52 +08:00
Robin Kobus
45134d7095
refactor: Improve decoder finalize function (#3077)
* refactor: Update gatherTree function to accept CUDA stream parameter

This commit modifies the gatherTree function signature to include a runtime::CudaStream parameter, enhancing flexibility in stream management. Additionally, it removes unnecessary buffer manager parameters and stream handling from the function, streamlining the code. The finalize method in GptDecoderBatched is also updated to reflect these changes, improving clarity and maintainability in the decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update GptDecoderBatched finalize

This commit refactors the GptDecoderBatched class to improve method signatures and reduce code complexity:

- Modified finalize method to accept DecoderState as a parameter
- Updated method signatures to work with the new DecoderState approach
- Improved code organization and readability

The changes continue the ongoing refactoring to centralize decoder state management and simplify the decoder implementation.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-03-28 14:33:59 +08:00
BatshevaBlack
3e37531c6a
feat: Add BW measurement (#3070) 2025-03-28 10:53:00 +08:00
wili
3e035f2219
v1.2 (#3082)
Signed-off-by: wili <wili@nvidia.com>
2025-03-26 23:31:29 +08:00
Shunkangz
8ee840159b
Add updateKVCacheTransfer (#2984)
Add kv cache transfer measurement
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-03-25 21:45:35 +08:00
Aurelien Chartier
a33c595c88
Fix logits dtype in assert (#3038)
Remove extra methods in trtGptModelInflightBatching.h. The methods were moved out of that class during a previous refactoring, but the definitions have been left behind.

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-03-25 10:35:21 +08:00
Netanel Haber
da0b0e0ee3
fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size (#2983)
* fix variable window size reuse - disable when *min attention window* starts sliding, not max

* isPreCyclic -> isCyclic, and invert logic, for clarity

* getDecoderState()

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-03-24 22:49:52 +08:00
Kaiyu Xie
2631f21089
Update (#2978)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-23 16:39:35 +08:00
Kaiyu Xie
3aa6b11d13
Update TensorRT-LLM (#2936)
* Update TensorRT-LLM

---------

Co-authored-by: changcui <cuichang147@gmail.com>
2025-03-18 21:25:19 +08:00
Kaiyu Xie
9b931c0f63
Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
Kaiyu Xie
77d7fe1eb2
Update TensorRT-LLM (#2849)
* Update TensorRT-LLM

---------

Co-authored-by: aotman <chenhangatm@gmail.com>
2025-03-04 18:44:00 +08:00
Kaiyu Xie
ab5b19e027
Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
Kaiyu Xie
2ea17cdad2
Update TensorRT-LLM (#2792)
* Update TensorRT-LLM

---------

Co-authored-by: jlee <jungmoolee@clika.io>
2025-02-18 21:27:39 +08:00
Kaiyu Xie
e88da961c5
Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
Dan Blanaru
16d2467ea8 Update TensorRT-LLM (#2755)
* Update TensorRT-LLM

---------

Co-authored-by: Denis Kayshev <topenkoff@gmail.com>
Co-authored-by: akhoroshev <arthoroshev@gmail.com>
Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com>

Update
2025-02-11 03:01:00 +00:00
Kaiyu Xie
be17881062
Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
Kaiyu Xie
aaacc9bd68
Update TensorRT-LLM (#2562)
* Update TensorRT-LLM

---------

Co-authored-by: Starrick Liu <73152103+StarrickLiu@users.noreply.github.com>
2024-12-11 00:31:05 -08:00
石晓伟
548b5b7310
Update TensorRT-LLM (#2532)
* blossom-ci.yml: run vulnerability scan on blossom

* open source efb18c1256f8c9c3d47b7d0c740b83e5d5ebe0ec

---------

Co-authored-by: niukuo <6831097+niukuo@users.noreply.github.com>
Co-authored-by: pei0033 <59505847+pei0033@users.noreply.github.com>
Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2024-12-04 21:16:56 +08:00
Kaiyu Xie
385626572d
Update TensorRT-LLM (#2502)
* Update TensorRT-LLM

---------

Co-authored-by: 岑灿 <yunyi.hyy@alibaba-inc.com>
2024-11-26 16:51:34 +08:00
Kaiyu Xie
535c9cc673
Update TensorRT-LLM (#2460) 2024-11-19 18:30:34 +08:00
Kaiyu Xie
c629546ce4
Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
Kaiyu Xie
b7868dd1bd
Update TensorRT-LLM (#2413) 2024-11-05 16:27:06 +08:00
Kaiyu Xie
f14d1d433c
Update TensorRT-LLM (#2389)
* Update TensorRT-LLM

---------

Co-authored-by: Alessio Netti <netti.alessio@gmail.com>
2024-10-29 22:24:38 +08:00
Kaiyu Xie
1730a587d8
Update TensorRT-LLM (#2363)
* Update TensorRT-LLM

---------

Co-authored-by: tonylek <137782967+tonylek@users.noreply.github.com>
2024-10-22 20:27:35 +08:00
Kaiyu Xie
75057cd036
Update TensorRT-LLM (#2333)
* Update TensorRT-LLM

---------

Co-authored-by: Puneesh Khanna <puneesh.khanna@tii.ae>
Co-authored-by: Ethan Zhang <26497102+ethnzhng@users.noreply.github.com>
2024-10-15 15:28:40 +08:00