* generalizing cudagraph to multiple dynamic inputs
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* fix for failing test
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* Add test stages for sm120
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Update chip name and config name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Split tests to gb202 and gb203
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Don't flash driver for rtx-5090
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Skip the failed cases
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Change the test stage names
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Reduce 5080 jobs and add back the GPU list that doesn't support dynamic driver flashing
Signed-off-by: qqiao <qqiao@nvidia.com>
* Skip failed case on gb202
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix condition for dynamic driver flashing
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
* Rewrite unit test for unified allreduce op. Remove the legacy unit test.
* Revise formats and fusion_op bindings. Make all tensors optional inputs.
* Move the MoeAllreduceOp to a separate custom op.
* Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions.
* Add more TODOs, fix minor bugs, and remove legacy code.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend
- Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
- Update the `DecoderStepAsyncSend` constructor to take the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactor related methods to align with the new class structure, improving maintainability and readability of the code.
These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
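The constructor change above can be sketched as follows. This is a minimal illustration of the pattern, not the real classes: the names `DecoderBuffers` and `DecoderStepAsyncSend` come from the commit, but the members and the send logic are invented stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for the real DecoderBuffers: after the refactor it
// only owns data; communication logic no longer lives here.
struct DecoderBuffers
{
    std::vector<int> newOutputTokens;
    std::vector<float> cumLogProbs;
};

// Communication logic moves here. The constructor takes the whole buffer
// struct instead of a long list of individual tensor parameters, which is
// what "reducing parameter complexity" refers to.
class DecoderStepAsyncSend
{
public:
    explicit DecoderStepAsyncSend(DecoderBuffers const& buffers)
        : mTokens(&buffers.newOutputTokens)
        , mLogProbs(&buffers.cumLogProbs)
    {
    }

    // Hypothetical helper: here we just count the elements that a real
    // implementation would transmit asynchronously.
    std::size_t numElementsToSend() const
    {
        return mTokens->size() + mLogProbs->size();
    }

private:
    std::vector<int> const* mTokens;
    std::vector<float> const* mLogProbs;
};
```

Grouping related parameters into one struct keeps call sites stable when a new buffer is added: only the struct and the consumer change, not every constructor signature in between.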
* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend
- Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
- Update the `DecoderSlotAsyncSend` constructor to take the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactor related methods to align with the new class structure, improving maintainability and readability of the code.
These changes enhance the structure and readability of the batch manager's decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Log DecodingMode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Introduce DecoderOutputBuffers and update related classes
- Move buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect their purpose.
- Update the `DecoderStepAsyncSend` class to use `DecoderOutputBuffers`, enhancing clarity in the communication logic.
- Refactor the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
- Add Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.
These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Update MPI communicator handling
- Change the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in the `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
- Update related methods and constructors to reflect the new parameter type, enhancing maintainability.
- Refactor the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.
These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
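The parameter-type change above follows a common pattern: a class that only uses a communicator, and never shares ownership of it, should not take a `shared_ptr`. A minimal sketch with a mock `MpiComm` (the commit does not say whether the new parameter is passed by reference or value; this sketch assumes a const reference):

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Illustrative stand-in for mpi::MpiComm.
class MpiComm
{
public:
    explicit MpiComm(int rank) : mRank(rank) {}
    int getRank() const { return mRank; }

private:
    int mRank;
};

// Before: holding a shared_ptr implies shared ownership the sender never needed,
// and every construction bumps an atomic reference count.
class SenderBefore
{
public:
    explicit SenderBefore(std::shared_ptr<MpiComm> comm) : mComm(std::move(comm)) {}
    int rank() const { return mComm->getRank(); }

private:
    std::shared_ptr<MpiComm> mComm;
};

// After: a plain const reference documents that the caller owns the
// communicator and must keep it alive for the sender's lifetime.
class SenderAfter
{
public:
    explicit SenderAfter(MpiComm const& comm) : mComm(comm) {}
    int rank() const { return mComm.getRank(); }

private:
    MpiComm const& mComm;
};
```

The reference version also lets callers that never had a `shared_ptr` (e.g. a stack-allocated or member communicator) use the class without wrapping.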
* refactor: Replace shared_ptr with unique_ptr for buffer management
- Update the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
- This change makes single ownership explicit and avoids `shared_ptr` reference-counting overhead.
These modifications contribute to a cleaner and more efficient architecture in the batch manager.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
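The ownership change above reduces to a one-line sketch. `TrtGptModelInflightBatching` and `RuntimeBuffers` are names from the commit; the member and initialization shown are illustrative assumptions:

```cpp
#include <cassert>
#include <memory>

// Illustrative buffer type standing in for RuntimeBuffers, DecoderBuffers, etc.
struct RuntimeBuffers
{
    int capacity = 0;
};

// The model is the sole owner of its buffers, so unique_ptr documents that
// ownership in the type and removes shared_ptr's atomic ref-count traffic.
class TrtGptModelInflightBatching
{
public:
    TrtGptModelInflightBatching()
        : mRuntimeBuffers(std::make_unique<RuntimeBuffers>())
    {
        mRuntimeBuffers->capacity = 8; // hypothetical setup
    }

    RuntimeBuffers const& runtimeBuffers() const
    {
        return *mRuntimeBuffers;
    }

private:
    std::unique_ptr<RuntimeBuffers> mRuntimeBuffers; // was std::shared_ptr
};
```

If some other component later needs to observe a buffer, it can take a raw pointer or reference from the owner rather than reintroducing shared ownership.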
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* Fix hang bug when KV cache is low
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Review comments
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Fix attentiondp typo
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add CI test for this case
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* fix: Fix the insertion order for responder futures
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* fix: Fix disagg CPP
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* add llama3.2 ptp test case
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* update test list
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
---------
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* Feat: Offload ptable to CPU if enable_chunk_context is set
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Feat: Offload ptable to CPU in chunked context mode
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix and add comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update Readme for multimodal and add a new param mm_embedding_offloading
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* fix: Correct prompt table offloading condition in PromptTuningBuffers
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up the code
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Add comments to explain CPU <-> GPU copies using pinned memory
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix namings based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix format based on precommit
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify --mm_embedding_offloading flag
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* Waive L0 tests
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* The test is fixed in PR 3711
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* Update Nemotron Super and Ultra in Supported Models and add an example
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
* Update README link to match new examples structure
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
---------
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
This change makes the draft target model work without a vocab size mismatch
Signed-off-by: mayani-nv <67936769+mayani-nv@users.noreply.github.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>