Commit Graph

449 Commits

Author SHA1 Message Date
yuxianq
0e7e949feb
refactor: Split llama4 model from llama model. (#3530)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-15 13:41:05 +08:00
tburt-nv
e1e068d4f3
fix local user (#3550)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-15 13:20:34 +08:00
Bo Li
5eae397b3b
doc: Update instructions to enable FP8 MLA for Deepseek. (#3488)
* doc: Update doc to enable FP8 MLA for Deepseek.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Update.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Update.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Update the status on Hopper and Blackwell.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Update.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

* Update table of contents.

Signed-off-by: Bo Li <bobboli0202@gmail.com>

---------

Signed-off-by: Bo Li <bobboli0202@gmail.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-04-15 13:12:33 +08:00
Zheng Duan
b0cb963199
test: torch-flow conditional disagg test (#3410)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-04-15 10:54:14 +08:00
Jinyang Yuan
175adb94ab
chore: Log memory sizes of weights and activations separately (#3467)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 09:48:35 +08:00
nv-guomingz
b32ae7ac92
test:add fp8_kv_cache functionality test case. (#3457)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-15 09:16:46 +08:00
QI JUN
112f716155
chore: move all distributed related codes into _torch.distributed directory (#3511)
* move all distributed related codes into _torch.distributed directory

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-15 08:39:17 +08:00
brb-nv
098ca7f68c
test: Fix breaking Phi3 multimodal tests (#3544)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-15 08:02:34 +08:00
Iman Tabrizian
bad55e99bb
test: Add MTP + overlap + Attention DP disaggregated test (#3542)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-15 07:46:03 +08:00
Pamela Peng
6cdfc54883
feat: Add FP8 support for SM 120 (#3248)
* Allow FP8 on SM120

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix sm121

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix pre-commit

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* review update

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

---------

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-04-14 16:05:41 -07:00
tburt-nv
c0dd6cbce0
infra: move nvrtc_wrapper to conan (#3282)
* add pip scripts dir to path
* move nvrtc_wrapper to conan
* support building nvrtc wrapper from source

---------

Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-15 05:31:01 +08:00
Aurelien Chartier
8cf2785bc6
chore: unify pp_layers helpers (#3429)
* chore: unify pp_layers helpers

Fix assumptions about equal number of layers per PP rank
in prepare_attention_inputs

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-04-15 04:49:17 +08:00
Chang Liu
01cb3ccb04
use global expert idx to load expert weights (#3386)
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:58:30 -07:00
Chang Liu
1902d73eb5
fix: llama4: add an option apply_router_weight_on_input for in FusedMoE (#3492)
* apply a tenative fix to moe bypass kernel update

* Pass none to disable final stage in moe

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>

---------

Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:56:42 -07:00
Kaiyu Xie
b286b51118
feat: Support torch profiler (#3470)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-14 22:06:06 +08:00
Yuan Tong
5985d362a9
fix: install RTC headers with linking when using --linking_install_binary (#3484)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-14 21:22:03 +08:00
Zhanrui Sun
714ff3eedd
chore: bump version to 0.19.0rc0 (#3535)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-14 18:11:20 +08:00
Robin Kobus
f58d4698c8
chore: Clean up cpp runtime (#3505)
* chore: Remove unused tensors from DecoderBuffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Remove unused argument from readme

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: remove unused tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove unnecessary newOutputTokens

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove unnecessary event in getDecoderSlotHostOutputs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-14 18:00:03 +08:00
Zhanrui Sun
ee4ce0379d
chore: bump version to 0.19.0rc0 (#3514)
* chore: bump version to 0.19.0.rc0

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update README

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-14 17:32:30 +08:00
Ivy Zhang
170bc22139
fix test name (#3534)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-04-14 17:09:50 +08:00
Zhenhuan Chen
c0493523d0
infra: add more codespell ignore words, prevent ans->and (#3513)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-04-14 17:00:30 +08:00
Kaiyu Xie
f99be2726f
doc: Add example section for multi-node DeepSeek R1 benchmark on GB200 (#3519)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-14 16:45:55 +08:00
Chuang Zhu
2e669133c2
disagg perf tune doc (#3516)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-14 16:05:06 +08:00
xinhe-nv
b1d8495b3d
update waive list (#3510)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-14 15:24:48 +08:00
dongjiyingdjy
2fb1d65d43
fix: fix max_seq_len in executor_config (#3487)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-04-14 15:13:29 +08:00
HuiGao-NV
9f41e826bf
fix: remove one duplicated line of code (#3523)
Signed-off-by: Hui Gao <huig@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-14 14:52:46 +08:00
bhsueh_NV
9d7d48faeb
fix: disable the kv cache reuse for prompt tuning test (#3474)
* disable the kv cache reuse for prompt tuning test

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* unwaive the wavied tests

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-14 14:35:47 +08:00
brb-nv
44090a5388
Add support for Phi-4-MM (#3296)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-14 14:24:10 +08:00
QI JUN
8d3e449a8d
reduce num layers in attention test (#3509)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-14 12:43:59 +08:00
Yiqing Yan
19d296b4b2
chore: add dgx_h200 tests (#3451)
* add dgx_h200 tests

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* test

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* fix pre-commit

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* fix

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* fix

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* change bsl branch

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* fix

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

* change multi gpu related file list

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>

---------

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-04-14 11:20:55 +08:00
yuxianq
9d64b6b890
Cache sin cos in model instead of global LRU cache. (#3378)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 11:19:09 +08:00
pcastonguay
fe6f14b2b1
fix: Fixing issue with first gen token being returned twice in streaming (#3427)
* fix: Fixing issue with first gen token being returned twice with streaming

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing not_expectring_strings in test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-13 22:45:09 -04:00
William Tambellini
af67bf00a8
feat: register ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343)
No change of default value (still ON).
These were hidden cmake vars before that patch.
Fix issue #3289

Signed-off-by: William Tambellini <wtambellini@sdl.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-14 10:30:23 +08:00
Chuang Zhu
75e13f4f88
chore: disable some env for disagg defaultly (#3415)
* disable some env for disagg defaultly

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* doc

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* remove

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-14 10:08:10 +08:00
yuxianq
baeec63dda
refactor: Remove _pp_forward. (#3496)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 09:49:44 +08:00
Yiqing Yan
65d1591fbf
Waive L0 test (#3508)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-14 09:32:01 +08:00
Chuang Zhu
6ee021a90d
chore: exchange connection id with tagSend/tagRecv (#3320)
* exchange connection id with tagSend/tagRecv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* unwaive

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* tag recv/send

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-14 09:30:34 +08:00
HuiGao-NV
d0f83d19f1
fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache (#3497)
* fix: add kv memory size per token of draft model to calculate max number
of tokens of kv cache

Signed-off-by: Hui Gao 

* Fix code to get model_config of draft model

Signed-off-by: Hui Gao 

---------

Signed-off-by: Hui Gao
2025-04-13 23:02:14 +08:00
Yan Chunwei
b37c5c0a4d
make LLM-API slurm examples executable (#3402)
Signed-off-by: chunweiy <328693+Superjomn@users.noreply.github.com>
2025-04-13 21:42:45 +08:00
Aurelien Chartier
7b38018fa0
feat: Add numNodes to ParallelConfig (#3346)
* Add numNodes to ParallelConfig

If not provided, attempt to find the number of nodes by
adding the number of local ranks 0

Update device IDs check accordingly

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>

* Add ParallelConfig pickle test

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>

---------

Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-04-13 13:55:04 +02:00
dominicshanshan
5d3180be82
feat: Add stress test for TRT-LLM (#3250)
Signed-off-by: Wangshanshan <dominicw@nvidia.com>
2025-04-13 10:24:25 +08:00
Yan Chunwei
74850c61e9
fix: switch ZMQ from file socket to tcp socket in RemoteMpiCommSession (#3462)
* switch ZMQ from file socket to tcp

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix comment

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-13 09:15:55 +08:00
Robin Kobus
ceec4924d9
refactor: batch slot management in decoder classes (#3300)
* refactor: batch slot management in decoder classes

- Changed `forwardBatchSlots` from a single `TensorPtr` to a `std::vector<TensorPtr>` in `decoderBuffers.h` and updated its initialization in `decoderBuffers.cpp`.
- Updated `batchSlots` in `iGptDecoderBatched.h` to a `std::vector<TensorPtr>` for better handling of batch sizes.
- Modified `mBatchSlotsDecoder` in `statefulGptDecoderBatched.h` to use a `std::vector<TensorPtr>` and adjusted its initialization in `statefulGptDecoderBatched.cpp`.
- Ensured proper reshaping of tensors in the setup methods to accommodate the new vector structure.

These changes enhance flexibility in managing tensor buffers across different batch sizes.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Setup batch slots outside of the decoder

- Refactored batch slot management to utilize `makeBatchSlots`, enhancing clarity and functionality in batch processing.
- Introduced `DecoderState` to `MakeDecodingBatchInputOutput` for improved state handling during decoding.
- Updated the `operator()` method to include `decoderState` as a parameter, facilitating better integration with the decoding process.
- Modified related tests to accommodate changes in batch slot handling and ensure proper functionality.

These updates improve the overall structure and efficiency of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance decoder input structure with maxDecodingEngineTokens

- Updated the `Input` class in `iGptDecoderBatched.h` to include a new parameter `maxDecodingEngineTokens` for better control over decoding limits.
- Modified the `MakeDecodingBatchInputOutput` algorithm to compute the maximum number of decoding tokens based on active slots.
- Adjusted the `GptDecoderBatched` class to utilize the new `maxDecodingEngineTokens` parameter, improving clarity in token management during decoding.
- Updated Python bindings to reflect changes in the `Input` class constructor.
- Enhanced tests to ensure proper handling of the new parameter.

These changes improve the flexibility and efficiency of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Streamline decoder input creation and batch slot management

- Introduced a new function `createDecoderInputs` to encapsulate the logic for creating decoder inputs, improving code organization.
- Updated the `operator()` method to utilize the new `createDecoderInputs` function, simplifying the decoding input setup process.
- Removed the `maxOfActiveSlots` template function to streamline the logic for determining the maximum number of active decoding engine tokens.
- Introduced a direct calculation of `maxActiveDecodingEngineTokens` within the `createDecoderInputs` function, enhancing clarity and reducing complexity.

These changes enhance the maintainability and readability of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update logits handling in decoder batch

- Modified the `decoder_batch::Input` to accept a vector of vectors for logits, enhancing flexibility in tensor management.
- Adjusted the `createDecoderInputs` function to accommodate the new logits structure, ensuring proper batch processing.
- Updated Python bindings to reflect changes in the `Input` class constructor, maintaining compatibility with existing interfaces.
- Refactored the `GptDecoderBatched` and `StatefulGptDecoderBatched` classes to utilize the updated logits structure, improving clarity in tensor slicing and batch size management.
- Enhanced tests to validate the new input structure and ensure correct functionality across various decoding scenarios.

These changes streamline the decoding process and improve the overall maintainability of the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Rename maxDecodingEngineTokens to maxDecoderSteps

- Updated the `Input` class in `iGptDecoderBatched.h` to rename `maxDecodingEngineTokens` to `maxDecoderSteps` for improved clarity.
- Adjusted the `createDecoderInputs` function to reflect the new naming, ensuring consistency in the decoding process.
- Modified the `GptDecoderBatched` class to utilize `maxDecoderSteps` in its logic, enhancing readability and maintainability.
- Updated Python bindings to expose the renamed parameter, maintaining compatibility with existing interfaces.

These changes enhance the clarity of the decoding parameters and improve the overall structure of the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: remove usage of `active` vector from prepareForward

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Removed the `active` vector from `decoder_batch::Input`

- Removed the `active` vector from the `Input` class constructor in `iGptDecoderBatched.h`, streamlining the input handling for decoding.
- Updated the `createDecoderInputs` function and related tests to reflect the changes in the `Input` class, ensuring compatibility and maintaining functionality.
- Adjusted Python bindings to accommodate the new constructor signature, enhancing clarity in the interface.

These changes improve the maintainability and readability of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: remove usage of `active` vector from gptDecoderBatchedTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Unify the creation of decoder batch inputs in algorithm and tests

- Added a new static method `createDecoderBatchInputs` to streamline the creation of decoder batch inputs, enhancing clarity and maintainability.
- Updated the implementation to utilize active slots directly, simplifying the logic for managing batch slots and logits.
- Refactored the `operator()` method to leverage the new input creation function, ensuring compatibility with existing decoding processes.
- Enhanced tests to validate the new input handling approach, ensuring correct functionality across various scenarios.

These changes improve the overall structure and readability of the decoding process in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: remove usage of active vector from createDecoderBatchInputs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update maxDecoderSteps calculation

- Replaced integer division with `common::ceilDiv` for calculating `maxDecoderSteps` and `numDecoderSteps`, ensuring correct handling of token counts.

These changes enhance the robustness of the decoding batch input creation process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-13 05:05:13 +08:00
pcastonguay
145a126a28
chore: Unwaive DS + overlap disagg test (#3339)
* chore: Unwaive DS + overlap disagg test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing pre-commit

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing pre-commit

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-12 13:33:38 -04:00
WeiHaocheng
c6081abb0e
feat: Make scaffolding Controller more generic #3408 (#3416)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-12 21:35:38 +08:00
QI JUN
012fb9a1c4
remove useless max_num_tokens member in PyTorchConfig (#3493)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-12 21:09:58 +08:00
Robin Kobus
2ab71f9a80
refactor: decoder buffers (#3307)
* refactor: remove cumLogProbs and logProbs from DecoderBuffers

- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.

These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched

- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.

These changes enhance clarity and maintainability in the decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: integrate DecoderState for sequence length management in decoding process

- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.

refactor: update SlotDecoderBuffers to manage sequence lengths directly

- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.

These changes improve maintainability and streamline the decoding workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Delegate to asyncSend method in SlotDecoderBuffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-12 11:41:24 +02:00
yuxianq
29c5085400
fix: Fix PP for llama. (#3449)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-12 17:20:27 +08:00
Robin Kobus
1bd84c6d8c
feat: Allow individual gatherContext for each additional output (#3374)
* refactor: Update ExecutorConfig to use AdditionalModelOutput type

- Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output.
- Updated related serialization and deserialization methods to accommodate the new type.
- Adjusted tests to reflect the changes in the output handling structure.

This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove equality operator from TrtGptModelOptionalParams

- Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class.
- Updated the pybind11 bindings to remove the exposure of the equality operator to Python.

This change streamlines the class definition and reduces unnecessary complexity in the bindings.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput

- Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag.
- Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism.
- Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality.

This refactor improves the flexibility and clarity of output management in the batch processing workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce findOutputTensor utility function for output tensor retrieval

- Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity.
- Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity.
- Enhanced error checking for additional context and generation output tensors.

This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Check final indices of additional output tensors and update tests

- Added checks to verify the final indices of additional output tensors for context and generation outputs.
- Updated unit tests to verify the changes.
  - Add lastTokenIds input tensor to test engines.
  - Logits output depends on gatherContextLogits parameter.
- Removed gatherContextOutputs parameter from the validate method in LlmRequest.
  - Context outputs do not depend on computeContextLogits parameter.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Check final indices of additional output tensors and update tests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Remove equality operator from TrtGptModelOptionalParams

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* docs: Update executor.md

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Clean up includes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-12 17:00:36 +08:00
nv-guomingz
adf60a8723
fix:update the default excluded_modules value for fp8rowwise recipe. (#3477)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-12 16:00:21 +08:00