shaharmor98
ee4aab72ec
feat: Support PeftCacheManager in Torch ( #3186 )
...
* Add PeftCacheManager implementation
Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-04 12:38:08 +08:00
Fanrong Li
1fe64b90be
fix: fix the acceptance rate of pytorch workflow in trtllm-bench ( #3240 )
...
* fix acceptance rate of pytorch workflow.
* revert the RequestOutput API change.
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-03 15:12:24 +08:00
Jinyang Yuan
2fdfa39ea8
fix: Fix an error related to dummy request when MTP is used ( #3146 )
2025-04-03 11:08:12 +08:00
Anurag Mukkara
d998339855
Raise error for PP + MTP ( #3244 )
...
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-04-03 04:45:31 +08:00
QI JUN
bb10cdcfb8
chore: refine fetch new requests method ( #3213 )
...
* refine broadcast new requests method
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* refine fetch new requests method
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 10:46:00 +08:00
QI JUN
8fe2e5865e
refine broadcast new requests method ( #3198 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-02 08:05:20 +08:00
Jinyang Yuan
992d513bc6
feat: Optionally split MoE inputs into chunks to reduce GPU memory usage ( #3104 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-04-01 16:07:02 +08:00
dongjiyingdjy
22ff81b047
fix: fix illegal memory access when mtp >= 2 ( #3006 )
...
* fix - fix illegal memory access when mtp > 2
---------
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-01 13:36:45 +08:00
jiahanc
c4ee14e43a
fix: Reverse cuda graph size order ( #3116 )
...
Signed-off-by: jiahanc <jiahanc@nvidia.com>
2025-04-01 11:28:36 +08:00
Aurelien Chartier
14e194433c
chore: cleanup py_executor code ( #3132 )
...
* chore: cleanup py_executor code
* Add common loop cleanup function
* Remove checks for attention DP if nothing to queue
* Remove extra return statements
* Remove extra variables
* Remove commented debug print
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
* rename cleanup function
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
---------
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-04-01 09:27:04 +08:00
Anurag Mukkara
435cd2983d
perf: Optimisations for PP + attention DP ( #3134 )
...
* Minor tp_rank fix
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Delete unused function
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* PP broadcast for ADP new requests
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Sync request finish point for intermediate and last pp ranks
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Use local PP layers only for KV cache estimation
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
---------
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-04-01 08:59:16 +08:00
Mike Iovine
5416966ddb
Add initial EAGLE-3 implementation ( #3035 )
...
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-03-29 22:31:24 +08:00
Erin
c75d7cd684
move BuildConfig functional args to llmargs ( #3036 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-03-29 02:20:18 +08:00
Aurelien Chartier
3de82c41cd
PyTorch PP + attention DP support ( #3044 )
...
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-03-28 00:11:19 +08:00
Fanrong Li
0976360204
add support for MTP + cuda_graph_padding ( #3096 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-03-27 16:06:14 +08:00
Jinyang Yuan
6b583f6f83
perf: Enable CUDA graphs when attention DP is used and active requests on different GPUs are uneven ( #3010 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-03-26 21:09:25 +08:00
HuiGao-NV
25f2434495
fix: Set correct draft_token_nums to dummy requests for torch compilation with MTP ( #3053 )
...
Set correct draft_token_nums to dummy requests for torch compilation with MTP
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-03-26 11:32:57 +08:00
yuxianq
268933b5cc
Refactor imports inside tensorrt_llm._torch. ( #3015 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-03-26 11:01:07 +08:00
Aurelien Chartier
ef78518310
Only gather responses on rank 0 ( #3040 )
...
Signed-off-by: Aurelien Chartier <achartier@nvidia.com>
2025-03-24 21:54:51 -07:00
Netanel Haber
da0b0e0ee3
fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size ( #2983 )
...
* fix variable window size reuse - disable when *min attention window* starts sliding, not max
* isPreCyclic -> isCyclic, and invert logic, for clarity
* getDecoderState()
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-03-24 22:49:52 +08:00
Kaiyu Xie
2631f21089
Update ( #2978 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-23 16:39:35 +08:00
Kaiyu Xie
3aa6b11d13
Update TensorRT-LLM ( #2936 )
...
* Update TensorRT-LLM
---------
Co-authored-by: changcui <cuichang147@gmail.com>
2025-03-18 21:25:19 +08:00
Kaiyu Xie
9b931c0f63
Update TensorRT-LLM ( #2873 )
2025-03-11 21:13:42 +08:00
Kaiyu Xie
77d7fe1eb2
Update TensorRT-LLM ( #2849 )
...
* Update TensorRT-LLM
---------
Co-authored-by: aotman <chenhangatm@gmail.com>
2025-03-04 18:44:00 +08:00
Kaiyu Xie
ab5b19e027
Update TensorRT-LLM ( #2820 )
2025-02-25 21:21:49 +08:00
Kaiyu Xie
2ea17cdad2
Update TensorRT-LLM ( #2792 )
...
* Update TensorRT-LLM
---------
Co-authored-by: jlee <jungmoolee@clika.io>
2025-02-18 21:27:39 +08:00
Kaiyu Xie
e88da961c5
Update TensorRT-LLM ( #2783 )
2025-02-13 18:40:22 +08:00
Dan Blanaru
16d2467ea8
Update TensorRT-LLM ( #2755 )
...
* Update TensorRT-LLM
---------
Co-authored-by: Denis Kayshev <topenkoff@gmail.com>
Co-authored-by: akhoroshev <arthoroshev@gmail.com>
Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com>
2025-02-11 03:01:00 +00:00