Commit Graph

556 Commits

Author SHA1 Message Date
Fanrong Li
75d020cf07
fix: fix cuda graph padding for spec decoding (#4853)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-06 22:21:42 +08:00
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
QI JUN
ec50684d80
Revert "fix a bug of global cuda graph dummy request" (#4970) 2025-06-06 08:54:45 +08:00
QI JUN
bfa877a22e
Fix: fix autodeploy (#4957)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 21:06:55 +08:00
QI JUN
154f7cc40a
fix a bug of global cuda graph dummy request (#4894)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 19:47:40 +08:00
Shunkangz
3eae58ca36
Add disaggregated unittest (#4899)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-05 19:14:31 +08:00
QI JUN
b8c5e3892b
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 17:43:30 +08:00
Lucas Liebenwein
743fb0a159
[AutoDeploy] _AutoDeployLlmArgs as primary config object (#4891)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 17:20:55 +08:00
ixlmar
6437756da8
fix: handle OOMs during KV cache estimation (#4690)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-05 10:02:26 +02:00
Shiyu Li
b0d287c9b7
[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-06-04 18:26:13 -07:00
Yuxian Qiu
6b3242654e
fix: Fix broken vanilla moe since FusedMoE refactor. (#4897)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-05 03:56:41 +08:00
Yi Zhang
1fca654bfd
tests: Update gb200 test case (#4754)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-04 18:49:20 +08:00
ixlmar
2bbb6b5976
chore: introduce KvCacheCreator (#4581)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-04 11:03:17 +02:00
Xianjie Qiao
325ccaae3d
Fix trtllm-bench iter_stats and cuda_graph_batch_sizes error errors. (#4827)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-06-04 16:36:07 +08:00
Zhanrui Sun
35e87b99f3
chore: bump version to 0.21.0rc1 (#4896)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-04 14:31:18 +08:00
tomeras91
8d31e16877
[TRTLLM-4923][feat] Paged mamba cache (#4822)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-04 09:27:08 +03:00
Omer Ullman Argov
e71de2a13e
chore: Mass integration of release/0.20. (#4871)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-06-04 14:12:27 +08:00
Yan Chunwei
ac20159d32
fix: build_config in TorchLlmArgs and avoid invalid args (#4600)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-04 13:17:29 +08:00
QI JUN
e2eea80c1d
Chore: refine comments of prepare inputs method of model engine (#4837)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-04 12:14:13 +08:00
Yukun He
5fa6fbd989
feat: Enhance AutoTuner inference path and code readability (#4466)
Fix AutoTuner warmup request generating.
* The current warmup phase creates one request, which is insufficient for the warmup to cover the max_num_tokens. Revise the warmup phase to a batch of requests to cover the max_num_tokens to eliminate potential fallback cases.
Refactor AutoTuner API and reduce host overhead.

Refine (min, opt, max) values of optimization profile setup for get_valid_tactics to achieve the correct canImplement definition.
* Refine cache key assembly process to reduce host overhead and simplify API.
* Fix lru_cache usage to reduce host overhead.
* Move tuning config initialization as a one-time object in tunable runner to reduce host overhead.

Improve tuning config readability.
* Use dataclass to define tuning config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-04 10:53:11 +08:00
Shi Xiaowei
b13f8c9cba
Fix: NVBug 5302895 (#4835)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-04 09:31:39 +08:00
Shunkangz
c835f06371
Refactor the first token response in PD (#4692)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-04 09:11:23 +08:00
Mike Iovine
73389d6531
[fix] Fix llama 4 long context (#4809)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-04 07:48:08 +08:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
rakib-hasan
d0eb47d33a
[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation (#4506)
* refactoring the multimodal input prep

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* adding out-of-tree override option

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* adding exceptional case for llava-next

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* fixing typo

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* addressing review comments, adding placement option, handling tokenizer variations

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* addressing pytest-asyncio behavior change

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

---------

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-06-03 12:02:07 -07:00
hlu1
b4ed4b22f3
[Arch] Freeze model_config (#4814)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-06-04 02:51:35 +08:00
Yan Chunwei
80b4026775
chore: remove request_error ipc in LLM.submit (#4763)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 20:55:59 +08:00
pcastonguay
01f29ce38b
[nvbug 5294316] fix: Fix queued request stats (#4714)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-03 08:33:08 -04:00
Shunkangz
ae9a6cf24f
feat: Add integration of etcd (#3738)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Batsheva Black <bblack@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-06-03 20:01:44 +08:00
Enwei Zhu
3fe4a1842a
fix: Register MoeLoadBalancerConfig to serialization.py (#4864)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-03 19:22:36 +08:00
Frank
80f9989a1e
[enhanchment] Add beam width to low latency. (#4812)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-03 17:24:55 +08:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched (#4700)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling (#4532)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
hlu1
320195dc0d
[Architecture] Refactor FusedMoE (#4790)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-06-03 14:02:19 +08:00
Yuxian Qiu
ec796e44e4
feat: add heuristics for checkpoint files prefetching. (#4765)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-03 12:10:37 +08:00
Yan Chunwei
e013c8cbc2
fix [nvbug5256044]: bench hang due to llmapi ipc (#4798)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 10:10:53 +08:00
Fanrong Li
380a5d1690
[https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue (#4536)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
Yilin Fan
eb2d51a429
[fix] Fix llama4 min-latency mode (#4810) 2025-06-02 08:50:01 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Fanrong Li
7d356efc7d
fix: fix accuracy and illegal memory access issues when using mtp + attention dp (#4379)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 00:35:52 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Lucas Liebenwein
491a09b0c6
[AutoDeploy] Increased Model Coverage Mass Migration Week 2 (#4817)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-06-01 14:40:29 +08:00
Enwei Zhu
0087bd27ba
[fix] Fix SamplingParams check on n and best_of (#4655)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 09:11:55 +08:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank (#4767)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Yuxian Qiu
a02df6aa4b
fix: re-enable tp/pp for quickstart_advanced.py. (#4766)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-31 19:13:46 +08:00
Yan Chunwei
93c0632ee4
opt: the perormance for dist-agg streaming generation (#4214)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-31 17:40:32 +08:00
Mike Iovine
8cb6163a57
[fix] Fix Llama 3.3 70b EAGLE (#4772)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-30 10:08:08 -04:00
Yuxian Qiu
f82e44bbb9
fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. (#4758)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-30 18:04:08 +08:00