Commit Graph

1164 Commits

Author SHA1 Message Date
Enwei Zhu
3fe4a1842a
fix: Register MoeLoadBalancerConfig to serialization.py (#4864)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-03 19:22:36 +08:00
Frank
80f9989a1e
[enhancement] Add beam width to low latency. (#4812)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-03 17:24:55 +08:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched (#4700)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling (#4532)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
hlu1
320195dc0d
[Architecture] Refactor FusedMoE (#4790)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-06-03 14:02:19 +08:00
Iman Tabrizian
141467d4b6
Add pre-merge Triton backend tests (#4842)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-03 00:47:58 -04:00
ruodil
fa93eeee84
shorten reqs in con:1 cases and add streaming cases, and add l2 perf … (#4849)
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-03 12:28:13 +08:00
Ivy Zhang
8686868531
tests: [TRTQA-2905] improve timeout report for qa test cases (#4753)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-03 12:27:27 +08:00
Yuxian Qiu
ec796e44e4
feat: add heuristics for checkpoint files prefetching. (#4765)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-03 12:10:37 +08:00
WeiHaocheng
7ce1e1311f
[TRTLLM-5340] fix: remove the accuracy assert on run_majority_vote_aime24.py (#4784)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-03 10:41:03 +08:00
Robin Kobus
e34a1beb72
[nvbugs/5303555] ci: unwaive test_fp8_block_scales_cuda_graph_padding (#4735)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 10:40:43 +08:00
Yan Chunwei
e013c8cbc2
fix [nvbug5256044]: bench hang due to llmapi ipc (#4798)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 10:10:53 +08:00
Fanrong Li
380a5d1690
[https://nvbugs/5271281][fix] fix a pd+mtp accuracy issue (#4536)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
yunruis
4e2fefc076
upgrade cutlass to 4.0 (#4794)
Signed-off-by: yunruis <yunruis@nvidia.com>
2025-06-03 09:55:02 +08:00
Po-Wei (Vincent)
9ae2ce6665
[TRTLLM-5502][infra] Add github action to identify if PR is from community (#4824)
Signed-off-by: Po-Wei Wang (Vincent)
2025-06-03 06:36:35 +08:00
Yilin Fan
90aab0596e
[fix] Fix Llama4 guardwords failures (#4844)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-02 13:43:42 -07:00
Fanrong Li
13f68338d2
fix: [https://nvbugspro.nvidia.com/bug/5273945] Unwaive tests for bug-5273945 (#4832)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 22:01:57 +08:00
Yanchao Lu
8166649d03
[Infra] - Minor clean-up and test Ubuntu mirrors (#4829)
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-02 20:18:20 +08:00
Yilin Fan
eb2d51a429
[fix] Fix llama4 min-latency mode (#4810)
2025-06-02 08:50:01 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Fanrong Li
7d356efc7d
fix: fix accuracy and illegal memory access issues when using mtp + attention dp (#4379)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 00:35:52 +08:00
Netanel Haber
2ce05c3ab4
'entered copyBlock' format string expects %s, pass string rather than int (#4820)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-01 08:54:33 -07:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
amirkl94
8039ef45d3
CI: Performance regression tests update (#3531)
2025-06-01 09:47:55 +03:00
Lucas Liebenwein
491a09b0c6
[AutoDeploy] Increased Model Coverage Mass Migration Week 2 (#4817)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-06-01 14:40:29 +08:00
Emma Qiao
202813f054
Check test names in waive list (#4292)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-01 14:39:30 +08:00
Enwei Zhu
0087bd27ba
[fix] Fix SamplingParams check on n and best_of (#4655)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 09:11:55 +08:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank (#4767)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Dom Brown
338d6e9f95
[nvbug 5305210] fix: Resolve nvbug 5305210 (#4759)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-31 19:21:06 +08:00
Yuxian Qiu
a02df6aa4b
fix: re-enable tp/pp for quickstart_advanced.py. (#4766)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-31 19:13:46 +08:00
Yan Chunwei
93c0632ee4
opt: the performance for dist-agg streaming generation (#4214)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-31 17:40:32 +08:00
Emma Qiao
c945e92fdb
[Infra]Remove some old keyword (#4552)
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-31 13:50:45 +08:00
Mike Iovine
8cb6163a57
[fix] Fix Llama 3.3 70b EAGLE (#4772)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-30 10:08:08 -04:00
juney-nvidia
49f2f1f8eb
Expose new tech blog about DSR1 throughput optimization to the main R… (#4803)
Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-05-30 20:44:12 +08:00
Tao Li @ NVIDIA
3b7120d60e
DeepSeek R1 throughput optimization tech blog for Blackwell GPUs (#4791)
Signed-off-by: Tao Li
2025-05-30 18:54:19 +08:00
Yuxian Qiu
f82e44bbb9
fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. (#4758)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-30 18:04:08 +08:00
Pengyun Lin
bac22ff7b5
[feat] support sharegpt downloading in benchmark_serving (#4578)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-30 17:27:53 +08:00
QI JUN
99fdef20c4
[TRTLLM-5516] perf: replicate dummy request for cuda graph padding (#4729)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-30 17:14:23 +08:00
ixlmar
c026dda400
fix: iteration logging and typing in PyExecutor (#4734)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 11:01:20 +02:00
ixlmar
7e6d06d5d7
feat: estimate GPU mem. usage w/ minimal KV cache (#4574)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-30 10:40:45 +02:00
Zheng Duan
54200ee8ac
fix: random fail of cache router test (#4597)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-30 16:28:19 +08:00
Chuang Zhu
f117d6abe9
Fabric Memory for KV Cache Transfer (#4717)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-30 15:50:21 +08:00
Enwei Zhu
ee916da8f1
test: Waive test_llm_loading_from_ckpt_for_tp2 (#4797)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-30 15:43:00 +08:00
xinhe-nv
53794b26f8
test: skip test_llm_hf_gemma_quantization_1gpu_vswa on A100 (#4779)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-30 15:12:12 +08:00
Thor Johnsen
55d56f8155
[JIRA-5226219][fix] Fix Bug in KV cache manager (#4596)
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
2025-05-29 22:03:20 -07:00
Aurelien Chartier
36b87b8671
chore: fix llm_root when LLM_ROOT is not set (#4741)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-29 19:44:34 -07:00
juney-nvidia
fe359d9df9
Added code owners for AutoDeploy (#4769)
Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-05-30 09:55:27 +08:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00