Guoming Zhang
ef676fc71f
[ https://nvbugs/5513192 ][fix] Add the missing param for kv_cache_tran… ( #7679 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-11 19:00:16 +08:00
QI JUN
656f229b58
[None][ci] move some test cases from l40s to a30 ( #7684 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-11 07:22:34 +08:00
Emma Qiao
9986070044
[None][infra] Waive failed cases on main 0910 ( #7676 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-11 01:43:29 +08:00
Dom Brown
fc9d426589
[ https://nvbugs/5505402 ] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues ( #7616 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-09-10 18:30:48 +01:00
nvamyt
222e01662c
[ https://nvbugs/5488212 ][waive] Waive failed tests for L20 ( #7664 )
...
Signed-off-by: nvamyt <amyt@nvidia.com>
2025-09-10 22:32:15 +08:00
xinhe-nv
207c5258c4
[ https://nvbugs/5494698 ][fix] skip gemma3 27b on blackwell ( #7505 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-10 21:09:27 +08:00
Bo Deng
bf57829acf
[TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. ( #7503 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-09-10 17:35:51 +08:00
Frida Hou
bbb5ae3349
[ #5861 ][autodeploy] Refactor: Quantization Transforms with Inheritance ( #7227 )
...
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-09-10 13:00:06 +08:00
Zheyu Fu
c353ff342e
[None][feat] Make the should_use_spec_decode logic a bit smarter ( #7112 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-10 12:53:59 +08:00
fredricz-20070104
ef620f3579
[ https://nvbugs/5410687 ][test] Add deepseek r1-w4afp8 quickstart ( #7645 )
...
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-09-10 10:21:01 +08:00
Guoming Zhang
beefd6413e
[None][fix] fix post-merge issue raised by #5488 ( #7655 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-10 09:26:27 +08:00
Chang Liu
faa2f46554
[TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next ( #7349 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-09 14:51:36 -04:00
Jin Li
d49374bc45
[TRTLLM-7408][feat] Wrap MOE with custom op. ( #7277 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-09 12:18:56 -04:00
QI JUN
a0e1604898
[None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline ( #7629 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-09 11:06:32 -04:00
Liao Lanyu
af403848d7
[ https://nvbugs/5445466 ][fix] unwaive DS R1 test cases with bug already fixed ( #7429 )
...
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
2025-09-09 17:25:49 +08:00
Perkz Zheng
da6cb541a2
[None][feat] Optimize MLA kernels with separate reduction kernels ( #7597 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-09-09 16:58:44 +08:00
xinhe-nv
8a52015f50
[None][chore] Remove closed bugs ( #7591 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-09 04:08:42 -04:00
William Zhang
c53d1814a7
[None][feat] Extend VLM factory and add Mistral3 factory ( #7583 )
...
This commit:
* extends existing factory interfaces to enable Mistral3 in AutoDeploy.
* adds a Mistral3 VLM factory.
* adds various model patches for pixtral (the vision model) and mistral3
to make the VLM export compliant.
* adjusts checkpoint loading code to take possible parameter name
conversions into account.
* fixes a sampling bug (the `end_id` needs to be taken into account when
sampling, but it is not included in the stop words' token IDs).
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-09-09 02:47:18 -04:00
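The sampling fix mentioned in the entry above boils down to treating `end_id` as a stop condition even though it is absent from the stop-words list. A minimal sketch of that idea in plain Python (names are illustrative, not the actual TensorRT-LLM sampler API):

```python
from typing import List, Optional

def should_stop(new_token: int,
                end_id: Optional[int],
                stop_token_ids: List[int]) -> bool:
    """Return True if generation should stop after sampling new_token.

    The end-of-sequence id must be checked explicitly because it is not
    part of the stop words' token ids.
    """
    if end_id is not None and new_token == end_id:
        return True
    return new_token in stop_token_ids

# end_id stops generation even though it is absent from stop_token_ids.
assert should_stop(2, end_id=2, stop_token_ids=[13, 42])
assert not should_stop(7, end_id=2, stop_token_ids=[13, 42])
```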
Yiqing Yan
5c616da2fd
[TRTLLM-5877][infra] Add fmha tests and auto trigger rules ( #6050 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-09 11:33:09 +08:00
Wanli Jiang
1e0669d27a
[ https://nvbugs/5453709 ][fix] Remove transformers version limit in Qwen2VL ( #7152 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-09 10:38:20 +08:00
Iman Tabrizian
d96c54d8ae
[None][test] Skip eagle3 test ( #7627 )
...
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-09-08 17:23:53 -04:00
dongfengy
fdd5bd49fc
[ https://nvbugs/5481080 ][fix] Fix GPTOSS W4A16 reference ( #7323 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-09-08 13:59:28 -07:00
Chuang Zhu
77657a1c12
[TRTLLM-7361][feat] KV cache transfer for uneven pp ( #7117 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-08 13:37:46 -04:00
Eran Geva
5f2a42b3df
[TRTLLM-6142][feat] AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored ( #7219 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-08 08:45:58 -04:00
Chang Liu
4a1e13897f
[None][feat] Update multimodal utility get_num_tokens_per_image for better generalization ( #7544 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-08 07:42:46 -04:00
bhsueh_NV
219e95569a
[ https://nvbugs/5506683 ][fix] adjust the CI ( #7604 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-09-08 15:41:41 +08:00
dominicshanshan
c9dca69e1b
[None][chore] Mass integration of release/1.0 - 3rd ( #7519 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Nave Assaf <55059536+Naveassaf@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Co-authored-by: yifeizhang-c <219273404+yifeizhang-c@users.noreply.github.com>
Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com>
Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: milesial <milesial@users.noreply.github.com>
Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Co-authored-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com>
Co-authored-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-09-08 14:03:04 +08:00
JunyiXu-nv
504bb7ffa9
[TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API ( #7508 )
...
Signed-off-by: Junyi Xu
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-09-08 11:11:35 +08:00
Raayan Dhar
8f3121ac81
[None][fix] chore: fixing the math on asymmetric tp+pp tests ( #7098 )
...
Signed-off-by: raayandhar <rdhar@nvidia.com>
2025-09-07 14:27:46 -04:00
Netanel Haber
0fee8cd028
[TRTLLM-7153] [feat] Move stop_criteria to sample_async ( #7041 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-09-07 17:36:49 +03:00
Raayan Dhar
bae9560e62
[ https://nvbugs/5448767 ][fix] sync termination of requests across PP ranks ( #7455 )
...
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-07 08:45:49 -04:00
Emma Qiao
aea8ac1649
[TRTLLM-5950][infra] Removing remaining turtle keywords from the code base ( #7086 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-07 14:26:18 +08:00
Mike Iovine
45390402fc
[ https://nvbugs/5502352 ][fix] Fix 2-model CDL path ( #7543 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-06 23:53:27 -04:00
Chang Liu
99b98f1374
[TRTLLM-7440][fix] Split fused_input_embed to separate out host sync ( #7280 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-06 23:11:39 -04:00
Chang Liu
23500b55c3
[TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse ( #7106 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-06 17:58:32 -04:00
QI JUN
12ecb864c2
[None][chore] share input_ids buffers among different cuda graphs ( #7236 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-06 17:49:42 -04:00
dominicshanshan
9a97f0a3b7
[None][ci] Waive qwen3 test for accuracy bug in https://nvbugs/5505402 ( #7585 )
...
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-06 21:29:16 +08:00
QI JUN
525bb806a9
[None][ci] move some test cases of DGX H100 to post merge ( #7569 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-06 01:03:38 -04:00
QI JUN
b8183cac2b
[None][ci] Revert "[ https://nvbugs/5461761 ][fix] Remove the waiver ( #7476 )" ( #7584 )
2025-09-05 22:02:09 -07:00
Lucas Liebenwein
74105a45d9
[ #6120 ][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example ( #7221 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-09-05 22:10:48 -04:00
peaceh-nv
25389c9fe2
[ https://nvbugs/5453806 ][unwaive] Unwaive fp8 kvcache attention test ( #7243 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-09-05 12:13:57 -04:00
Emma Qiao
d8ec546b73
[None][infra] Waive failed tests on main branch 0905 ( #7564 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-05 22:46:46 +08:00
Ziyi Xiong
79e0296ca0
[ https://nvbugs/5461761 ][fix] Remove the waiver ( #7476 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-09-05 15:29:54 +08:00
xinhe-nv
8e3962d278
[TRTLLM-6642][feat] add gptoss 20g tests ( #7361 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-05 02:20:28 -04:00
xinhe-nv
b3ba3d98d2
[None][chore] Remove closed bugs ( #7408 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-09-05 02:11:16 -04:00
QI JUN
ff3704897b
[None][ci] remove unnecessary test_modeling_deepseek.py ( #7542 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-04 20:05:27 -07:00
Jin Li
2189a2f3ff
[ https://nvbugs/5483615 ][fix] Remove unnecessary assertion to let mai… ( #7441 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-05 10:56:21 +08:00
Shunkangz
bddf183e15
[None][feat] Add Request specific exception ( #6931 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-09-04 18:43:42 -04:00
Chang Liu
08a0e06621
[TRTLLM-7410][feat] Support hashing and KV cache reuse for videos ( #7360 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-04 14:39:23 -04:00
Yuxian Qiu
48a5270868
[ https://nvbugs/5492485 ][fix] Use offline dataset from llm-models instead. ( #7435 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-04 09:58:16 -07:00
sychen52
98a1bffb7c
[OMNIML-2336][feat] Add NVFP4 x FP8 ( #6809 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-04 09:03:38 -07:00
Enwei Zhu
1745102e72
[TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec ( #7481 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-04 23:30:14 +08:00
Izzy Putterman
26b133f3a7
[None][feat] MultiLayer Eagle ( #7234 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-09-04 10:49:13 -04:00
Ivy Zhang
b46e0ae5d4
[None][test] update nim and full test list ( #7468 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-09-04 09:06:01 -04:00
QI JUN
d38b8e3dd9
[None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests ( #7489 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-04 06:04:51 -07:00
kris1025
cce9556858
[ https://nvbugs/5485886 ][fix] Fix resource free of Eagle3ResourceManager ( #7437 )
...
Signed-off-by: linquanh <linquanh@nvidia.com>
2025-09-04 17:38:13 +08:00
Grzegorz Kwasniewski
3755f8ab7d
[TRTLLM-6342][fix] Fixed triggering BMM sharding ( #7389 )
...
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-09-04 02:01:27 -04:00
Jin Li
2a2dfe273b
[ https://nvbugs/5485102 ][fix] Correctly set stride for piecewise outp… ( #7442 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-04 10:48:15 +08:00
Stanley Sun
db8eb0a447
[TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options ( #7492 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-09-04 10:34:38 +08:00
Lizhi Zhou
d97c1e6bd9
[ https://nvbugs/5470769 ][fix] fix disagg-serving accuracy test case ( #7338 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-04 09:11:01 +08:00
Enwei Zhu
5ff3a65b23
[TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) ( #6948 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-03 15:16:11 -07:00
Lizhi Zhou
7c73c2ff4b
[ https://nvbugs/5485593 ][fix] improve accuracy/test_disaggregated_serving.py ( #7366 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-03 09:38:53 -04:00
Stanley Sun
cebbf48b74
[TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 ( #7083 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-09-03 08:36:52 -04:00
Mike Iovine
79d93f9419
[ https://nvbugs/5488141 ][fix] Unwaive llama3 test_eagle3 ( #7486 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-03 14:10:40 +08:00
Wanli Jiang
4223a9aada
[TRTLLM-7261][feat] Support phi-4 model in pytorch backend ( #7371 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-03 10:27:42 +08:00
Daniel Stokes
109f27265c
[None][perf] Add MOE support for dynamic cluster shapes and custom epilogue schedules ( #6126 )
...
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
2025-09-02 21:54:43 -04:00
Eran Geva
75c1bb6389
[ https://nvbugs/5458798 ][fix] Disabled test_trtllm_bench_backend_comparison due to timeout ( #7397 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-02 11:21:42 -07:00
Simeng Liu
bcc55bcdf3
[ https://nvbugs/5470782 ][fix] Add specific test names for test_deepseek.py ( #7318 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-09-02 10:31:40 -07:00
Emma Qiao
aae5d22bfe
[None][infra] Waive failed tests on main branch 0902 ( #7482 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-02 10:16:49 -04:00
peaceh-nv
90479c50fb
[ https://nvbugs/5453992 ][unwaive] Unwaive llama quickstart test ( #7242 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-09-02 20:28:32 +08:00
JunyiXu-nv
eefe5f2093
[TRTLLM-7208][feat] Implement basic functionalities for Responses API ( #7341 )
...
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-09-02 07:08:22 -04:00
HuiGao-NV
7279297717
[None][infra] waive test case failed on post-merge ( #7471 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-02 06:20:08 -04:00
aalanwyr
c3c95736a1
[TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test ( #7413 )
...
Signed-off-by: Yaran Wu <28771492+aalanwyr@users.noreply.github.com>
2025-09-02 17:21:27 +08:00
Ivy Zhang
3799e5d460
[None][test] auto reuse torch empty cache on qa test ( #7421 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-09-02 04:44:47 -04:00
Yan Chunwei
f90375f37c
[ https://nvbugs/5476580 ][fix] unwaive test_nvfp4_4gpus ( #7454 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-09-02 04:17:14 -04:00
Mike Iovine
b3c57a7042
[TRTLLM-7353][feat] Implement capturable drafting loops for speculation ( #7100 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-01 14:37:44 -04:00
Emma Qiao
01dfd3af1b
[None][infra] Waive failed case on main 0901 ( #7447 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-01 23:27:24 +08:00
bhsueh_NV
16e9d1121c
[ https://nvbugs/5481087 ][fix] fix bug of ci when we use mocker ( #7332 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-09-01 16:22:45 +08:00
nvamyt
efaefca2c8
[None][test] Update case that not support passing quantization fp8 for pytorch backend ( #7302 )
...
Signed-off-by: nvamyt <amyt@nvidia.com>
2025-09-01 12:59:21 +08:00
Yiqing Yan
21291f3d8e
[None][chore] Remove duplicate test waives ( #6999 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Emma Qiao
09bca7ca82
[None][infra] Waive failed tests for release branch 0818 ( #6993 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
peaceh-nv
f4dc1ed39c
[ https://nvbugs/5449218 ][fix] Fix KvCacheConfig error in test_perf ( #6937 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Ivy Zhang
29cdcdb56a
[None][fix] update skip config ( #6891 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Guoming Zhang
d5bc5cd4f2
[ https://nvbugs/5375646 ][fix] update waives.txt for nvbug 5375646 ( #6847 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
William Zhang
d15dcdc4ae
[ https://nvbugs/5448525 ][fix] Mistral Small 3.1 accuracy tests ( #6909 )
...
This commit lowers the GPU memory allocated for KV cache in accuracy
tests, and adjusts a threshold for Mistral Small 3.1 24B for FP8.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Yan Chunwei
ac07418968
[None][ci] unwaive test_ptp_star_attention_example ( #6943 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
xinhe-nv
b4d41d6604
[TRTLLM-7048][feat] add benchmark TRT flow test for MIG ( #6884 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Yan Chunwei
612c26be22
[None][doc] add legacy section for tensorrt engine ( #6724 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
2ez4bz
cf0c47ca2d
[None][fix] Fix batching bug in Mistral3 model ( #6841 )
...
Prior to this commit, if multiple requests with images were in the same
batch, the batching logic for the images would fail.
This commit fixes it, and adds unit tests for it that were verified to
fail prior to the fix.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
2ez4bz
2480aedb73
[TRTLLM-5252][feat] Add fp8 support for Mistral Small 3.1 ( #6731 )
...
This commit adds some level of FP8 support to Mistral Small 3.1 by:
* disabling quantization for the vision sub-model, since `modelopt` does not
support quantizing it (yet).
* extending existing accuracy tests to use a modelopt produced FP8
checkpoint.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
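The vision-tower exclusion described in the entry above can be pictured as a quantization config with an exclude list. A rough sketch, assuming ModelOpt-style wildcard patterns (the keys and patterns are illustrative, not the commit's actual configuration):

```python
import fnmatch

# Illustrative only: quantize the language model to FP8 while skipping the
# vision sub-model, which modelopt cannot quantize yet.
FP8_CFG_TEXT_ONLY = {
    "quant_algo": "FP8",
    "exclude_modules": ["*vision_tower*", "*multi_modal_projector*"],
}

def is_quantized(module_name: str, cfg: dict) -> bool:
    """Return False for modules matched by an exclude pattern."""
    return not any(fnmatch.fnmatch(module_name, pat)
                   for pat in cfg.get("exclude_modules", []))

print(is_quantized("language_model.layers.0.mlp", FP8_CFG_TEXT_ONLY))  # True
print(is_quantized("vision_tower.blocks.3.attn", FP8_CFG_TEXT_ONLY))   # False
```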
Guoming Zhang
3e99744201
[ https://nvbugs/5375594 ][fix] fix oom issue on structural_tag test case ( #6838 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Ivy Zhang
deba2885c1
[None][fix] fix Llama3 eagle3 test case OOM ( #6832 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
xinhe-nv
7841ea6255
[None][chore] waive GB300 known issues ( #6812 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Ivy Zhang
c7147d25dc
[TRTLLM-6975][test] Add multi-turn test cases for VLM models ( #6749 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Tian Zheng
e257cb3533
[None][feat] Support NVFP4 KV Cache ( #6244 )
...
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 09:24:52 +08:00
xinhe-nv
5f939b9121
[None][chore] Add failed cases into waives.txt ( #7342 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-30 00:49:14 -04:00
Emma Qiao
15ec2b855d
[None][infra] Waive failed tests on main branch 08/29 ( #7370 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-29 10:28:20 -04:00
Pengbo Wang @ NVIDIA
62459d533d
[None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss ( #7192 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: Pengbo Wang @ NVIDIA <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-29 17:03:46 +08:00
fredricz-20070104
091b67ad2f
[TRTLLM-7280][test] Add beam search CudaGraph + Overlap Scheduler tests ( #7326 )
...
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-08-29 02:16:22 -04:00
Chang Liu
31b0f0fb0c
[ https://nvbugs/5445466 ][fix] Eliminate race when loading HF dynamic modules ( #7268 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-08-29 12:36:30 +08:00
Richard Huo
ce580ce4f5
[None][feat] KV Cache Connector API ( #7228 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-08-28 23:09:27 -04:00
aalanwyr
085dc19bfa
[TRTLLM-6646][test] NIM migration to TRT-LLM LLMAPI: Add QWQ-32b torch test ( #7284 )
...
Signed-off-by: Yaran Wu <28771492+aalanwyr@users.noreply.github.com>
2025-08-28 23:09:11 -04:00
Yuan Tong
ccb800f909
[TRTLLM-7457][ci] Update unittest parallel config ( #7297 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-29 09:28:04 +08:00
Emma Qiao
1e644fa28a
[None][infra] Waive failed tests on main branch 08/26 ( #7346 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-29 00:24:08 +08:00
Neta Zmora
08f935681d
[ https://nvbugs/5474453 ][fix] fix path to tested model ( #7272 )
...
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
2025-08-28 08:01:48 -04:00
Zongfei Jing
53163bf1df
[TRTLLM-6876][feat] Add low precision all2all for mnnvl ( #7155 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-28 18:26:16 +08:00
QI JUN
ae89163368
[None][ci] skip TestGPTOSS ( #7333 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-28 05:01:49 -04:00
William Zhang
4541655e5f
[ https://nvbugs/5430124 ][ci] Unwaive Mistral 3.1 Small tests ( #7274 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-08-28 00:03:32 -04:00
QI JUN
39c9ffda5a
[None][ci] fix test list name ( #7321 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-27 22:33:22 -04:00
Pengyun Lin
c1e7fb9042
[TRTLLM-7207][feat] Chat completions API for gpt-oss ( #7261 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-08-28 10:22:06 +08:00
bhsueh_NV
9d345b31c0
[ https://nvbugs/5453727 ][fix] unwaive qwen3 CI tests ( #7293 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-27 22:58:59 +08:00
Eran Geva
462169bfc9
[ https://nvbugs/5458798 ][fix] AD perf test outliers handling, tightened threshold, re-enabled in CI, fixed mem threshold ( #7189 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-08-27 07:57:46 -07:00
QI JUN
d09add5ede
[None][ci] parallelize unit tests of auto deploy in B200 ( #7291 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-27 22:32:11 +08:00
Emma Qiao
8dc62ffac4
[None][infra] Waive failed tests on main ( #7300 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-27 09:53:33 -04:00
xinhe-nv
f082e4857c
[TRTLLM-7250][fix] waive failed cases ( #7292 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-27 18:04:46 +08:00
nvamyt
dbd4f21687
[None][fix] Update maxnt of llama_v3.2_1b bench ( #7279 )
...
Signed-off-by: nvamyt <amyt@nvidia.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-27 16:56:28 +08:00
bhsueh_NV
f167b1fd99
[ https://nvbugs/5453727 ][fix] Fix bug of how GPT-OSS setup the parameters in CI ( #7151 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-27 15:26:10 +08:00
QI JUN
e08c7cf17b
[None][ci] remove test_llm_api_autodeploy from B200 test db ( #7282 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-27 03:12:30 -04:00
dongxuy04
abdb2735be
[None][fix] Fix possible hang issue in WideEP and move some tests to pre-merge ( #7262 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-08-27 01:39:24 -04:00
Yuan Tong
6c7813e821
[TRTLLM-7457][ci] Update & cleanup unittest parallel config ( #7254 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-08-27 00:45:58 -04:00
Zhenhuan Chen
d0d8903a7f
[TRTLLM-6960][fix] replace flaky scaled_mm test with more stable config ( #7089 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-08-26 20:58:33 -07:00
Shunkangz
ff4047414b
[None][opt] Balance the request based on number of tokens in AttentionDP ( #7183 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-08-27 11:16:12 +08:00
Zhou Yuxin
ccb6aadea8
[ https://nvbugs/5412456 ][fix] Remove from waives.txt ( #7248 )
...
Signed-off-by: Zhou Yuxin <yuxinz@nvidia.com>
2025-08-27 10:05:53 +08:00
Jin Li
028235404b
[TRTLLM-6633][feat] Padding for piecewise cudagraph ( #6750 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-26 18:31:33 -04:00
Fridah-nv
0f947c64cb
[None][doc] Update autodeploy README.md, deprecate lm_eval in examples folder ( #7233 )
...
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-08-26 10:47:57 -07:00
Void
040f4c70d3
[None][perf] Accelerate global scale calculations for deepEP fp4 combine ( #7126 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-08-27 00:13:13 +08:00
QI JUN
baef70e67e
[None][ci] move qwen3 tests from b200 to gb200 ( #7257 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-26 11:50:53 -04:00
xinhe-nv
80043affb5
[None][chore] Add failed cases into waives.txt ( #7251 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-26 17:13:44 +08:00
amitz-nv
23ed0c892d
[ https://nvbugs/5477332 ][fix] Relax atol in test_mamba2_chunk_scan_combined_prefill_chunking ( #7215 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-26 10:48:58 +03:00
Zheng Duan
cf50ba2980
[TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server ( #6985 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-08-26 15:34:44 +08:00
Zheng Duan
1a929a1490
[ https://nvbugs/5457504 ][fix] fix kv cache event test in disaggregated worker tests ( #7028 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-08-26 14:25:10 +08:00
nvamyt
d8bd8843fc
[None][test] Update qwen3 timeout to 60 minutes ( #7200 )
...
Signed-off-by: nvamyt <amyt@nvidia.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-26 14:18:42 +08:00
qixiang-99
b165f8bc97
fix/improve kvcache allocation in PyTorch runtime ( #5933 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-08-26 12:40:22 +08:00
William Zhang
92576488d3
[None][feat] Skip prefetching consolidated safetensors when appropriate ( #7013 )
...
* Why?
Some models (e.g. anything produced by Mistral) can have both sharded
safetensors and a consolidated safetensor in the same checkpoint
directory. In such cases, prefetching both to memory wastes both time and
memory.
* What?
This commit skips over consolidated safetensors when they are not the
only safetensor file present in the checkpoint directory.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-08-25 23:56:21 -04:00
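A minimal sketch of the skip logic described in the entry above, assuming the checkpoint directory is scanned by file name (the helper and pattern names are illustrative):

```python
from pathlib import Path
from typing import List

def files_to_prefetch(checkpoint_dir: str) -> List[Path]:
    """Pick the safetensor files worth prefetching.

    Mistral-style checkpoints can contain both sharded safetensors and a
    'consolidated' safetensor; prefetching both wastes time and memory, so
    consolidated files are skipped unless they are the only ones present.
    """
    files = sorted(Path(checkpoint_dir).glob("*.safetensors"))
    sharded = [f for f in files if "consolidated" not in f.name]
    return sharded if sharded else files
```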
Leslie Fang
20922b7d1f
[None][chore] Create PyExecutor from TorchLlmArgs Part 1 ( #7105 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-26 10:42:01 +08:00
ruodil
b845eb7a3a
[None][test] add kv cache size in bench metric and fix failed cases ( #7160 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-26 10:10:02 +08:00
Grzegorz Kwasniewski
2101d46d68
[TRTLLM-6342][feat] TP Sharding read from the model config ( #6972 )
...
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-25 15:41:27 -07:00
chenfeiz0326
6a44e5b9d1
[ https://nvbugs/5440241 ][fix] Fix 70B GSM8K Accuracy drop ( #6967 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-08-25 22:09:30 +08:00
Emma Qiao
200db3b809
[None][infra] Waive failed tests on main branch ( #7201 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-25 09:04:37 -04:00
QI JUN
bea5e07fb7
[None][refactor] refactor the CUDA graph runner to manage all CUDA graphs ( #6846 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-25 20:52:05 +08:00
amitz-nv
a1e03af0f4
[TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests ( #7033 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-25 10:37:40 +03:00
Ivy Zhang
f61b74f796
[None][test] add l20 specific qa test list ( #7067 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-25 12:44:08 +08:00
QI JUN
630e67b845
[None][ci] waive test_mamba2_chunk_scan_combined_prefill_chunking[seqlens1-8] ( #7194 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-24 23:52:59 -04:00
Yukun He
9c5b464fe0
[None][feat] Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time. ( #7113 )
...
Because deep_gemm.fp8_gemm_nt triggers many JIT compilations during the inference phase, we need to sweep these shapes ahead of time. The AutoTuner framework is applied to achieve this while retaining the capability to tune the swap_ab flag.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-25 10:48:31 +08:00
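The ahead-of-time JIT warm-up described in the entry above amounts to sweeping the expected GEMM shapes once before serving. A small sketch under that assumption (the function names are placeholders, not the real AutoTuner or DeepGEMM API):

```python
def warmup_fp8_gemm(run_gemm, m_buckets, n, k):
    """Run one GEMM per representative M so JIT compilation happens before
    serving instead of during inference. run_gemm stands in for the fp8
    block-scale deep_gemm call; all names here are illustrative."""
    for m in m_buckets:
        run_gemm(m, n, k)  # first call per shape triggers (and caches) the JIT

# Demo with a dummy runner that just records which shapes were "compiled".
compiled = set()
warmup_fp8_gemm(lambda m, n, k: compiled.add((m, n, k)),
                m_buckets=[1, 8, 64, 512], n=4096, k=4096)
print(sorted(compiled))
```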
Bo Deng
c038fb3ef4
[None][chore] cherry-pick 6940 ( #7097 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-25 10:28:45 +08:00
xinhe-nv
3ba9afcc7b
[None][feat] add gpt-oss tests to sanity list ( #7158 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-25 10:22:07 +08:00
Bo Deng
6e131602b2
[TRTLLM-7096][infra] Testing cache transmission functionality in Python ( #7025 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-25 09:47:39 +08:00
Yiqing Yan
486bc763c3
[None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge ( #7074 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:09:04 -04:00
Robin Kobus
31979aefac
[None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests ( #6754 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-24 20:53:17 +02:00
ajrasane
068056677f
[None][chore] Enable auto deploy accuracy test in CI ( #7179 )
...
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-24 08:42:30 -07:00
Yanchao Lu
ec35481b0a
[None][infra] Prepare for single GPU GB200 test pipeline ( #7073 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-24 21:46:39 +08:00
dongxuy04
19a0ea363b
[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP ( #6973 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Dongxu Yang <dongxuy@nvidia.com>
Co-authored-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-08-24 08:15:29 -04:00
Iman Tabrizian
96ff82e77a
[None][fix] Waive test ( #7185 )
...
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-08-24 10:45:11 +08:00
Izzy Putterman
b36460d7b5
[None][feat] Deepseek: Start Eagle work ( #6210 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
2025-08-22 12:57:17 -04:00
tomeras91
c232ba8157
[TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H ( #6334 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-08-22 12:15:20 -04:00
Suyog Gupta
e3de5758a3
[ #7136 ][feat] trtllm-serve + autodeploy integration ( #7141 )
...
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-22 08:30:53 -07:00
QI JUN
1388e84793
[None][ci] move all B200 TensorRT test cases to post merge ( #7165 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-22 06:47:23 -04:00
xinhe-nv
b8b2bd4a0a
[TRTLLM-7245][feat] add test_multi_nodes_eval tests ( #7108 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-22 17:17:27 +08:00
Linda
898f37faa0
[None][feat] Enable nanobind as the default binding library ( #6608 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-08-22 09:48:41 +02:00
Daniel Cámpora
099f081e03
[TRTLLM-7155][feat] Unify sampler handle logits implementation. ( #6867 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-22 08:09:30 +02:00
xinhe-nv
4017f7cd6b
[None][chore] Add failed cases into waives.txt ( #7109 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-22 10:39:25 +08:00
Wanli Jiang
07c711eb1f
[TRTLLM-6825][fix] Update lora for phi4-mm ( #6817 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-21 22:00:04 -04:00
dominicshanshan
6f245ec78b
[None][chore] Mass integration of release/1.0 ( #6864 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Bo Deng <deemod@nvidia.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Bo Deng <deemod@nvidia.com>
Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com>
Co-authored-by: Raayan Dhar <58057652+raayandhar@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-22 09:25:15 +08:00
Daniel Stokes
f7c597ec40
[None][perf] Make finalize fusion part of the tactic selection logic ( #6915 )
...
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
2025-08-21 14:08:03 -07:00
Fridah-nv
e18dacc931
[ #4403 ][refactor] Move fusion, kvcache, and compile to modular inference optimizer ( #7057 )
...
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-08-21 10:30:36 -07:00
Emma Qiao
344bc4575d
[None][infra] Waive failed case for main branch ( #7129 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-22 00:08:55 +08:00
Dimitrios Bariamis
f49dafe0da
[ https://nvbugs/5394409 ][feat] Support Mistral Small 3.1 multimodal in Triton Backend ( #6714 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-08-21 18:08:38 +02:00
bhsueh_NV
ba0a86e0bb
[ https://nvbugs/5437405 ][fix] qwen3 235b eagle3 ci ( #7000 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-21 01:17:32 -04:00
xinhe-nv
21f4434404
[None][chore] waive failed cases on H100 ( #7084 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-21 11:15:23 +08:00
Chang Liu
75b8a90816
[None][fix] Fix llama4 multimodal by skipping request validation ( #6957 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-08-20 21:58:53 -04:00
Yechan Kim
0893afae3d
[TRTLLM-6771][feat] Support MMMU for multimodal models ( #6828 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-21 08:54:12 +08:00
bhsueh_NV
73d2daa386
[ https://nvbugs/5457489 ][fix] unwaive some tests ( #6991 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-21 08:49:57 +08:00
QI JUN
a918de710a
[None][ci] move some tests of b200 to post merge ( #7093 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-08-20 19:43:40 -04:00
Emma Qiao
f84dd64250
[None][infra] Waive failed tests on main branch 8/20 ( #7092 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-20 06:33:44 -04:00
Robin Kobus
b95cab2a7c
[None][ci] move unittests to sub-directories ( #6635 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-20 05:42:22 -04:00
Iman Tabrizian
e27088421e
[None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests ( #6936 )" ( #7059 )
...
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
2025-08-20 01:45:09 -04:00
xinhe-nv
9e71b4fda4
[TRTLLM-7205][feat] add llama4 tp4 tests ( #6989 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-20 13:22:05 +08:00
Leslie Fang
3f6a9267f1
[None][infra] update feature_combination_matrix of disaggregated and chunked prefill ( #6661 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-20 13:14:34 +08:00
Chang Liu
ce53832610
[TRTLLM-7326][feat] Add standalone multimodal encoder ( #6743 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-08-19 21:42:50 -07:00
Ivy Zhang
fc85e3db1c
[None][fix] fix llmapi import error ( #7030 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-19 22:58:13 -04:00
Bo Deng
30da5d3cc4
[None][chore] unwaive test_disaggregated_genbs1 ( #6944 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-20 09:57:35 +08:00
Yanchao Lu
d26a5a93ad
[ https://nvbugs/5451296 ][bug] Cherry-pick #7017 from release/1.0 branch ( #7043 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-08-19 11:25:05 -04:00
pcastonguay
e07fcc3a22
[ https://nvbugs/5444937 ][chore] Fixing KV events tests ( #7004 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-19 11:18:04 -04:00
zhhuang-nv
7e135d2ea7
[None][feat] Use Separate QKV Input Layout for Context MLA ( #6538 )
...
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-08-19 22:04:48 +08:00
Emma Qiao
8f95f35503
[None][infra] Waive failed tests on main ( #7037 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-19 09:31:07 -04:00
Yiqing Yan
07506bccbe
[None][chore] Remove duplicate test waives ( #7044 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-19 21:04:31 +08:00
Fanrong Li
655d0f48d0
[ https://nvbugs/5455140 ][fix] unwaive DSR1-fp4 throughput_tp8 ( #7022 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-08-19 20:48:05 +08:00
tomeras91
f0bfb49219
[ https://nvbugs/5458874 ][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test ( #6996 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-08-19 15:45:06 +03:00
xinhe-nv
2c86cee38c
[None][chore] Remove closed bugs ( #6969 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-19 16:01:33 +08:00
Shunkangz
54ec2c1af1
[None][opt] Add batch wait timeout in fetching requests ( #6923 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-08-19 03:50:08 -04:00
Eran Geva
636c622bb8
[ https://nvbugs/5458798 ][fix] Relaxed test threshold, added documentation ( #6997 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-19 00:24:03 -07:00
Ivy Zhang
bff5fdf6df
[TRTLLM-6541][test] Add NIM Related Cases Part 1 ( #6684 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-19 13:59:14 +08:00
William Zhang
daa2a65d37
[ https://nvbugs/5454875 ][ci] Unwaive Mistral Small 3.1 test ( #7011 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-08-19 00:32:14 -04:00
fredricz-20070104
e90280a84d
[TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] ( #6939 )
...
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-08-19 00:13:04 -04:00
Fanrong Li
816a120af6
[TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell ( #6710 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-08-19 00:03:03 -04:00
Zhenhuan Chen
2bb90ba002
[TRTLLM-6960][fix] enable scaled_mm tests ( #6936 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-08-19 10:18:04 +08:00
Yi Zhang
a15af879ec
[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic ( #6615 )
...
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-08-19 09:58:44 +08:00
Lizhi Zhou
71e28eab36
[TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models ( #6741 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-08-19 09:58:22 +08:00
Wanli Jiang
dabebb2c7a
[ https://nvbugs/5371480 ][fix] Enable test_phi3_small_8k ( #6938 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-08-19 09:42:35 +08:00
Leslie Fang
e76e5c640f
[None][infra] Enable accuracy test for mtp and chunked prefill ( #6314 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-19 07:42:52 +08:00
Yiqing Yan
1ce23545fc
[None][chore] Remove duplicate test waives ( #6998 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-18 21:15:49 +08:00
Emma Qiao
69ff32f9b1
[None][infra] Waive failed tests on main 0818 ( #6992 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-18 20:34:52 +08:00
Shi Xiaowei
5ec15b98f0
[TRTLLM-7030][fix] uppercase def value in pd-config ( #6981 )
...
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-08-18 02:33:23 -04:00
Leslie Fang
ce0b13ea02
[None][infra] update feature_combination_matrix of disaggregated and Eagle3 ( #6945 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-18 09:18:17 +08:00
Naveassaf
d6322f70b7
[ https://nvbugs/5451028 ][fix] Constrain NemotronSuper test parameters to prevent OOMs ( #6970 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
2025-08-17 13:38:36 -04:00
amitz-nv
3a49b47081
[ https://nvbugs/5390853 ][fix] Fix _test_openai_lora.py - disable cuda graph ( #6965 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-17 16:56:16 +03:00
Emma Qiao
cc6d763824
[None][infra] Waive failed cases in main branch ( #6951 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-17 14:27:59 +03:00
bhsueh_NV
85cbd0263b
[None][feat] Support Yarn on Qwen3 ( #6785 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-08-17 07:21:29 +08:00
Daniel Cámpora
53312eeebd
[TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options ( #6831 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-16 00:27:24 -04:00
brb-nv
9505727d31
[ https://nvbugs/5401114 ][fix] Unwaive Gemma3 tests ( #6952 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-15 16:35:02 -07:00
Yuening Li
1f8ae2b2db
[TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow ( #6629 )
...
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-08-15 17:15:49 -04:00
dongfengy
0ad0b967bb
[None][fix] Make TP work for Triton MOE (in addition to the EP we are using) ( #6722 )
...
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-15 16:58:42 -04:00
ajrasane
4162d2d746
[None][test] Add accuracy evaluation for AutoDeploy ( #6764 )
...
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-15 13:46:09 -04:00
yifeizhang-c
4127d77678
[ https://nvbugs/5394392 ][fix] Enlarge scheduler capacity under disagg bs == 1 ( #6537 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-08-15 09:52:06 -07:00
liji-nv
18ccd053d3
[ https://nvbugs/5427801 ][fix] Torch compile support for Llama4 and Ea… ( #6858 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-15 11:14:20 -04:00
peaceh-nv
1c1d5d2495
[ https://nvbugs/5451373 ][fix] : Fix the accuracy issue when using FP8 context MLA ( #6881 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-08-15 16:53:56 +08:00
xinhe-nv
b23fdfc62f
[None][chore] Add failed cases into waives.txt ( #6914 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-08-15 14:00:16 +08:00
Yanchao Lu
3a987891d8
[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures ( #6836 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-08-15 11:16:07 +08:00
Bo Deng
e54ba75dac
[None][fix] Update tests to use standardized uppercase backend identifiers ( #6921 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-15 11:14:15 +08:00
Frank
2cc59aacb3
[None][fix] Correct reporting of torch_dtype for ModelConfig class. ( #6800 )
...
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-08-14 22:46:20 -04:00
Aurelien Chartier
b13a5a99b2
[None][chore] Add tests for non-existent and completed request cancellation ( #6840 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-08-14 15:57:01 -07:00
Raayan Dhar
8b237b943b
[ https://nvbugs/5441714 ][chore] remove skip on disagg n-gram test ( #6872 )
...
Signed-off-by: raayandhar <rdhar@nvidia.com>
2025-08-14 15:45:00 -07:00
Bo Li
26f413ad90
[ https://nvbugs/5450262 ][fix] Fix unsupported alltoall use case ( #6882 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-08-14 17:46:54 -04:00
Matthias Jouanneaux
69574ad730
[TRTLLM-5966][feat] Helix: extend mapping to support different CP types ( #6816 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-08-14 09:00:02 -07:00
Emma Qiao
96339c69a9
[None][infra] Waive failed cases on main ( #6902 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-08-14 23:59:44 +08:00
Pengbo Wang @ NVIDIA
ffc976ceaf
[ https://nvbugs/5445466 ][fix] fix deepseek r1 hang by not enabling mnnvl by default ( #6860 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-14 22:36:56 +08:00
Shi Xiaowei
1095dfd03c
[None][fix] BREAKING CHANGE: Mismatch between docs and actual commands ( #6323 )
2025-08-14 03:48:57 -04:00
chenfeiz0326
5cd8c0f6cc
[None][test] Add perf-sweep scripts ( #6738 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-08-14 14:04:47 +08:00
NVJiangShao
a700646132
[None][fix] Add FP4 all2all unitest and fix a bug for module WideEPMoE ( #6784 )
...
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-14 13:35:37 +08:00
Yan Chunwei
0132c1db84
[ https://nvbugs/5427043 ][fix] request length exceeds max_num_tokens ( #6821 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-08-14 13:31:12 +08:00
Bo Deng
d8acca495b
[TRTLLM-6675][infra] Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6623 ( #6735 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-14 04:36:38 +00:00
jmydurant
4200fa46d1
[None][feat] Add support for Hopper MLA chunked prefill ( #6655 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-08-14 10:39:26 +08:00
Izzy Putterman
ef53de8eef
[None][feat] Add test for speculative rejection sampler (2-model) ( #6542 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-08-13 22:09:35 -04:00
Mike Iovine
7cba883932
[ https://nvbugs/5410399 ][chore] Unwaive mtp llmapi test ( #6833 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-13 17:38:45 -04:00
Emma Qiao
c7e6145409
[None][infra] Waive failed cases on main ( #6863 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-13 09:50:14 -04:00
Anthony Chang
2198587b35
[ https://nvbugs/5378031 ] [feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend ( #6200 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-08-13 21:24:40 +08:00
Yukun He
bc5f766e0e
[TRTLLM-4501][feat] AutoTuner tuning config refactor and valid tactic generalization. ( #6545 )
...
* Generalize the definition of tactics so that users can implement more customizable tactic types, making the configurations clearer for each kernel run.
* Allow the user not to specify the `gen_tuning_buckets` or the `map_to_tuning_buckets` function.
* Other code refactoring.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-13 16:25:22 +08:00
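The entry above describes the AutoTuner refactor only at a high level. Below is a minimal, hypothetical sketch of the idea, assuming a config object whose tactics are arbitrary objects and whose `gen_tuning_buckets` / `map_to_tuning_buckets` hooks are optional; apart from those two hook names, every identifier here is invented for illustration and is not the actual TensorRT-LLM AutoTuner API.

```python
# Illustrative sketch only: `TuningConfig` and `pick_tactic` are hypothetical
# stand-ins, not TensorRT-LLM's AutoTuner interface.
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Sequence


@dataclass
class TuningConfig:
    # Tactics are arbitrary objects, so each kernel can define its own tactic
    # type (tile shape, stage count, ...) instead of a fixed integer id.
    tactics: Sequence[Any]
    # Both bucket hooks are optional; when omitted, every problem size is
    # tuned and cached individually instead of being mapped to a bucket.
    gen_tuning_buckets: Optional[Callable[[int], Sequence[int]]] = None
    map_to_tuning_buckets: Optional[Callable[[int, Sequence[int]], int]] = None
    _cache: dict = field(default_factory=dict)

    def _bucket_for(self, problem_size: int) -> int:
        if self.gen_tuning_buckets is None or self.map_to_tuning_buckets is None:
            return problem_size  # fall back to exact-size caching
        buckets = self.gen_tuning_buckets(problem_size)
        return self.map_to_tuning_buckets(problem_size, buckets)

    def pick_tactic(self, problem_size: int,
                    run: Callable[[Any, int], None]) -> Any:
        """Time every tactic once per bucket and cache the fastest one."""
        key = self._bucket_for(problem_size)
        if key not in self._cache:
            timings = []
            for tactic in self.tactics:
                start = time.perf_counter()
                run(tactic, key)
                timings.append((time.perf_counter() - start, tactic))
            self._cache[key] = min(timings, key=lambda t: t[0])[1]
        return self._cache[key]
```

With the hooks left unset, `pick_tactic` simply benchmarks once per distinct problem size; supplying them restores bucketed tuning.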
Mike Iovine
f68e03e646
[ https://nvbugs/5452167 ][fix] Fix ngram padding issue ( #6837 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-08-13 11:23:16 +08:00
Yechan Kim
12102e2d48
[TRTLLM-6772][feat] Multimodal benchmark_serving support ( #6622 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-12 19:34:02 -07:00
rakib-hasan
2923eb88a1
[None][fix] Refactoring input prep to allow out-of-tree models ( #6497 )
...
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-08-12 20:29:10 -04:00
xinhe-nv
e35fca4272
[TRTQA-2920][chore] improve hang tests ( #6781 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-12 18:26:51 +08:00
Sergey Klevtsov
27fc35175e
[None][feat] CUTLASS MoE FC2+Finalize fusion ( #3294 )
...
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-12 15:56:48 +08:00
Fridah-nv
0dc4b4e699
[ #4403 ][autodeploy] Refactor: Move more transformations to new inf optimizer, Add quantization_source to factory interface ( #6760 )
...
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
2025-08-11 22:02:46 -07:00
Enwei Zhu
7c686ba8de
[TRTLLM-2285][feat] Enable guided decoding with CUDA graph padding and draft model chunked prefill ( #6774 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-12 09:30:06 +08:00
Ziyi Xiong
b4fcd5f592
[ https://nvbugs/5441438 ][fix] Set correct draft length for the cuda graph dummy request ( #6701 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-08-12 09:28:47 +08:00
Jinyang Yuan
ead89a0e40
[None][perf] Improve the performance of online EPLB on Hopper by better overlapping ( #6624 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-12 09:25:13 +08:00
Chang Liu
be9dd4713c
[ https://nvbugs/5385987 ][fix] Fix Qwen2 quantization issue by pinning transformers version ( #6673 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-08-11 17:16:49 -07:00
Aurelien Chartier
56bfc3a6d2
[None][chore] Find LLM_ROOT and LLM_BACKEND_ROOT dynamically ( #6763 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-08-11 15:18:19 -07:00
rakib-hasan
7ab8112450
[None][fix] Refactoring to avoid circular import when importing torch models ( #6720 )
...
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-08-11 18:00:42 -04:00
Emma Qiao
5145e9d40e
[None][infra] Unwaive an updated case to test ( #6791 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-11 06:47:33 -04:00
Emma Qiao
d6ad4a9d5b
[None][infra] Waive failed tests on main 0811 ( #6778 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-11 03:16:25 -04:00
xinhe-nv
9c358c26e4
[None][chore] remove closed bugs ( #6772 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-11 14:39:58 +08:00
Eran Geva
b3e8fa2960
[None][test] Test trtllm-bench AD vs. PT BEs on H100 single gpu ( #6487 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Co-authored-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
2025-08-11 08:33:13 +03:00
Tracin
49bcaa4e95
Add gpt-oss GSM8K test. ( #6732 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-08-10 22:45:43 -04:00
Chuang Zhu
c566a8d2a2
[None][fix] fix same pp disagg ( #6730 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-08-10 22:45:15 -04:00
Bo Deng
767879ef85
[ https://nvbugs/5431127 ][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper ( #6736 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-11 10:05:10 +08:00
Yechan Kim
60073a7ad9
[None][feat] Support SharedTensor on MultimodalParams ( #6254 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-10 17:48:24 -07:00
pcastonguay
4142320e53
[ https://nvbugs/5444937 ][fix] Fixing kv_cache_event unit test ( #6753 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-10 16:45:38 -07:00
shaharmor98
14b36e07d7
[TRTLLM-6174][feat] Enable FP32 mamba ssm cache ( #6574 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-08-10 16:27:51 -04:00
Gal Hubara-Agam
3c5aec19c2
[ #5048 ][enhance] AutoDeploy: Optimize prepare_inputs ( #6634 )
...
Optimize prepare_inputs routine in AutoDeploy, as part of the effort to reduce the performance gap compared to the default backend.
This PR includes two major fixes and some other minor tweaks:
1. Avoid back-and-forth data copies
2. Optimize position ids update by separating the implementation for generation mode and context mode.
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-08-10 13:55:04 +03:00
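As a rough illustration of the second item in the entry above (and not AutoDeploy's actual `prepare_inputs` code; the function and argument names are hypothetical), the helper below splits the position-id update into a cheap in-place path for generation and a single rebuild-plus-copy for context mode.

```python
# Illustrative sketch only; names are hypothetical, not AutoDeploy's code.
from typing import List

import torch


def update_position_ids(position_ids: torch.Tensor,
                        seq_lens: List[int],
                        is_generation: bool) -> torch.Tensor:
    if is_generation:
        # Generation: every active sequence advances by exactly one token, so
        # an in-place add on the existing device tensor is enough -- nothing
        # is rebuilt and nothing bounces back to the host.
        position_ids.add_(1)
        return position_ids
    # Context (prefill): build the positions for the new prompt tokens once on
    # the host and push them to the device in a single copy.
    fresh = torch.cat([torch.arange(n) for n in seq_lens])
    return fresh.to(position_ids.device, non_blocking=True)
```

Keeping the two modes separate is what avoids the repeated host/device round trips mentioned in the first item.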
Emma Qiao
ee19ca5e58
[None][infra] Waive test main 0808 ( #6751 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-09 23:54:07 -04:00
Ye Zhang
bcf5ec0c9a
[None][feat] Core Metrics Implementation ( #5785 )
...
Signed-off-by: Ye Zhang <zhysishu@gmail.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-08-09 02:48:53 -04:00
Stefan Niebler
b8f036f264
[TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding ( #6665 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-08-08 14:00:33 +02:00
Leslie Fang
294e0d3dab
[ https://nvbugs/5436461 ][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI ( #6631 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-08 15:30:47 +08:00
Li Min
d913955952
[TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell ( #6616 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-08-08 15:03:48 +08:00
ruodil
b15d6fb145
[None][test] fix yml condition error under qa folder ( #6734 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-08-08 15:59:01 +10:00
2ez4bz
064eb7a70f
[TRTLLM-5252][fix] Propagate mapping to intermediate layers ( #6611 )
...
This commit propagates the mapping to intermediate layers to enable
tensor parallelism (amongst other things) in them.
It also fixes issues with a unit test for TP for pixtral, and adds it to a
test list.
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-08-08 01:50:36 -04:00
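A minimal sketch of the pattern described in the entry above, with hypothetical `Mapping`, `ShardedLinear`, and `Block` classes rather than the actual TensorRT-LLM types: the fix amounts to handing the parallelism mapping to intermediate layers at construction time so they can shard themselves.

```python
# Illustrative sketch only; these classes are hypothetical stand-ins.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class Mapping:
    tp_size: int = 1
    tp_rank: int = 0


class ShardedLinear(nn.Module):
    """Linear layer that keeps only this rank's slice of the output features."""

    def __init__(self, in_features: int, out_features: int, mapping: Mapping):
        super().__init__()
        assert out_features % mapping.tp_size == 0
        self.linear = nn.Linear(in_features, out_features // mapping.tp_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


class Block(nn.Module):
    """Intermediate layer: it can only shard itself if the mapping reaches it."""

    def __init__(self, hidden_size: int, mapping: Mapping):
        super().__init__()
        # Forwarding `mapping` here, rather than stopping at the top-level
        # model, is what enables tensor parallelism in intermediate layers.
        self.proj = ShardedLinear(hidden_size, 4 * hidden_size, mapping)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)
```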
Enwei Zhu
aee828d98a
[TRTLLM-6854][feat] Enable guided decoding with disagg serving ( #6704 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-08 12:10:36 +08:00
ruodil
22f45a0e19
[TRTLLM-5252][test] add for mistral_small_3.1_24b perf test ( #6685 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-08-07 22:57:04 -04:00
xinhe-nv
88ced50ca7
[TRTQA-2920][fix] Add failed cases into waives.txt ( #6719 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-08-08 12:54:13 +10:00
Daniel Cámpora
efca359b66
[TRTLLM-6785][feat] BREAKING CHANGE: Enable TRTLLM sampler by default ( #6216 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-08-07 22:19:37 -04:00
Iman Tabrizian
82276167e6
[None][feat] Add NCCL Symmetric Integration for All Reduce ( #4500 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-07 17:28:14 -07:00
Haohang Huang
980929e1a9
[ https://nvbugs/5410687 ][fix] Hopper w4a8 groupwise MoE interleave ( #6708 )
...
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-08-07 15:30:16 -07:00
Yuan Tong
db8dc97b7b
[None][fix] Migrate to new cuda binding package name ( #6700 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-07 16:29:55 -04:00
Raayan Dhar
4055b764db
[None][fix] disagg ctx pp4 + gen pp4 integ test ( #6489 )
...
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Raayan Dhar <58057652+raayandhar@users.noreply.github.com>
2025-08-07 11:18:02 -04:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events ( #6563 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
Enwei Zhu
1b9781e8e7
[TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) ( #6300 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-08-07 05:53:48 -04:00
peaceh-nv
8ec3b1de10
[None][feat] : Add FP8 context MLA support for SM120 ( #6059 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-08-07 16:16:34 +08:00
xinhe-nv
0a467b00cc
[ https://nvbugs/5409414 ][fix] fix Not registered specs ( #6660 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-07 17:55:53 +10:00
hlu1
8207d5fd39
[None] [feat] Add model gpt-oss ( #6645 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
ruodil
6c1f7d8b91
[None][test] correct test-db context for perf yaml file ( #6686 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-08-07 02:47:10 -04:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter ( #6510 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
YueWeng
157ea77549
[ https://nvbugs/5375966 ][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one ( #6658 )
...
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-08-07 10:25:17 +08:00
ruodil
780d7507f9
[None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml ( #6662 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-07 10:02:13 +10:00
ruodil
f30398470d
[None][chore] update readme for perf release test ( #6664 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-07 10:00:45 +10:00
Yan Chunwei
5eae3184fa
[None][chore] add missing tests to test list ( #6590 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-08-06 22:12:27 +08:00
Yechan Kim
1aed7511fe
[ https://nvbugs/5430124 ][fix] Mistral mixture_text_image test case fix ( #6648 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-06 06:58:58 -07:00
Iman Tabrizian
13ecb4aced
[ https://nvbugs/5328160 ][fix] Unwaive disaggregated serving tests ( #6644 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-06 09:08:29 -04:00
Pengyun Lin
79fc2f48c0
[None][chore] Enhance trtllm-serve example test ( #6604 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-08-06 20:30:35 +08:00
Zongfei Jing
0ff8df95b7
[ https://nvbugs/5433581 ][fix] DeepGEMM installation on SBSA ( #6588 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-06 16:44:21 +08:00
ruodil
907c180eb2
[None][test] align kv_frac in perf test with perflab and add more cases for 4 gpus GB200 ( #6632 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-08-06 02:25:57 -04:00
Iman Tabrizian
43bd861ce1
Update allreduce benchmark for torch ( #6271 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-08-05 23:25:23 -07:00
ruodil
0bd99b5d6d
[TRTLLM-6764][test] add new feature cases in cluster(B200/GB200) and sanity test ( #6650 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-08-06 01:45:13 -04:00
yunruis
3ff4f503ad
[None][opt] ADP schedule balance optimization ( #6061 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
2025-08-06 09:38:02 +08:00
Yechan Kim
c17f4984e2
[None][feat] Refactor Llava-Next ( #6478 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-05 17:53:53 -07:00
Aurelien Chartier
6da95f29a9
[None][feat] Add support for fused gate_up_proj scales for FP8 blockwise ( #6496 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-08-05 11:22:32 -07:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling ( #5170 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
liji-nv
dcbfa7e509
[ https://nvbugs/5252313 ][fix] Fix torch compile + MTP ( #6554 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-05 10:31:29 -04:00
Venky
61da2daeb4
[TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system ( #6464 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-08-05 07:14:24 -07:00
Emma Qiao
78a75c2990
[None][Infra] - Split gb200 stages for each test ( #6594 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-05 07:10:00 -04:00
xinhe-nv
c32584125e
[TRTQA-2920][fix] Add failed cases into waives.txt ( #6600 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-05 20:12:55 +10:00
Pengbo Wang @ NVIDIA
c289880afb
[None][fix] fix kimi k2 serving and add test for Kimi-K2 ( #6589 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-08-05 18:05:33 +08:00
Ivy Zhang
08ed9d7305
[None][doc] add introduction doc on qa test ( #6535 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-05 17:02:17 +08:00
Ivy Zhang
d101a6cebc
[ https://nvbugs/5410279 ][test] resubmit timeout refactor ( #6337 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-05 16:39:25 +08:00
Haohang Huang
c9eebcb454
[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec ( #6379 )
...
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Signed-off-by: symphonylyh <31998628+symphonylyh@users.noreply.github.com>
2025-08-05 07:47:41 +00:00
Leslie Fang
164acfa31e
[None][infra] Skip test_eagle3 test with device memory check ( #6617 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-05 02:36:03 -04:00
ruodil
7625845365
test: add README_release_test.md for perf test ( #6443 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-08-05 02:07:42 -04:00
xinhe-nv
a178cea324
[TRTLLM-6856][feat] add disaggregated serving tests to QA list ( #6536 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-05 12:47:53 +10:00
xinhe-nv
fe3d607c4b
[TRTQA-2920][fix] Add failed cases into waives.txt ( #6581 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-05 12:41:23 +10:00
brb-nv
6135f75f87
[None][chore] Update Gemma3 closeness check to mitigate flakiness ( #6591 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-04 10:10:58 -04:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow ( #5997 )
...
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
Ivy Zhang
f3651adea8
[None][test] update invalid test name ( #6596 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-04 08:01:05 -04:00
Emma Qiao
5d8a5a0cb8
[None][Infra] Waive failed case in post-merge on main ( #6602 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-04 19:39:44 +08:00
brb-nv
87e4e9f468
[None][chore] Add unit test for Gemma3 lora ( #6560 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-04 04:56:57 -04:00
Pengyun Lin
a15e33351d
[None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens ( #6259 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-08-04 15:09:51 +08:00
xinhe-nv
a54972e463
[None][fix] remove closed bugs ( #6576 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-04 15:52:11 +10:00
Yuan Tong
a2f271c8e0
[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory ( #5034 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-08-04 13:51:01 +08:00
Leslie Fang
b9fe0fa7ec
[None][infra] Enable test of chunked prefill with logit post processor ( #6483 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-04 01:46:07 -04:00
Leslie Fang
a60190836c
[None][infra] Enable accuracy test for eagle3 and chunked prefill ( #6386 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-08-04 01:45:24 -04:00
ruodil
6459725bf9
test: move ministral_8b_fp8 to fp8_specific gpu list (exclude Ampere) ( #6533 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-04 15:22:39 +10:00
Ivy Zhang
5eefdf2c75
tests: Add llama4 functional cases ( #6392 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-04 11:19:58 +08:00
ruodil
8d82ccca63
test: modify max_lora_rank of phi4_multimodal to 320 ( #6474 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-08-04 12:20:22 +10:00
Yechan Kim
ee6ab5be96
chore: add EXAONE4 accuracy test ( #6397 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-04 10:14:16 +08:00
Ivy Zhang
7547a7d0a2
[TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list ( #6436 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-03 22:11:26 -04:00
Yiqing Yan
3f7abf87bc
[TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 ( #5678 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-08-03 11:18:59 +08:00
Jhao-Ting Chen
4da5cfc511
[None][infra] add eagle3 one model accuracy tests ( #6264 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-08-02 16:07:46 -07:00
Shunkangz
67a3fd858b
[None][feat] Add support of scheduling attention dp request ( #6246 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-01 20:38:01 -04:00
Richard Huo
31802de0b0
[None][fix] Serialize the window_size in the kv event ( #6526 )
...
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
2025-08-01 15:25:18 -07:00
Lizhi Zhou
6f34f3489b
[TRTLLM-6357][test] Add accuracy tests for Qwen3 ( #6177 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-08-01 13:33:34 -04:00
xinhe-nv
263c6c0ad0
test: skip post blackwell ( #6357 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-01 13:10:14 -04:00
Lucas Liebenwein
5247df6ae2
[AutoDeploy] merge feat/ad-2025-07-22 ( #6520 )
...
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Gal Agam <ghubaraagam@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: haoguo <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Gal Agam <ghubaraagam@cw-dfw-h100-004-328-012.cm.cluster>
Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Co-authored-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-08-01 08:51:08 -07:00
Emma Qiao
16febefee0
[None][Infra] - Skip failed tests in post-merge ( #6558 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-08-01 22:21:23 +08:00
brb-nv
7447d6ed85
[TRTLLM-6657][feat] Add LoRA support for Gemma3 ( #6371 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-08-01 09:19:54 -04:00
liji-nv
1daa8c3232
[ https://nvbugs/5340941 ][ https://nvbugs/5375785 ] - fix: Wrap attentio… ( #6355 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-01 07:38:06 -04:00
xinhe-nv
fca0d37798
[None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 ( #6552 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-08-01 20:27:22 +10:00
chenfeiz0326
ba5bdbb138
[None][chore] Disable add special tokens for Llama3.3 70B ( #6482 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-08-01 17:03:27 +08:00
Yukun He
90856bf97d
[ https://nvbugs/5419069 ][fix] Fix the mismatched layer name components. ( #6417 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-01 16:32:39 +08:00
Yang Li
ac23f4a80d
[TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops ( #6515 )
...
Signed-off-by: Yang Li <56944310+yali-arch@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-08-01 15:59:09 +08:00
Ivy Zhang
71524a1a48
[ https://nvbugs/5419066 ][fix] Use trt flow LLM ( #6467 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-08-01 03:33:07 -04:00
Venky
ad5742b105
[fix] Update get_trtllm_bench_build_command to handle batch size and tokens ( #6313 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-08-01 00:08:09 -04:00
Zongfei Jing
7bb0a78631
Deepseek R1 FP8 Support on Blackwell ( #6486 )
...
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-01 10:26:28 +08:00
brb-nv
2eca0d5925
fix: Fix poor generation with FP8 Gemma3 1B checkpoint ( #6499 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-31 17:18:23 -07:00
Simeng Liu
8cf3faa26a
[feat] Auto-enable ngram with concurrency <= 32. ( #6232 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Mike Iovine <miovine@nvidia.com>
Signed-off-by: Mike Iovine <mike.iovine7@gmail.com>
Co-authored-by: Mike Iovine <miovine@nvidia.com>
Co-authored-by: Mike Iovine <mike.iovine7@gmail.com>
2025-07-31 18:45:51 -04:00
Ziyi Xiong
8062e0fe7c
[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically ( #6363 )
...
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-31 15:31:39 -04:00
tomeras91
6d5da9f7c2
[ https://nvbugs/5404046 ][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test ( #6485 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-31 21:35:10 +03:00
shaharmor98
0c42f54a39
Bugfix/fix nemotron nas lora support ( #6380 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-31 13:39:35 -04:00
amitz-nv
1ee7a08d2b
[5830][feat] Improve LoRA cache memory control ( #6220 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-31 09:26:38 +03:00
Faraz
8e84df74b5
Fix e2e test failure for RTX6000 Pro ( #6420 )
...
Signed-off-by: list <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: Faraz <58580514+farazkh80@users.noreply.github.com>
2025-07-30 23:32:44 -04:00
xinhe-nv
ca534e4798
test: add accuracy reference ( #6479 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-31 12:27:29 +10:00
bhsueh_NV
ae3a5fc918
[doc][ci][Qwen3][nvbugs 5374145] Add Qwen3 235B eagle3 CI ( #6477 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-31 09:37:23 +08:00
brb-nv
0e16d1f070
test: Add time logging for lora tests ( #6466 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-30 14:02:43 -07:00
Anurag Mukkara
fac186e3b5
[nvbug/5409417] Unwaive llava test case ( #6460 )
...
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-07-30 14:38:47 -04:00
brb-nv
f6287e4498
Unwaive Gemma2 LoRA test on H100 ( #6461 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-30 12:56:12 -04:00
Bo Deng
24e7f4eece
[nvbug/5410296][fix] Fix OOM in Llama 4 disagg-serve tests ( #6439 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-31 00:41:37 +08:00
Wanli Jiang
9632dba02e
feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm ( #6353 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-30 09:20:16 -07:00
pcastonguay
0f083b9daf
fix: Unwaive triton cpp test [nvbug 5401088] ( #6412 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-07-30 11:25:18 -04:00
nv-guomingz
03e38c9087
chore: update trtllm-serve usage doc by removing backend parameter when it uses torch as backend. ( #6419 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-30 11:11:06 -04:00
Chang Liu
b4065d8ca6
[TRTLLM-6654][feat] Add support for external multimodal embeddings ( #6263 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-07-30 10:00:15 -04:00
pcastonguay
e7ae5e2824
feat: Add support for disaggregation with pp with pytorch backend ( #6369 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Signed-off-by: raayandhar <rdhar@nvidia.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: raayandhar <rdhar@nvidia.com>
Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-07-30 09:42:13 -04:00
tomeras91
a2514d93fc
[nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 ( #6447 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-30 07:22:32 -04:00
Yechan Kim
22b29df38c
[nvbugs/5414909] fix: Qwen2-VL keyword on L20 ( #6427 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-30 17:29:55 +08:00
xinhe-nv
d9ab3fd35e
tests: add TestNemotronH cuda graph tests ( #6390 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-30 18:45:58 +10:00
nv-guomingz
a5540acfce
chore: add trtllm-serve json schema example into doc. ( #6418 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-30 04:33:08 -04:00
2ez4bz
d6eed1b624
[fix] Switch placement of image placeholder for mistral 3.1 ( #6435 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-30 14:10:36 +08:00
xinhe-nv
c00d6763b2
test: [CI] Add failed cases into waives.txt ( #6457 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-07-30 12:36:58 +10:00
Venky
ab40369053
[fix] Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests ( #6463 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-30 10:53:43 +10:00
Yechan Kim
d6eb8e2366
fix: support mixture of text & multimodal prompts ( #6345 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-30 08:52:31 +08:00
Yan Chunwei
ad662ddcdd
chore: disallow arbitrary in llm_args.Configs ( #6367 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-29 16:16:52 -04:00
Yan Chunwei
1a6930986a
chore: remove unused kv_cache_dtype in api reference ( #6444 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-29 14:57:20 -04:00
Michal Guzek
7efe3cb0cd
[fix] Add detokenization-based stop word logic to LLM API ( #5948 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
2025-07-29 10:16:59 -07:00
xinhe-nv
f1086e7d4f
test: [CI] remove closed bugs ( #6381 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-29 19:01:23 +10:00
xinhe-nv
4fbb344caf
test: [CI] Add failed cases into waives.txt ( #6423 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-29 19:00:30 +10:00
Yukun He
0eee2e2850
[5385981] fix: Update the usage of VisionAttention init API. ( #6413 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-07-29 16:41:48 +08:00
ruodil
e11255e9d0
test: [nvbug 5415268] add kv_cache_free_gpu_mem_fraction param and llama4 rcca cases ( #6430 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-07-29 15:52:45 +10:00
Michal Guzek
2573bb729d
feat: Add Phi-4-Mini-Instruct in Pytorch backend for LLM API accuracy tests ( #6303 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-28 14:02:14 -07:00
Aurelien Chartier
738ab61593
[nvbugs/5404000] fix: waive request_perf_metrics_draft test on pre-Hopper GPUs ( #6339 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-28 12:36:44 -07:00
2ez4bz
cdca541148
[test] Unwaive mistral3.1 small E2E test ( #6352 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-28 14:37:42 -04:00
2ez4bz
60e4d3a9d4
[test] Add accuracy regression test for Mistral3.1 ( #6322 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-28 09:41:44 -07:00
ruodil
03632a679f
test: organize perf cases and add missing perflab cases in qa test list ( #6283 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-07-28 20:33:32 +10:00
xinhe-nv
971be1fe86
test: waive failed cases ( #6394 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-28 20:31:43 +10:00
Yan Chunwei
45d441e60c
[TRTLLM-5061] chore: add status tags to LLM API reference ( #5707 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-28 15:57:07 +08:00
Ivy Zhang
2945817cae
[nvbug/5409414, 5355707] tests: adjust batchsize and decoding name ( #6292 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-07-28 15:33:30 +08:00
Emma Qiao
b3ca159787
[Infra] - Waive failed cases and fix a typo ( #6384 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-28 02:06:57 -04:00
Chang Liu
dc757799e1
[nvbugs/5401156][fix] Avoid import all models when import trtllm._common ( #6266 )
2025-07-27 23:29:21 -04:00
Yan Chunwei
908f49a4ad
[nvbug/5320234] fix: test_trtllm_bench_llmapi_launch ( #6359 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-28 09:01:10 +08:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache ( #5974 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
Iman Tabrizian
c35c78ff58
[fix][nvbugs/5390810] Improve the check for disaggregated serving test ( #6301 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-25 12:47:01 -07:00
nv-guomingz
b8d4cb8beb
feat: Support JSON Schema in OpenAI-Compatible API ( #6321 )
...
Signed-off-by: noiji <52301388+noiji@users.noreply.github.com>
2025-07-25 12:55:56 -04:00
pcastonguay
3805976e90
fix: Fixing kv_cache_events unit tests [nvbug 5362412] ( #6265 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-07-25 08:55:44 -04:00
xiaoqi
a0aecf0476
[feat]: support logit_bias ( #5354 )
...
Signed-off-by: xq25478 <xq25478@qq.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-25 09:37:41 +00:00
xinhe-nv
470544cf17
test: [CI] Add failed cases into waives.txt ( #6333 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-25 17:18:06 +10:00
xinhe-nv
6268a60ab3
tests: add test_chunked_prefill for llama4 ( #5549 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-24 23:02:00 -04:00
xinhe-nv
2dcfa90e99
test: skip llama3.3 70b test on cg4 ( #6293 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-24 19:29:56 -07:00
Mike Iovine
0f2f11f90b
[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model ( #6104 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-24 21:50:11 -04:00
Shiyu Li
375f74ecb2
[fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. ( #6237 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-07-25 08:01:40 +08:00
Stefan Niebler
0df758ec9f
[TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration ( #6217 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-24 18:04:41 +02:00
bhsueh_NV
7b6aadc800
[Fix][nvbug 5401163][nvbug 5404726][Qwen3] Fix bug of MoE on tp > 1 with trtllm moe backend ( #6235 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-24 21:47:37 +08:00
Emma Qiao
0cc1f8c03d
[Infra] - Waive failed tests in post-merge ( #6331 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-24 21:18:06 +08:00
Ivy Zhang
f290108cd8
tests: only get timeout value from pytest marker ( #6287 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-07-24 20:51:02 +08:00
liji-nv
14d94a3856
feat: Add non UB AR + Residual + Norm + Quant fusion ( #6320 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-24 05:51:43 -04:00
Iman Tabrizian
5fceaa6153
Revert "tests: add timeout_manager to tensorrt flow test cases ( #5942 )" ( #6309 )
2025-07-23 23:58:10 -04:00
Emma Qiao
82d03ca979
[Infra] - Increase unittest execution time since some tests exceed 1600 ( #6277 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-24 10:02:28 +08:00
Iman Tabrizian
7740bfa31d
Waive tests ( #6312 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-23 18:15:07 -07:00
Lucas Liebenwein
cf4f4e8d73
[AutoDeploy] disable flaky MoE nvfp4 test ( #6302 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-07-23 13:13:01 -04:00
Emma Qiao
cb737a5fcd
[Infra] - Skip failed cases ( #6299 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-23 21:26:31 +08:00
Stefan Niebler
2486eb778e
[TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler ( #6223 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-23 12:30:50 +02:00
xinhe-nv
2b0fa24175
test: [CI] Add failed cases into waives.txt ( #6289 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-07-23 19:04:21 +10:00
YueWeng
ed62a06eef
[nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue ( #6136 )
...
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-07-23 14:53:37 +08:00
Yechan Kim
83c3ed128b
chore: set default device to cpu on Multimodal models ( #5994 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-22 21:45:31 -07:00
Venky
9538c8d0e5
Add basic Nemo Ckpt Lora Loading in pytorch flow ( #6019 )
2025-07-22 19:42:45 -07:00
wili
8ecdeee300
[refactor] Simplification of Speculative decoding configs - Part 2 ( #5936 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-23 09:20:27 +08:00
Iman Tabrizian
bc2fb29c5e
[nvbugs/5401261][fix] Fix Triton backend disaggregated serving support ( #6224 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-23 05:27:16 +08:00
Lucas Liebenwein
41fb8aa8b1
[AutoDeploy] merge feat/ad-2025-07-07 ( #6196 )
...
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-07-23 05:11:04 +08:00
2ez4bz
ab7434ac62
[feat] Enable TP and batching for PixtralVisionModel / Mistral3VLM ( #6152 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-22 11:06:41 -07:00
John Calderon
b7c8a672da
[Issue 6193] Fix gemma3vl weight loader ( #6233 )
...
Signed-off-by: John Calderon <johncalesp@gmail.com>
2025-07-22 10:32:18 -07:00
Linda
60073731ca
fix: bindings unit tests for nanobind ( #6221 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-22 14:51:43 +01:00
Stanley Sun
04f2d4b2eb
test: update test list for RTX6KD ( #6213 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-07-22 18:55:24 +08:00
Pengyun Lin
48ddc3d4b9
[fix]: Revert commit 388b491 ( #6143 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
pcastonguay
310bdd9830
fix: Fix triton backend build [nvbug 5396469] ( #6098 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Yi Zhang
eb7d0f84b5
[nvbugs/5368410][fix] Disable moe allreduce for multi node ( #5918 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Nikita Korobov
9d26b7891a
fix: [5328141] increase tolerance for test_fp8_block_scale_gemm ( #5849 )
...
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Yan Chunwei
f194b65f3e
fix [nvbug/5351244]: address remote mpi session submit ( #5664 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Bo Li
537757e669
fix: [nvbugs/5351130] Adjust DSV3-Lite tests free_gpu_memory_fraction to 0.75 to prevent OOM on CI. ( #5896 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-22 12:48:00 +08:00
Bo Li
db77d83a2a
bug: [ https://nvbugs/5368507 ] Fix test_generate_with_seed. ( #6206 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-22 12:28:38 +08:00
2ez4bz
37d0b68442
[fix] Fix flaky mistral E2E test ( #6230 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-22 11:55:28 +08:00
WeiHaocheng
fddb7f1141
feat: moe prepare supports topk % 4 != 0 ( #5742 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-07-22 10:42:46 +08:00
Ivy Zhang
eb5cb5b642
tests: add timeout_manager to tensorrt flow test cases ( #5942 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-07-22 10:23:41 +08:00
Shunkangz
ee45e0c63f
feat: Refactor the fetching request logic ( #5786 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-07-22 09:16:28 +08:00
Chang Liu
7381f1dba7
[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models ( #5444 )
...
Only supports Qwen in this PR.
2025-07-21 16:11:58 -07:00
Simeng Liu
4a0951f85c
[Chore] Replace MODEL_CACHE_DIR with LLM_MODELS_ROOT and unwaive triton_server/test_triton.py::test_gpt_ib[gpt-ib] ( #5859 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-07-21 15:46:37 -07:00
Mike Iovine
9645814bdf
[chore] Clean up quickstart_advanced.py ( #6021 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-21 15:00:59 -04:00
Yi Zhang
f9b0a911fb
test: Enable GB200 torch compile multi gpu tests ( #6145 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-21 22:17:13 +08:00
Pengyun Lin
9832bef07d
[BREAKING CHANGE]: change default backend to PyTorch in trtllm-serve ( #5717 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-21 21:09:43 +08:00
Emma Qiao
e41507a253
[Infra] - Waive failed cases on recent post-merge ( #6212 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-21 21:00:18 +08:00
liji-nv
3e0fb60e50
[TRTLLM-4279] feat: Multistream initial support for torch compile flow ( #5847 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-21 19:10:22 +08:00
Linda
3efad2e58c
feat: nanobind bindings ( #6185 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
xinhe-nv
b46fd41026
test: [CI] remove closed bugs ( #6201 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-07-21 15:40:30 +08:00
Yuening Li
e8c068b4b1
[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow ( #5850 )
...
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-07-21 15:17:35 +08:00
brb-nv
ca9bc5727e
fix: Flush stale PlanParams with custom attention mask ( #6163 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-21 09:55:09 +08:00
ruodil
6a3c9f8061
test: add phi-4 multimodal and bielik-11b-v2.2 models for perf test ( #5826 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-21 11:29:19 +10:00
danielafrimi
5300a99bd8
W4A8 GEMM ( #6005 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-20 17:34:57 +03:00
amitz-nv
98428f330e
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction ( #5616 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-20 08:00:14 +03:00
bhsueh_NV
2e14c8f443
[Fix][Chore][Qwen3] fix bug of using fp4 on sm120 ( #6065 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-20 10:25:25 +08:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support ( #6133 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
wili
82d3587bb8
[refactor] Unify name of NGram speculative decoding ( #5937 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-19 12:59:57 +08:00
xiaoqi
28858c8711
feat(eagle3):support qwen3 dense model ( #5879 )
...
Signed-off-by: xq25478 <xq25478@qq.com>
2025-07-19 01:24:32 +08:00
Venky
22d4a8c48a
enh: Add script to map tests <-> jenkins stages & vice-versa ( #5177 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-19 00:50:40 +08:00
Bo Deng
2c6fa145ee
[TRTLLM-6471] Infra: unwaive nixl tests and some disagg-serve tests ( #6095 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-07-19 00:48:44 +08:00
Stefan Niebler
fd6ce7f20e
[ci] Speedup beam search unit tests with fixtures for LLM ( #5843 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-18 22:54:49 +08:00
Erin
9522cde464
fix: NVBug 5385576 py_batch_idx issue ( #6153 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-18 22:36:43 +08:00
Emma Qiao
77acb4f753
[Infra] - Waive failed tests in post-merge ( #6176 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-18 17:34:34 +08:00
Chuang Zhu
c0e416535e
fix single_disagg_test ( #6166 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-18 13:18:37 +08:00
Zhenhuan Chen
992b273045
[ https://nvbugs/5387375 ] fix(scaffolding): fix scaffolding aime test in test_e2e ( #6140 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-18 10:34:37 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
2ez4bz
8480c120b1
[fix] Fix Mistral3VLM weight-loading & enable in pre-merge ( #6105 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-17 11:04:17 -07:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Stanley Sun
9518e14f69
test: fix PytestUnknownMarkWarning: Unknown pytest.mark.timeout ( #6115 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-07-17 20:55:04 +10:00
Yi Zhang
a718486900
fix: Fix DeepSeek R1 CI ( #6129 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-17 18:24:49 +08:00
nv-guomingz
9b45499caa
test: update max_beam_width to 1 due to torchsampler changes. ( #6101 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-17 18:05:45 +08:00
Erin
de60ae47e3
chores: unwaive a few tests for v1.0 ( #6107 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-17 17:59:51 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler ( #6000 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
Iman Tabrizian
d4d21a106e
[fix] Release slots with spec decode + disagg ( #5975 ) ( #6032 )
...
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-17 12:58:18 +08:00
chenfeiz0326
fe070a0168
test: Update Llama4 Scout FP4 & FP8 accuracy tests ( #5901 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-07-17 09:41:18 +08:00
Wanli Jiang
2d2b8bae32
feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support ( #5644 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-17 06:30:58 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations ( #5868 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
shaharmor98
e0836f9ca9
[TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats ( #5372 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-17 00:50:30 +08:00
Wanli Jiang
9354114f68
fix: Update trtllm args issues with extra nested config ( #5996 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-16 12:41:45 -04:00
Emma Qiao
e30d7bec38
[Infra] - Waive failed cases in post-merge on main ( #6096 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-16 22:41:18 +08:00
Yan Chunwei
a02606a9e2
[TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend ( #5752 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:42:59 +08:00
Ivy Zhang
dda91b5117
tests: add QA test cases ( #5959 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-07-16 16:14:25 +08:00
Yan Chunwei
7568deb2f1
[nvbug/5387226] chore: add propagation for trust_remote_code to AutoConfig ( #6001 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:05:38 +08:00
Ivy Zhang
763012a88a
[nvbug/5359218][tests] add test llm api test case on lookahead with chunked prefill ( #6051 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-07-16 16:04:08 +08:00
peaceh-nv
f5f31beee1
feat: Add deepseek-lite tests for RTX pro 6000 ( #5903 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-16 15:51:45 +08:00
Zheng Duan
385af53a4d
[nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test ( #6041 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-16 13:52:13 +08:00
Wanli Jiang
8679a058a3
fix: Unable to load phi4-model with tp_size>1 ( #5962 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-16 11:39:41 +08:00
Aurelien Chartier
6a47cac981
feat: Add support for Triton request cancellation ( #5898 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-15 20:52:43 -04:00
danielafrimi
edab7532dd
feat/add latency support for trtllm bench ( #3730 )
...
Signed-off-by: Ubuntu <dafrimi@nvidia.com>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Daniel Afrimi <dafrimi@nvidia.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-07-15 13:13:49 -07:00
brb-nv
9214ac662a
test: Add regression tests for Gemma3 VLM ( #6033 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-15 11:37:56 -07:00
Fanrong Li
7a1af1c738
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/5947 ( #5989 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-07-16 01:33:12 +09:00
MinaHuai
9ebc3ab9c4
[nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision ( #5998 )
...
Signed-off-by: Mina Huai <121143971+MinaHuai@users.noreply.github.com>
2025-07-15 10:01:35 -04:00
Jaedeok Kim
ab1c54709d
fix: adjust window sizes of VSWA at torch backend ( #5880 )
...
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-15 17:41:54 +08:00
ruodil
2a147c4d01
test: add llama_v3.3_70b_cases in perf test ( #6035 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-07-15 17:53:59 +10:00
ruodil
2504aa552e
test: add recursive updating pytorch config and change MOE backend format in perf test ( #6046 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-07-15 17:53:15 +10:00
nv-guomingz
4e4d18826f
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… ( #6003 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-15 15:50:03 +09:00
Yiqing Yan
6b35afaf1b
[Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report ( #5672 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-15 12:27:21 +09:00
ixlmar
f225f5cd2e
[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs ( #5964 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-15 06:49:42 +08:00
Iman Tabrizian
c4ee535afb
[fix] fix eagle3 two model disaggregated serving test ( #6014 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-15 04:26:04 +09:00
brb-nv
f5f5be9e94
enh: Bidirectional mask with multiple images for Gemma3 ( #5976 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 22:39:18 +08:00
brb-nv
1a2d96919c
feat: Update Gemma3 Vision Encoder ( #5973 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 22:38:10 +08:00
Clay
dbf29184dc
fix #4974: A thread leak issue in scaffolding unittest ( #5020 )
...
Signed-off-by: Clay <ccs96307@gmail.com>
2025-07-14 20:22:03 +09:00
Kaiyu Xie
aa97fbb2ad
[Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage ( #5882 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-14 20:21:46 +09:00
Yiqing Yan
c720d7f779
Waive L0 test ( #6002 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-14 19:55:34 +09:00
Zhanrui Sun
3a0ef73414
infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check ( #5709 )
...
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-14 18:52:13 +09:00
Zhenhuan Chen
30608a5e6d
[ https://nvbugs/5355316 ] fix: update torch.compile option to fix triton store_cubin error ( #5865 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-14 17:17:30 +08:00
Robin Kobus
5a61d64b5b
[nvbugs/5345391] fix: chunked prefill + overlap scheduling ( #5761 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Pengyun Lin
3fcaa8a310
[nvbug 5327706][fix] fix mgmn postprocess error ( #5835 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
ruodil
347520494b
test: remove duplicate cases in perf sanity test ( #5870 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Bo Li
6d79559f3e
fix: [ https://nvbugs/5351130 ][ https://nvbugs/5333654 ] Unwaive for bug 5351130 and 5333654. ( #5821 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Bo Li
2991cf4b80
fix: [ https://nvbugspro.nvidia.com/bug/5345215 ] Unwaive for bug 5345215. ( #5606 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yan Chunwei
3e1fd983c3
[nvbug5266240] chore: unwaive test_llm_with_dummy_weights ( #5744 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Pengyun Lin
388b4919b8
[nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend ( #5541 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Pengyun Lin
6992616c1f
[nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens ( #5201 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
ruodil
278a1a7df3
test: fix some test failures and add llama_nemotron models in perf sanity test, add more torch cases ( #5693 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Iman Tabrizian
c8874a7f94
[nvbug/5337601][fix] Fix disagg + speculative decoding ( #5558 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yi Zhang
9cc4e5d50e
[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation ( #5463 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Yi Zhang
e5e87ecf34
test: Move some of the test from post merge to pre-merge, update dgx b200 test case ( #5640 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
brb-nv
869e88304a
[nvbug/5341178][fix] Fix OOM in Llama 4 accuracy test ( #5735 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
dominicshanshan
c9e7f831dc
Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default ( #5480 )
...
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-07-14 16:42:23 +08:00
Yan Chunwei
9c673e9707
[TRTLLM-6160] chore: add sampling examples for pytorch ( #5951 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-14 15:28:32 +09:00
Yan Chunwei
c30eead09f
[TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch ( #5956 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-14 14:09:39 +08:00
QI JUN
ce39409530
fix cancel request logic ( #5800 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-07-14 10:23:20 +08:00
wili
3dfc819849
[BUG5374319][fix] WAR for draft-target-model unit test errors ( #5958 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 23:48:57 +09:00
Mike Iovine
8950223f6f
[fix] Remove SpecConfig and fix thread leak issues ( #5931 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-12 21:03:24 +09:00
Enwei Zhu
bc1d4fb5da
[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) ( #5902 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-12 15:50:31 +09:00
Chang Liu
308776442a
[nvbug/5308432] fix: extend triton exit time for test_llava ( #5971 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-12 12:56:37 +09:00
Thor Johnsen
041f1fa513
[TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora ( #5885 )
...
Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
2025-07-11 16:20:41 -07:00
xinhe-nv
509363d858
tests: update sanity tests & fix tests ( #5906 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-11 19:48:19 +10:00
brb-nv
0385f89abc
test: Fix Gemma3 unit tests due to transformers upgrade ( #5921 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-10 17:24:10 -07:00
2ez4bz
c19840235d
[fix] Fix mistral unit tests due to transformers upgrade ( #5904 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-10 10:45:27 -07:00
wili
2e3cf42e03
[refactor] Simplification of Speculative decoding configs ( #5639 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-10 11:37:30 -04:00
Yiqing Yan
3aa53ec36c
[None] - Waive L0 tests ( #5915 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-10 18:33:17 +08:00
Enwei Zhu
055c4a9fe6
[NvBug 5370718, 5371538] fix: Fix incremental detokenization ( #5825 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-10 16:30:00 +08:00
CarstyYou
dc32f9ae73
[fix] fix the case where tileN % 16 != 0 & support sm89 deepgemm bmm ( #5531 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2025-07-10 15:16:18 +08:00
Anthony Chang
7d21b55b5a
[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE ( #5723 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-07-10 14:06:50 +08:00
Yan Chunwei
07f6da763d
[TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner ( #5876 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-10 11:31:35 +08:00
Venky
f57b3d6829
Waive unittest failures introduced by PR#5345 (removal of ScaffoldingOutput class) ( #5886 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-07-10 09:53:31 +08:00
peaceh-nv
76c3a12bcb
[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 ( #5636 )
...
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
2025-07-10 09:20:30 +08:00
brb-nv
3209b31665
feat: Custom masking utils for Gemma3 VLM ( #5853 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-10 06:18:04 +09:00
2ez4bz
87fe44fd29
feat(models): Mistral3.1 VLM pytorch backend support ( #5529 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-09 13:17:40 -07:00
Chang Liu
b61a717275
[1/N][TRTLLM-5195][feat] Share PyTorch tensor between processes ( #5396 )
2025-07-10 05:12:53 +09:00
Wanli Jiang
3f7cedec7c
Update transformers to 4.53.0 ( #5747 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-09 09:32:24 -07:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741] Qwen2.5VL FP8 support ( #5029 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
Omer Ullman Argov
a32f7083b4
[ci] parallelize torch unittests ( #5714 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-07-09 11:05:57 +03:00
Dom Brown
3e3b1769ad
[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner ( #5764 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 08:21:58 +01:00
Erin
e277766f0d
chores: merge examples for v1.0 doc ( #5736 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-08 21:00:42 -07:00
Lucas Liebenwein
d14dd2f597
[AutoDeploy] re-enable waive for flaky AD test ( #5867 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-07-09 11:47:48 +09:00
Bo Li
9d894bc0cb
fix: [ https://nvbugspro.nvidia.com/bug/5375656 ] Unwaive for bug 5375656. ( #5842 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-09 10:17:05 +08:00
brb-nv
2bd09ed2d4
fix: Skip rope scaling for local layers in Gemma3 VLM ( #5857 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-09 10:10:33 +08:00
Venky
e27215ca03
test: Validate and add accuracy & perf tests for Ministral-8B-Instruct[-FP8] (pytorch only) ( #5654 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-07-08 18:16:21 -07:00
xavier-nvidia
b6013da198
Fix GEMM+AR fusion on blackwell ( #5563 )
...
Signed-off-by: xsimmons <xsimmons@nvidia.com>
2025-07-09 08:48:47 +08:00
Fridah-nv
a79b73f577
fix: [5376140] [AutoDeploy] Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test ( #5855 )
...
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-07-09 09:13:31 +09:00
Yan Chunwei
e50d95c40d
chore [TRTLLM-6161]: add LLM speculative decoding example ( #5706 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-09 07:33:11 +08:00
Pamela Peng
da8c7372d4
[TRTLLM-5366][feat]Add support for sm121 ( #5524 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
The initial CI run failed a single step (A30-CPP-3) due to a timeout; rerunning that step succeeded.
2025-07-08 14:27:00 -07:00
Chang Liu
08a3dfeb2b
[nvbug/5308432] unwaive test: post-merge-triton_backend-test_llava ( #5814 )
2025-07-08 09:53:11 -07:00
Dom Brown
e3ccca06e1
test: reduce redundant test cases for TRTLLM Gen FP8 MoE ( #5845 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-07-09 00:40:33 +09:00
Kaiyu Xie
bb5b16fcb9
feat: Return context response immediately when stream_interval > 1 ( #5836 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-09 00:19:57 +09:00
Raayan Dhar
e3268a4221
[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg ( #5732 )
...
Signed-off-by: raayandhar <rdhar@nvidia.com>
2025-07-08 09:39:58 -04:00
Yegor
b01d1c28f7
[feat] Detokenize option in /v1/completions request ( #5382 )
...
Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com>
Signed-off-by: Yegor Yershov <yegor6741@gmail.com>
2025-07-08 19:36:04 +08:00
Yiqing Yan
ec0d7e64b9
[Infra] - Waive L0 test ( #5837 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-08 17:54:06 +08:00
xinhe-nv
89bbb230cc
tests: waive failed cases on main ( #5781 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-08 19:44:12 +10:00
nv-guomingz
c8fa08da5c
doc: update cuda_graph_config usage part in DS R1 docs ( #5796 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-08 16:54:46 +09:00
Enwei Zhu
55f86ce7ab
[NvBug 5362426] fix: Fix prompt adapter TP2 case ( #5782 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-08 16:01:36 +09:00
Venky
9258187e98
Waive some test_llama_eagle3 unittests ( #5811 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-07-08 15:35:27 +09:00
liji-nv
95978e3044
[fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage ( #5700 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-08 12:42:15 +08:00
nv-guomingz
0be41b6524
Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" ( #5818 )
2025-07-08 13:15:30 +09:00
Yechan Kim
5bc3a15f10
feat: add MultimodalParams, put all multimodal params into it, and refactor HyperCLOVAX & Qwen2/2.5-VL ( #5522 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-07 18:03:12 -07:00
nv-guomingz
5a8173c121
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… ( #5795 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-08 08:52:36 +08:00
Omer Ullman Argov
1191555cce
[ci] speedup fused moe tests ( #5726 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-07-07 18:03:15 +03:00
Robin Kobus
30a19fcf7c
[TRTLLM-6291] feat: Add user-provided speculative decoding support ( #5204 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-07 16:30:43 +02:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow ( #5615 )
...
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
Yi Zhang
ed1b3c884a
fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests ( #5774 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-07 18:38:54 +09:00
Yan Chunwei
dfce61f4b9
[TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler ( #5751 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-07 17:05:14 +08:00
xinhe-nv
ded38ebdbd
test: [CI] remove closed bugs ( #5770 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-07 18:06:07 +10:00
Bo Li
9db2e9ee47
fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. ( #5772 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-07 14:58:32 +08:00
Yanchao Lu
2013034948
[Test] - Waive or fix few known test failures ( #5769 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-07-06 21:14:16 +08:00
Stefan Niebler
d1112aac37
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow ( #5333 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-05 01:35:13 +09:00
Chuang Zhu
ffc0b8f5da
Cache transceiver support VSWA ( #5505 )
...
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-05 01:18:42 +09:00
Yiqing Yan
7f3ea058f0
[Infra] - Waive L0 flaky test ( #5759 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-04 19:25:12 +09:00
Shunkangz
32339d1b20
Raise shutdown error for each request ( #4936 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-07-04 18:58:24 +09:00
xinhe-nv
3869b969a6
test: [CI] Add failed cases into waives.txt ( #5718 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-04 17:24:48 +09:00
Faraz
81c0764012
Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120" ( #5724 )
...
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
2025-07-04 16:53:20 +09:00
Yiqing Yan
b8fef809ae
[Infra] - Waive L0 test ( #5748 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-04 15:04:49 +08:00
Yuan Tong
32b244af38
feat: reduce unnecessary kernel generation ( #5476 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-07-04 14:37:49 +08:00
Emma Qiao
a0135c0f6f
[Infra] - Waive failed cases on release/0.21 ( #5674 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-04 13:14:13 +08:00
brb-nv
cdaa6abce7
fix: Investigate Gemma3 1B decoder output discrepancy ( #5564 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Yi Zhang
73d30a23c7
test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests ( #5397 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Zheng Duan
cb9f596dbe
[nvbug 5300551] test: increase block count in eviction test ( #5465 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
nv-guomingz
d0b3d2ac65
fix: https://nvbugs/5362398 ( #5609 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
Yan Chunwei
77288d3671
fix [nvbug5351244]: test_mpi_session submit sync/async ( #5608 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-04 13:14:13 +08:00
xinhe-nv
7f837b6e8b
tests: waive failures on main ( #5704 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-07-04 12:39:12 +09:00
Venky
4762e0b244
Waive tests: test_openai_lora, test_trtllm_serve_lora_example and test_openai_chat_structural_tag_example ( #5740 )
...
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-07-04 11:01:08 +09:00
Lucas Liebenwein
24ac9b5f69
[AutoDeploy] merge feat/ad-2025-06-29 ( #5737 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-07-04 10:21:18 +09:00
Netanel Haber
f91379b7e8
delete duplicate eagle3 and ngram tests ( #5711 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-07-03 15:47:26 +03:00
Omer Ullman Argov
c72856188c
[ci] small multigpu speedups ( #5643 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-07-03 08:06:10 -04:00
Emma Qiao
530897388c
[Infra] - Waive a failed case on main ( #5702 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-03 06:09:27 -04:00
tomeras91
7dbecf7272
[TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H ( #5646 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-07-03 11:07:51 +03:00
Emma Qiao
2a5fdebf10
[Infra] - Waive failed tests for main 0702 ( #5671 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-02 22:05:07 -04:00
Fridah-nv
afef5127f0
feat:[AutoDeploy] E2E build example for llama4 VLM ( #3922 )
...
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-07-02 19:29:34 -04:00
Emma Qiao
31699cbeb1
[Infra] - Set default timeout to 1hr and remove some specific settings ( #5667 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-02 08:37:54 -04:00
Jhao-Ting Chen
77082cde38
[ https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op ( #5146 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-07-02 04:54:43 -04:00
qixiang-99
ca7b6ec8d8
Feat/pytorch vswa kvcachemanager ( #5151 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Yan Chunwei
2d69b55fe8
chore: enhance yaml loading of arbitrary options in LlmArgs ( #5610 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-02 14:21:37 +08:00
Xiaowei Wang
32dfdfba30
feat: fuse w4a8 moe pre-quant scale on Hopper ( #5613 )
...
Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2025-07-01 23:02:41 -04:00
HuiGao-NV
10c50515c2
fix: Add back allreduce_strategy parameter into TorchLlmArgs ( #5637 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-07-02 09:49:20 +08:00
Aurelien Chartier
fa95e402a5
feat: add LlmArgs option to force using dynamic quantization ( #5346 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-07-01 12:16:09 -07:00
liji-nv
c345f5876c
[feat] Support torch compile for attention dp ( #5086 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-07-01 13:48:52 -04:00
Kaiyu Xie
f9a455651b
perf: Use tokenizers API to optimize incremental detokenization perf ( #5574 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-01 09:35:25 -04:00
Yan Chunwei
3bc703d450
ci: unwaive llmapi launch test ( #5281 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Emma Qiao
178fc3f655
[Infra][release/0.21] - waive failed tests ( #5537 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-01 20:12:55 +08:00
Yan Chunwei
ee7fcbf20e
[nvbug 5273941] fix: broken cyclic reference detection ( #5417 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
ruodil
ded203d8aa
test: set enable_attention_dp=True in default deepseek settings ( #5461 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
brb-nv
4ef60d5fbb
nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 ( #5453 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Ivy Zhang
61213e3562
tests: fix typos in qa test ( #5421 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-07-01 20:12:55 +08:00
Yan Chunwei
a5eff139f1
[TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) ( #5431 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-01 19:06:41 +08:00
Emma Qiao
65c2b93284
[Infra] - Add some timeout and unwaive a test which dev fixed ( #5631 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-07-01 05:01:32 -04:00
Pamela Peng
071ad758c4
[ https://nvbugs/5318059 ][test] Unwaive test ( #5624 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-07-01 04:54:44 -04:00
Robin Kobus
5f77d212ef
test: Reduce number of C++ test cases ( #5437 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-01 09:40:49 +02:00
danielafrimi
7a617ad1fe
feat: W4A16 GEMM ( #4232 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-07-01 10:36:05 +03:00
xinhe-nv
19c56f0374
test: [CI] Add failed cases into waives.txt ( #5582 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-07-01 14:57:03 +08:00
Stanley Sun
7135b27284
rcca: test default kv_cache_reuse option for pytorch multimodal ( #5544 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-07-01 12:12:48 +08:00
xinhe-nv
a8cf611baa
test: [CI] Add failed cases into waives.txt ( #5569 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-07-01 11:02:56 +08:00
xinhe-nv
9b17b29b6e
test: [CI] remove closed bugs ( #5572 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-07-01 10:15:43 +08:00
Yi Zhang
7cf1209a19
[fix]: Fix main test skip issue ( #5503 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-30 21:39:49 -04:00
Wei-Ming Chen
f28cd3056e
feat: AutoDeploy fp8 quantization support for bmm ( #3849 )
...
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
2025-06-30 12:36:34 -04:00
nv-guomingz
6e48ac25a6
chore: remove cuda_graph_ prefix from cuda_graph_config field members. ( #5585 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 12:23:14 -04:00
Yan Chunwei
98a7c24062
chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs ( #5595 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-30 20:40:23 +08:00
Omer Ullman Argov
42134b8b84
[ci] move eagle1 and medusa tests to post-merge ( #5604 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-30 19:32:28 +08:00
Fanrong Li
6cbc9a5297
[nvbug/5354946][fix] Fix mtp vanilla draft inputs ( #5568 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-30 15:59:12 +08:00
WeiHaocheng
42a9385d02
[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare ( #5570 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-06-30 13:06:09 +08:00
Omer Ullman Argov
1db63c2546
[fix] speedup modeling unittests ( #5579 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-30 06:30:45 +03:00
Yiqing Yan
4fef14da56
Deduplicate waive list ( #5546 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-30 11:12:26 +08:00
nv-guomingz
578430e64c
[TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1 (cuda_graph_config) ( #5014 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-30 11:05:40 +08:00
Omer Ullman Argov
2780fc27a7
[ci] remove MMLU if followed by GSM8K ( #5578 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-30 05:29:54 +03:00
Cheng Hang
64db7d27f6
[feat] Optimizations on weight-only batched gemv kernel ( #5420 )
...
Signed-off-by: Cheng Hang <chang@nvidia.com>
2025-06-30 10:20:16 +08:00
Omer Ullman Argov
94dc97ab10
[feat][test] reuse MPI pool executor across tests ( #5566 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-29 17:23:12 +03:00
tomeras91
a1c1c6b504
[CI] reduce mamba2 ssm test parameterization ( #5571 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-29 15:56:23 +03:00
Talor Abramovich
70e34a3291
[TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve ( #5376 )
...
Signed-off-by: Talor Abramovich <talora@nvidia.com>
2025-06-29 12:46:30 +00:00
amirkl94
a985c0b7e6
tests: Move stress tests to be Post-Merge only ( #5166 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-06-29 09:44:47 +03:00
Emma Qiao
9db769ee62
[Infra] - Add import pytest ( #5565 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-29 11:06:14 +08:00
Lucas Liebenwein
619709fc33
[AutoDeploy] merge feat/ad-2025-06-13 ( #5556 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-29 03:52:14 +08:00
Li Min
6021a439ab
Make moe permute and final as custom op ( #5412 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-06-27 15:48:33 -07:00
Iman Tabrizian
26b953e29a
[nvbugs/5309940] Add support for input output token counts ( #5445 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-28 04:39:39 +08:00
Darragh Hanley
5437075def
ReDrafter support for Qwen ( #4875 )
...
Signed-off-by: darraghdog <darragh.hanley@gmail.com>
Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
2025-06-28 02:33:10 +08:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI ( #5497 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
wili
56cdfe5c6c
[TRTLLM-5000][feat] NGrams V2 ( #4569 )
...
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-06-27 23:00:17 +08:00
Omer Ullman Argov
6fc1c6fd7b
[fix][ci] correct unittests test prefix ( #5547 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-27 20:34:44 +08:00
Enwei Zhu
7f1893f54c
ci: waive flaky test test_llama_eagle3 ( #5548 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-27 19:16:07 +08:00
Emma Qiao
980030c816
[Infra] - Waive failed case in post-merge ( #5536 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-27 13:55:49 +08:00
Iman Tabrizian
49af791f66
Add testing for trtllm-llmapi-launch with tritonserver ( #5528 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-27 11:19:52 +08:00
xinhe-nv
a3494bebec
tests: waive failed tests on main ( #5512 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-27 10:13:22 +08:00
Yibin Li
0f3bd7800e
[TRTLLM-4971]: Use safe deserialization in ParallelConfig ( #4630 )
...
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-27 09:58:41 +08:00
Frank
aa6e015ef8
Update trtllm-bench to support new Pytorch default. ( #5491 )
...
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-06-26 17:05:43 -07:00
jmydurant
8836990bde
[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) ( #5475 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 22:18:08 +08:00
Robin Kobus
8dfa31c71d
refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead ( #5384 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-26 19:45:52 +08:00
Omer Ullman Argov
6bae76d7ca
[fix][ci] move torch tests to run under torch stage ( #5473 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 14:31:38 +03:00
Omer Ullman Argov
1633bd2bef
[CI] move flashinfer llama tests to post merge ( #5506 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 19:27:32 +08:00
xinhe-nv
ff2dd72df4
tests: waive tests ( #5458 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-26 14:53:55 +08:00
Bo Li
1bab9000a6
perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf ( #5318 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-26 14:03:56 +08:00
dongxuy04
490d2e5819
feat: large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) ( #5226 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 22:25:13 -07:00
Emma Qiao
32d1573c43
[Infra] - Add timeout setting for long tests found in post-merge ( #5501 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-26 11:31:39 +08:00
Venky
d9b75f83fd
[CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] ( #5494 )
...
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-25 20:17:12 -07:00
jmydurant
578dbc8d9a
feat: chunked prefill for MLA (Blackwell) ( #4651 )
...
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-06-26 09:01:00 +08:00
HuiGao-NV
74ae15a26b
CI: enable test cases on single device type ( #5484 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-26 08:03:44 +08:00
QI JUN
feaf789342
CI: reduce BF16 test cases in B200 ( #5482 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 07:18:20 +08:00
Omer Ullman Argov
bdc8dfebc3
[fix][ci] dont build wheel for cpp tests ( #5443 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-26 00:13:47 +03:00
Omer Ullman Argov
61bb71fd1b
[fix][test] remove test in global scope ( #5470 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-25 23:42:26 +03:00
QI JUN
3a2c4ca77b
chore: split _build_model method for TorchLlm and TrtLlm ( #5418 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-06-26 04:32:46 +08:00
Daniel Cámpora
205c97a4ae
[TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler ( #5328 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-25 17:41:36 +02:00
HuiGao-NV
314f15f0a7
Fix: fix nvbug 5356427 ( #5464 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 22:24:26 +08:00
HuiGao-NV
cc3c2b3be2
Move 3 disaggregated cases from 4-GPU devices to a 1-GPU device ( #5457 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 21:38:14 +08:00
Kaiyu Xie
d6ada5ffce
[nvbug/5354956] fix: unexpected keyword argument 'streaming' ( #5436 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-25 20:37:24 +08:00
QI JUN
2901c5a5bc
CI: waive test_ad_build_small_multi ( #5471 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 16:44:42 +08:00
Netanel Haber
3ca2f6ac51
start OAIServer with max_beam_width=1 for TorchSampler ( #5427 )
...
Signed-off-by: Netanel Haber <nhaber@nvidia.com>
2025-06-25 15:52:06 +08:00
Enwei Zhu
fc7a81ceb0
test: Add LLGuidance test and refine guided decoding ( #5348 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 14:12:56 +08:00
Enwei Zhu
76da7fed86
fix (NvBug 5354925): Fix static EPLB ( #5411 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-25 13:14:40 +08:00
HuiGao-NV
da98e03747
tests: Set kv cache free memory fraction in test case ( #5433 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-25 12:31:58 +08:00
Shunkangz
d5354897c0
feat: Dynamically remove servers in PD ( #5270 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-25 09:50:04 +08:00
Lucas Liebenwein
5cffb7e0ec
[AutoDeploy] Merge feat/ad_2025_06_13 feature branch ( #5454 )
...
Signed-off-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
Co-authored-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-25 09:30:13 +08:00
QI JUN
241f921800
waive test_moe.py::test_moe_fp8[autotune] ( #5455 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-25 09:14:44 +08:00
dongxuy04
699520082b
Add MTP support for Online EPLB ( #5213 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-25 07:58:13 +08:00
Iman Tabrizian
846bbf1edc
Fix test Pytorch model engine ( #5416 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-24 11:09:27 -07:00
HuiGao-NV
35a92f6bab
Add debug hook to support dumping tensor data and adding new debug functions easily ( #5182 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-24 17:45:28 +08:00
Emma Qiao
475272046a
[Infra] - Waive failed tests in post-merge and increase some timeout setting ( #5424 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-24 17:19:31 +08:00
xinhe-nv
658fb5b54e
tests: update benchmark test lists ( #5365 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-24 15:23:38 +08:00
xinhe-nv
4b32a3f1a7
test: [CI] remove closed bugs ( #5400 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-24 13:39:57 +08:00
Robin Kobus
b3045c44b9
refactor: remove TrtGptModelOptionalParams ( #5165 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-20 10:31:40 +02:00
Fanrong Li
5d4ab47d5b
fix: refactor and fix mtp vanilla ( #4762 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-20 05:23:39 +08:00
Yan Chunwei
9bd42ecf9b
[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default ( #5312 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-20 03:01:10 +08:00
Kaiyu Xie
7246fd75d1
feat: Support stream_interval ( #5284 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-19 21:57:10 +08:00
Enwei Zhu
bca758fce1
fix: Fix DS-R1 nvfp4 test case naming ( #5361 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-19 15:50:43 +08:00
Emma Qiao
493f268b1c
[Infra]Fix l0_sanity_check.yml which also has gb202 and gb203 ( #5360 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-19 15:05:57 +08:00
hlu1
b558232ce1
Refactor CutlassFusedMoE ( #5344 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-06-19 00:04:07 -07:00
ruodil
e22e884b02
test: amend test case name in perf cluster test ( #5356 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-06-19 14:50:12 +08:00
ruodil
21ce9b6749
test: add qwen3 cases ( #5302 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-19 14:38:36 +08:00
amitz-nv
1753202b61
[TRTLLM-5825][fix] Fix torch LoRA TP ( #5338 )
...
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-19 09:12:00 +03:00
Emma Qiao
7f68de3e3f
Refactor test timeout for individual long case ( #4757 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-19 13:52:11 +08:00
bhsueh_NV
dce8620013
chore: enable moe_backend on Qwen3 test ( #5230 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-06-19 13:40:45 +08:00
xinhe-nv
e5400eeae0
tests: add ds r1 tp4 test ( #5197 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-19 12:48:33 +08:00
Yiqing Yan
da576bcafa
Waive L0 test ( #5349 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-19 12:01:11 +08:00
Fanrong Li
6c3210a8be
[test] add nvfp4 DeepSeek-V3-Lite-mtp tests ( #5125 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-19 09:48:22 +08:00
nv-guomingz
6a388b105a
chore: remove torch_compile prefix for TorchCompileConfig field members ( #5261 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-19 09:21:51 +08:00
Yan Chunwei
3946e798db
fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances ( #4727 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-19 06:13:53 +08:00
Omer Ullman Argov
0b6d005ef6
[fix][test] clear cuda cache before unittests automatically ( #5121 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-19 00:36:53 +03:00
Aurelien Chartier
d25f93c07f
chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head ( #5293 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-18 11:13:12 -07:00
Omer Ullman Argov
5010f8719d
[fix][test] remove duplicate test runs ( #5241 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-19 01:59:54 +08:00
Omer Ullman Argov
a28a152001
[fix][test] remove some cpp test cases from h100 ( #5335 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-18 20:40:26 +03:00
yuanjingx87
a1c5704055
[feat] Multi-node CI testing support via Slurm ( #4771 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: yuanjingx87 <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-19 01:11:12 +08:00
Iman Tabrizian
e5ee5c5352
Unwaive disaggregated serving accuracy tests ( #5095 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
2025-06-19 00:41:15 +08:00
HuiGao-NV
d13d2f460d
Remove duplicated test cases ( #5323 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-18 21:20:20 +08:00
Emma Qiao
b29ac5b561
[Infra] Update 5080 and 5090 case condition due to the driver update ( #5317 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-18 20:01:36 +08:00
xinhe-nv
610a49f117
tests: add multi nodes tests ( #5196 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-18 18:08:04 +08:00
Yi Zhang
375dd0b971
Waive L0 ( #5311 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-18 16:40:41 +08:00
Yuan Tong
f599ee63c1
test: correct unittest rerun behavior ( #5273 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-06-18 16:37:19 +08:00
Robin Kobus
38547b92f3
refactor: Introduce ResourceManagerType enum for resource management ( #5246 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 09:55:59 +02:00
Wanli Jiang
3a02489e86
[TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support ( #5159 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-06-18 15:12:49 +08:00
QI JUN
9ea7bb67a4
CI: fix TensorRT H200 tests ( #5301 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-18 14:40:57 +08:00
ruodil
3b5d916250
test: cherry-pick deepseek rcca cases in main branch ( #5307 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-18 14:26:26 +08:00
Yiqing Yan
8f67e3604d
Waive L0 tests ( #5308 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-18 12:43:45 +08:00
Omer Ullman Argov
f501ce57b1
[fix][test] move deepseek single gpu tests to post merge ( #5280 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-18 06:59:39 +03:00
dominicshanshan
3c0fecbf42
CI: extend model weights load time for dsv3 in stress test. ( #5275 )
...
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-06-18 11:51:48 +08:00
Ivy Zhang
41cfcaa964
test: update qa test list ( #5305 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-18 11:29:11 +08:00
Emma Qiao
ff32caf4d7
[Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 ( #4885 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-17 23:48:34 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind ( #5224 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Yanchao Lu
f4cdbfcdf0
None - Some clean-ups for the automation pipeline ( #5245 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-17 21:08:24 +08:00
Dom Brown
44fb3c1673
[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner ( #5207 )
...
- Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning.
- Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution.
- Extends the unit test to run both autotuned and non-autotuned code paths.
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-17 21:01:56 +08:00
amirkl94
8451a87742
chore: Mass integration of release/0.20 ( #5082 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 14:32:02 +03:00
liji-nv
13eef642e6
[feat] Piecewise cuda graph support for MLA ( #4467 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-17 18:58:38 +08:00
QI JUN
ccd9adbe33
CI: move multi-gpu test cases of tensorrt backend to h200 ( #5272 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 17:37:37 +08:00
Ivy Zhang
2ad8758ecc
[TRTLLM-5786][ https://nvbugspro.nvidia.com/bug/5310520 ][test] Add QA test cases ( #5073 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 17:14:01 +08:00
QI JUN
517c1ecf72
move some test cases of TensorRT backend back ( #5232 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 17:03:11 +08:00
qsang-nv
134cb66a53
fix mla test ( #5240 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-17 15:26:25 +08:00
xinhe-nv
a49ad790b3
test: [CI] remove closed bugs ( #5218 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-17 13:13:23 +08:00
QI JUN
546274d40e
fix ci ( #5259 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 12:03:09 +08:00
ruodil
bb2348372c
test: add more pytorch cases in perf test ( #5237 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-06-17 11:11:28 +08:00
Mike Iovine
c53bc19f5e
[infra] Make test_chunked_prefill faster ( #5248 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-17 04:19:47 +08:00
Simeng Liu
5c18160d27
chore: Waive CI failure. ( #5252 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-06-16 20:47:05 +02:00
Izzy Putterman
e607768e45
Speculation: Draft Target in new FW ( #4558 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
2025-06-17 02:26:08 +08:00
Yilin Fan
dd29063538
[feat] Add llm args to tune python gc threshold ( #5141 )
...
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 17:45:22 +08:00
Ivy Zhang
64b7f04fdc
[test] split nemotron test cases from examples_test_list ( #5238 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-16 16:36:33 +08:00
xinhe-nv
802f22cd12
test: [CI] Add failed cases into waives.txt ( #5221 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-16 16:11:53 +08:00
Yiqing Yan
8445416c39
Waive L0 tests ( #5233 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-16 15:19:03 +08:00
Anthony Chang
4f9fa9f21d
feat: MoE trtllm backend kernel update ( #5183 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-16 14:46:13 +08:00
Wanli Jiang
0acf23185e
[Stress test] Add DeepSeek-R1 stress test ( #5033 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-06-16 11:54:31 +08:00
Tracin
ef3fdc8051
feat: Add w4a8_mxfp4_fp8 quantization recipe. ( #4867 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-06-16 11:30:57 +08:00
Yi Zhang
9b616db13b
test: Add fixture to skip tests based on MPI world size ( #5028 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-16 11:25:01 +08:00
ruodil
2848e012ae
test: add llama4 models for perf test ( #5187 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-16 11:24:35 +08:00
ruodil
3d22f27063
test: add more cases for llama_v3.3/3.1 70b fp8 and set enable_attention_dp to false for non-deepseek models ( #5155 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-06-16 11:23:20 +08:00
Enwei Zhu
babdd9ce06
test: Add json_mode_eval for guided decoding evaluation ( #5179 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-16 10:03:55 +08:00
Yan Chunwei
c84e41fd9d
fix: build_config in TorchLlmArgs and avoid arbitrary args ( #4972 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-15 17:51:56 -07:00
amitz-nv
109c426077
Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow ( #5130 )
2025-06-15 18:54:04 +03:00
Omer Ullman Argov
4eade3ae33
[fix][test] Speedup Nemotron NAS unittests ( #5202 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-15 11:26:03 +03:00
Kaiyu Xie
dce1dcc4f9
feat: Support post_proc for bench ( #5122 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-15 13:02:38 +08:00
ixlmar
e055af1bc9
chore: improve disagg test failure detection ( #4738 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-15 01:28:26 +08:00
Aurelien Chartier
1389f5a4d3
feat: Add support for fp8 rowwise quantization ( #4876 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>
2025-06-14 06:37:48 -07:00
Tailing Yuan
0b60da2c45
feat: large-scale EP (part 7: DeepEP integration) ( #4792 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-14 19:12:38 +08:00
yunruis
b99c5ce8c1
Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL ( #4560 )
...
Signed-off-by: yunruis <yunruis@nvidia.com>
Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>
Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>
2025-06-14 17:36:22 +08:00
nv-guomingz
3b7b5a5ad5
refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3 (torch_compile_config) ( #5032 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-14 14:23:13 +08:00
Enwei Zhu
5f2785fb90
fix: Fix waive list ( #5205 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-13 23:33:23 +08:00
Mike Iovine
25aa3881d7
[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len ( #4879 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-13 11:06:36 -04:00
QI JUN
952f33dcad
CI: move all test cases of TensorRT backend into post merge ( #5186 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-13 20:48:48 +08:00
xinhe-nv
30d9d0fa71
test: [CI] Add failed cases into waives.txt ( #5178 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-13 16:38:51 +08:00
Zheng Duan
4d0a5ad384
chore: gracefully exit disagg process in tests; better startup and logging ( #5109 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-13 14:03:55 +08:00
Ivy Zhang
28cd536bd6
[test] Update timeout params in QA test list ( #5124 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-13 13:40:03 +08:00
Iman Tabrizian
01bd4c00b4
Add two MTP disaggregated tests ( #4546 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-13 12:17:45 +08:00
Daniel Cámpora
dec326ba7d
[fix] Reenable test return logits ( #5160 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-13 06:07:22 +02:00
Yibin Li
b79eb34bfe
[fix]: Fall back to HMAC to Avoid IPC Serialization Churn ( #5074 )
...
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-13 11:37:50 +08:00
xinhe-nv
d9be419f45
tests: update tests for b200 ( #5180 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-13 11:25:33 +08:00
ruodil
fa582cbe9a
test: add more cases for rtx_pro_6000_se and add option kv_cache_dtype in perf test ( #5083 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-06-13 11:09:15 +08:00
Yuxian Qiu
4ae46b6714
fix: [nvbugs/5324229] Fix broken WInt4AFP8FusedMoEMethod since FusedMoE refactor. ( #4930 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-13 10:21:32 +08:00
Fanrong Li
38a907aaca
[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance ( #5119 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-13 08:58:44 +08:00
Matthias Jouanneaux
a0b6c635b1
[feat] trtllmGen MoE routing: added support for top groups and top K bounds ( #4063 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-06-13 06:00:02 +08:00
Omer Ullman Argov
655bce0b19
[fix][test] report individual unittest results to jenkins ( #5116 )
...
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
2025-06-13 01:52:09 +08:00
HuiGao-NV
dfeeaf6746
Move allreduce_strategy from committed api to reference ( #5147 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 21:00:20 +08:00
nv-guomingz
cf35a079f9
fix: https://nvbugs/5298661 ( #5022 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-12 20:41:44 +08:00
Shi Xiaowei
88cba5f354
test: waive the NIXL related tests ( #5153 )
...
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-12 17:02:27 +08:00
Fanrong Li
4d070d3862
chore: fix typo in tests ( #5092 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-12 15:11:26 +08:00
Michal Guzek
53983ad273
[TRTLLM-4932] Add Llama-3.1-Nemotron-Nano-8B-v1-FP8 accuracy tests ( #4933 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-06-12 15:06:28 +08:00
ruodil
d021cc5126
test: set enable_attention_dp to False for non-deepseek models and add more cases for llama_v3.1/3.3 70b fp8 models ( #5149 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-12 14:59:16 +08:00
tomeras91
06d9f1e2f6
[test] Use LLM API for Nemotron-H correctness test ( #5097 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-12 09:54:46 +03:00
bhsueh_NV
505678a286
update the free_gpu_mem_fraction for H100 qwen3 qa test ( #5114 )
...
Signed-off-by: root <root@eos0274.eos.clusters.nvidia.com>
Co-authored-by: root <root@eos0274.eos.clusters.nvidia.com>
2025-06-12 14:40:57 +08:00
Michal Guzek
0daa70999a
Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs ( #4961 )
...
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-06-12 14:32:04 +08:00
Venky
c3b2eb6dab
test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨ ( #5066 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-06-12 14:19:15 +08:00
Lucas Liebenwein
49d7268acc
[nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade ( #5106 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-12 13:07:27 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA ( #4667 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
HuiGao-NV
43192379af
Use backend to replace macro to control enablement of MNNVL all reduce ( #4635 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
xinhe-nv
11b94feff8
test: skip disaggregated tests on arm ( #5070 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-11 17:00:10 +08:00
ruodil
56abae0835
test: add more llama_v3.3_70b cases in perf test ( #4979 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-11 15:44:22 +08:00
Daniel Cámpora
fdf1c47d1d
[TRTLLM-4995][feat] TRTLLM Sampler log probs support ( #4836 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-11 08:18:13 +02:00
Yiqing Yan
0a9f105931
Waive L0 tests ( #5111 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-11 11:53:15 +08:00
ChristinaZ
273c6b9355
[ https://nvbugspro.nvidia.com/bug/5332927 ][fix] Fix the bug in the routing unit test ( #5065 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-11 09:44:35 +08:00
Zheng Duan
580a92521e
test: conditional disagg and cache aware balancing for deepseek v3 ( #4522 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-11 09:44:29 +08:00
Bo Li
1b79041f5d
fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. ( #4264 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-11 09:38:10 +08:00
Mike Iovine
fcd71921f1
[fix] Unwaive test_llama_eagle3 ( #5042 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-10 18:11:07 -04:00
Jinyang Yuan
194a708d83
[fix] Fix test_attention_mla ( #5084 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-06-10 14:20:11 -07:00
nvpohanh
7b210ae9c3
test: add unit tests for Llama4 min_latency code ( #4980 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-06-10 12:10:26 -07:00
Lucas Liebenwein
7ddc4d6282
[AutoDeploy] Merge Feature Branch Week 3 ( #5054 )
...
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
2025-06-11 00:20:43 +08:00
Tracin
6c91f1c7ac
Mxfp8xmxfp4 quant mode ( #4978 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
liji-nv
f6a49a9343
[CI] waive failing L0 test ( #5089 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-10 20:40:44 +08:00
Zongfei Jing
6d1f2d0fd7
[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion ( #4756 )
...
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-06-10 19:55:16 +08:00
Yiqing Yan
8ec8e4559d
Waive L0 test ( #5077 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-10 16:23:49 +08:00
tomeras91
f121f13ddf
[nvbug 5325284][fix] Increase Nemotron-H warmup request robustness ( #4954 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-10 11:09:37 +03:00
Yiqing Yan
fdfc711261
Waive L0 test ( #5067 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-10 15:40:57 +08:00
QI JUN
12ffdcbf53
CI: waive test_ad_build_small_multi ( #5071 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 14:54:05 +08:00
Simeng Liu
86959ef1e4
chore: Waive CI failure. ( #5069 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-06-10 14:04:10 +08:00
Stanley Sun
74b0e71ef4
test: add more disaggregated serving tests into QA testlist ( #5036 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-06-10 09:24:53 +08:00
tburt-nv
e2bd01fa18
[ https://nvbugs/5332927 ] Waive new tests ( #5051 )
...
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-06-10 05:17:54 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) ( #4145 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Yuxian Qiu
e79527d195
chore: Refine weight prefetching. ( #4893 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 21:24:16 +08:00
pcastonguay
5b84fd9201
[nvbug 5283506] fix: Fix spec decode triton test ( #4845 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-09 08:40:17 -04:00
Mike Iovine
f4d9c87c51
[nvbug/5314469][feat] Include the executor's max batch size in CUDA g… ( #4843 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-09 08:31:35 -04:00
Yukun He
137fe35539
fix: Fix warmup phase batch size out of range. ( #4986 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-09 19:19:16 +08:00
Yuxian Qiu
88480197da
ci: [nvbugs/5280806] Unwaive unittests/_torch. ( #4951 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-09 19:04:11 +08:00
Dom Brown
9c012d5bf8
[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner ( #4872 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-09 11:02:48 +01:00
liji-nv
1d4f748773
[fix] Fix illegal mem access and possible accuracy lose. Cherry-pick … ( #5017 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-09 17:50:57 +08:00
ChristinaZ
f45aff2b7d
Add customized renormalized moe routing kernel for moe cutlass backend ( #4955 )
...
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-09 17:38:50 +08:00
Yiqing Yan
6b17dff2f1
Waive L0 test ( #5024 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-09 16:03:15 +08:00
Yan Chunwei
f4bfb8e49d
ci: unwaive llmapi launch test ( #4991 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-09 13:25:43 +08:00
amitz-nv
77e8d739f1
[TRTLLM-4987][feat] Support generation logits in TRTLLMSampler ( #4819 )
2025-06-09 06:30:01 +03:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way ( #4799 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
nv-guomingz
78472339b3
fix: https://nvbugs/5324252 ( #4925 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-09 01:15:45 +08:00
Omer Ullman Argov
8731f5f14f
chore: Mass integration of release/0.20 ( #4898 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-08 23:26:26 +08:00
Mike Iovine
ec0d984656
[nvbug/5280806][fix] Fix 2 model spec decode flow ( #4807 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-08 07:40:02 -04:00
Yanchao Lu
9e05613679
[Infra] - Update JNLP container config ( #5008 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-06-08 16:44:09 +08:00
dongxuy04
1e369658f1
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) ( #4818 )
...
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
QI JUN
5ee0de7f2a
Resubmit #4894 ( #4969 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-08 04:42:15 +08:00
Ivy Zhang
7dce328ad6
[TRTLLM-5692][tests] Add speculative decoding test cases on torch flow ( #4940 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Ruodi Lu <ruodil@nvidia.com>
Co-authored-by: Ruodi Lu <ruodil@nvidia.com>
2025-06-07 11:18:32 +08:00
nv-guomingz
0c7dd660d8
fix: https://nvbugs/5324248 ( #4973 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-06-07 04:14:07 +08:00
Fanrong Li
75d020cf07
fix: fix cuda graph padding for spec decoding ( #4853 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-06 22:21:42 +08:00
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM ( #4826 )
...
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
QI JUN
1b963c17c0
CI: waive test_llm_multi_node_with_postproc ( #4977 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-06 14:19:56 +08:00
xinhe-nv
564472168e
test: [CI] Add failed cases into waives.txt ( #4966 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-06 10:30:15 +08:00
QI JUN
ec50684d80
Revert "fix a bug of global cuda graph dummy request" ( #4970 )
2025-06-06 08:54:45 +08:00
QI JUN
154f7cc40a
fix a bug of global cuda graph dummy request ( #4894 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 19:47:40 +08:00
Yiqing Yan
7e921c78b5
Waive L0 tests ( #4953 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-05 19:36:48 +08:00
Shunkangz
3eae58ca36
Add disaggregated unittest ( #4899 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-05 19:14:31 +08:00
ixlmar
a1526356aa
[TRTLLM-5630] restore free_gpu_memory_fraction=0.9 in tests ( #4859 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-05 10:46:29 +01:00
QI JUN
b8c5e3892b
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" ( #4949 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 17:43:30 +08:00
QI JUN
d5a8079eb6
Revert "[infra] Unwaive unittests/_torch" ( #4950 )
2025-06-05 17:21:07 +08:00
Lucas Liebenwein
743fb0a159
[AutoDeploy] _AutoDeployLlmArgs as primary config object ( #4891 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 17:20:55 +08:00
QI JUN
91e8d43d66
CI: waive test_llm_get_queued_stats ( #4945 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 16:44:56 +08:00
xinhe-nv
1c3091c63b
tests: [TRTQA-2906] add benchmark serving tests ( #4901 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-05 14:33:03 +08:00
Netanel Haber
ddbaa5ef80
Only pass fast_build=true to non-pytorch backend ( #4920 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-05 13:30:17 +08:00
Yiqing Yan
9ceef983c0
Waive L0 tests ( #4927 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-05 11:09:01 +08:00
xinhe-nv
50a74a1daa
tests: fix 5273697 ( #4685 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-05 10:39:21 +08:00
Shiyu Li
b0d287c9b7
[TRTLLM-4647][fix] Fix the no fusion allreduce hanging ( #4594 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-06-04 18:26:13 -07:00
Mike Iovine
8433091630
[infra] Unwaive unittests/_torch ( #4919 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-05 08:49:37 +08:00
Lucas Liebenwein
f9d45e03a4
[AutoDeploy] deprecate CI post-merge tests and keep them for local testing ( #4892 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 08:27:17 +08:00
Yan Chunwei
8e0d96fcc6
fix: LLM invalid arg in a test ( #4922 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-05 08:00:32 +08:00
Yuxian Qiu
6b3242654e
fix: Fix broken vanilla moe since FusedMoE refactor. ( #4897 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-05 03:56:41 +08:00
Yi Zhang
1fca654bfd
tests: Update gb200 test case ( #4754 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-04 18:49:20 +08:00
tomeras91
8d31e16877
[TRTLLM-4923][feat] Paged mamba cache ( #4822 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-04 09:27:08 +03:00
Omer Ullman Argov
e71de2a13e
chore: Mass integration of release/0.20. ( #4871 )
...
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-06-04 14:12:27 +08:00
Yan Chunwei
ac20159d32
fix: build_config in TorchLlmArgs and avoid invalid args ( #4600 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-04 13:17:29 +08:00
Yukun He
5fa6fbd989
feat: Enhance AutoTuner inference path and code readability ( #4466 )
...
Fix AutoTuner warmup request generation.
* The current warmup phase creates only one request, which is insufficient to cover max_num_tokens. Revise the warmup phase to use a batch of requests that covers max_num_tokens, eliminating potential fallback cases.
Refactor AutoTuner API and reduce host overhead.
* Refine the (min, opt, max) values of the optimization profile setup for get_valid_tactics to achieve the correct canImplement definition.
* Refine the cache key assembly process to reduce host overhead and simplify the API.
* Fix lru_cache usage to reduce host overhead.
* Make tuning config initialization a one-time object in the tunable runner to reduce host overhead.
Improve tuning config readability.
* Use a dataclass to define the tuning config (a minimal sketch follows this entry).
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-04 10:53:11 +08:00
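The AutoTuner entry above combines three ideas: a dataclass-based tuning config, lru_cache'd lookups to cut host overhead, and a warmup batch that covers max_num_tokens. The minimal Python sketch below illustrates that shape only; TuningConfig, pick_tactic, warmup_token_counts, and dynamic_token_range are hypothetical stand-ins, not the actual TensorRT-LLM AutoTuner API.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import List, Tuple


@dataclass(frozen=True)
class TuningConfig:
    # (min, opt, max) token counts describing the optimization profile
    dynamic_token_range: Tuple[int, int, int] = (1, 2048, 8192)
    tune_max_num_tokens: int = 8192


@lru_cache(maxsize=None)
def pick_tactic(config: TuningConfig, num_tokens: int) -> int:
    """Return an index into a tactic table; cached so repeated shapes skip host-side work."""
    lo, opt, hi = config.dynamic_token_range
    if not lo <= num_tokens <= hi:
        return -1  # outside the profiled range: fall back to a default tactic
    return 0 if num_tokens <= opt else 1


def warmup_token_counts(max_num_tokens: int, batch_size: int = 8) -> List[int]:
    """Build a batch of warmup sizes that together cover max_num_tokens."""
    step = max(1, max_num_tokens // batch_size)
    return [min(max_num_tokens, (i + 1) * step) for i in range(batch_size)]


if __name__ == "__main__":
    cfg = TuningConfig()
    for n in warmup_token_counts(cfg.tune_max_num_tokens):
        print(n, pick_tactic(cfg, n))
```

Because the frozen dataclass is hashable, the whole config can serve directly as part of the lru_cache key, which is the host-overhead trick the entry describes.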
Shi Xiaowei
b13f8c9cba
Fix: NVBug 5302895 ( #4835 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-04 09:31:39 +08:00
Mike Iovine
73389d6531
[fix] Fix llama 4 long context ( #4809 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-04 07:48:08 +08:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins ( #4643 )
...
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
rakib-hasan
d0eb47d33a
[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation ( #4506 )
...
* refactoring the multimodal input prep
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* adding out-of-tree override option
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* adding exceptional case for llava-next
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* fixing typo
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing review comments, adding placement option, handling tokenizer variations
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing pytest-asyncio behavior change
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
---------
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-06-03 12:02:07 -07:00
Simeng Liu
2384655c3a
chore: Waive examples/test_mistral.py::test_llm_mistral_v1_1gpu. ( #4873 )
...
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-06-03 14:45:14 -04:00
pcastonguay
01f29ce38b
[nvbug 5294316] fix: Fix queued request stats ( #4714 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-03 08:33:08 -04:00
Shunkangz
ae9a6cf24f
feat: Add integration of etcd ( #3738 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Batsheva Black <bblack@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-06-03 20:01:44 +08:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling ( #4532 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
hlu1
320195dc0d
[Architecture] Refactor FusedMoE ( #4790 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-06-03 14:02:19 +08:00
Iman Tabrizian
141467d4b6
Add pre-merge Triton backend tests ( #4842 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-06-03 00:47:58 -04:00
ruodil
fa93eeee84
shorten reqs in con:1 cases and add streaming cases, and add l2 perf … ( #4849 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-03 12:28:13 +08:00
Ivy Zhang
8686868531
tests: [TRTQA-2905] improve timeout report for qa test cases ( #4753 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-03 12:27:27 +08:00
Robin Kobus
e34a1beb72
[nvbugs/5303555] ci: unwaive test_fp8_block_scales_cuda_graph_padding ( #4735 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 10:40:43 +08:00
Fanrong Li
380a5d1690
[ https://nvbugs/5271281 ][fix] fix a pd+mtp accuracy issue ( #4536 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-03 10:03:34 +08:00
Fanrong Li
13f68338d2
fix: [ https://nvbugspro.nvidia.com/bug/5273945 ] Unwaive tests for bug-5273945 ( #4832 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 22:01:57 +08:00
Yanchao Lu
8166649d03
[Infra] - Minor clean-up and test Ubuntu mirrors ( #4829 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-02 20:18:20 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP(part 5: Static EP load balancer with offline statistics) ( #4695 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Fanrong Li
7d356efc7d
fix: fix accuracy and illegal memory access issues when using mtp + attention dp ( #4379 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-02 00:35:52 +08:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H ( #4494 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
amirkl94
8039ef45d3
CI: Performance regression tests update ( #3531 )
2025-06-01 09:47:55 +03:00
Lucas Liebenwein
491a09b0c6
[AutoDeploy] Increased Model Coverage Mass Migration Week 2 ( #4817 )
...
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-06-01 14:40:29 +08:00
Emma Qiao
202813f054
Check test names in waive list ( #4292 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-06-01 14:39:30 +08:00
Enwei Zhu
0087bd27ba
[fix] Fix SamplingParams check on n and best_of ( #4655 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 09:11:55 +08:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler ( #4538 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Dom Brown
338d6e9f95
[nvbug 5305210] fix: Resolve nvbug 5305210 ( #4759 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-31 19:21:06 +08:00
Yan Chunwei
93c0632ee4
opt: improve the performance of dist-agg streaming generation ( #4214 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-31 17:40:32 +08:00
Emma Qiao
c945e92fdb
[Infra] Remove some old keywords ( #4552 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-31 13:50:45 +08:00
Zheng Duan
54200ee8ac
fix: random fail of cache router test ( #4597 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-30 16:28:19 +08:00
Enwei Zhu
ee916da8f1
test: Waive test_llm_loading_from_ckpt_for_tp2 ( #4797 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-30 15:43:00 +08:00
xinhe-nv
53794b26f8
test: skip test_llm_hf_gemma_quantization_1gpu_vswa on A100 ( #4779 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-30 15:12:12 +08:00
Aurelien Chartier
36b87b8671
chore: fix llm_root when LLM_ROOT is not set ( #4741 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-29 19:44:34 -07:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE ( #4303 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main ( #4739 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Jhao-Ting Chen
fcadce9f8d
[fix] Eagle-2 LLMAPI pybind argument fix. ( #3967 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-29 12:23:25 -07:00
yuanjingx87
2c48ff5898
[feat] add b200 support via slurm ( #4709 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-05-29 14:49:46 +08:00
Yan Chunwei
33a9ba55f5
fix: test trtllm-bench mgmn ( #4613 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 14:43:47 +08:00
ruodil
500aca4f44
test: remove perf test l40s/l20 oom test cases and unwaive tests ( #4755 )
...
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-05-29 13:58:47 +08:00
QI JUN
058f83e47b
CI: move post-merge multi GPU test of PyTorch backend to H200 ( #4733 )
...
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-29 11:15:56 +08:00
Yiqing Yan
7f29a70f53
Waive L0 test ( #4748 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-29 11:05:27 +08:00
Yan Chunwei
ac17142495
chore: rename ExecutorBindingsWorker/Proxy ( #4716 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 10:32:35 +08:00
Arthur Rasmusson
812b1abf86
feature: KV Cache GPUDirect Storage ( #3209 )
...
Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com.>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-28 23:27:43 +00:00
Erin
820c39041f
chore: [nvbug_5273941] unwaive test_llm_loading_from_ckpt_for_tp2 ( #4725 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-29 06:54:32 +08:00
Aurelien Chartier
6cf1e4d0a9
chore: add -f to pkill calls ( #4711 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-29 02:54:31 +08:00
Ivy Zhang
ed3c67e34a
tests: [ https://nvbugspro.nvidia.com/bug/5289908 ] run maverick bf16 on blackwell ( #4722 )
...
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-28 22:05:51 +08:00
xinhe-nv
93283484c2
test: [CI] Add failed cases into waives.txt ( #4688 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-28 22:04:35 +08:00
Yan Chunwei
5506f60037
chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs ( #4603 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-28 18:43:04 +08:00
amirkl94
fbec0c3552
Release 0.20 to main ( #4577 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
Signed-off-by: Simeng Liu <simengl@nvidia.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>
Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: stnie <82932102+stnie@users.noreply.github.com>
Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-28 16:25:33 +08:00
Pengyun Lin
971d16a2ee
[TRTLLM-1658][feat] Enable multiple response in trtllm-serve for TRT backend ( #4623 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-28 11:36:44 +08:00
Yuxian Qiu
5700a4ffcd
feat: Add vanilla MOE. ( #4682 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-28 10:44:14 +08:00
xinhe-nv
bb3d998eb1
test: [CI] remove closed bugs ( #4638 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-27 18:07:59 +08:00
Lucas Liebenwein
5cdd6bb10f
[AutoDeploy] Increased Model Coverage Mass Migration Week 1 ( #4468 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-27 16:43:15 +08:00
Yiqing Yan
f6c50293d2
[Infra][TRTLLM-3929] Rerun failure tests ( #3264 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-27 16:13:23 +08:00
Yiqing Yan
92a7984945
Waive L0 tests ( #4686 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-27 15:07:02 +08:00
xinhe-nv
59f7622281
test: rcca https://nvbugs/5223130 ( #4510 )
...
* add rcca tests
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* skip tests on blackwell
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-27 09:59:47 +08:00
yuanjingx87
732d92ff62
[Infra] - Multi-GPU testing support with Slurm ( #4454 )
...
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-26 19:44:19 +08:00
Enwei Zhu
88190faa34
feat: large-scale EP(part 4: Static EP load balancer integration) ( #4615 )
...
* MoeLoadBalancerConfig
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* MoeLoadBalancer integration
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* config file
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-26 18:25:11 +08:00
Emma Qiao
6f626af386
[TRTLLM-4535][infra]: Add marker TIMEOUT for test level ( #3905 )
...
* Add marker for TIMEOUT
Signed-off-by: qqiao <qqiao@nvidia.com>
* Remove workspace after tests
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add missed property
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add some debug info
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix errors
Signed-off-by: qqiao <qqiao@nvidia.com>
* Testing
Signed-off-by: qqiao <qqiao@nvidia.com>
* Special process for unittests
Signed-off-by: qqiao <qqiao@nvidia.com>
* Move special processing of unittests to the test generating stage
Signed-off-by: qqiao <qqiao@nvidia.com>
* Process for the whole test list
Signed-off-by: qqiao <qqiao@nvidia.com>
* Test more
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add another test case
Signed-off-by: qqiao <qqiao@nvidia.com>
* Change back the setting for testing
Signed-off-by: qqiao <qqiao@nvidia.com>
* Revert another config file
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add description for timeout in test readme
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-25 23:30:40 -07:00
Yiqing Yan
2fee408536
Waive L0 tests ( #4645 )
...
* Waive L0 tests
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* Apply suggestions from code review
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-26 11:05:01 +08:00
hlu1
4a236d107d
[Fix][Deepseek] Fix bugs in TestDeepSeekR1 ( #4413 )
...
[Deepseek] Fix bugs in TestDeepSeekR1
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-24 09:52:57 +08:00
Yanchao Lu
20c15fc04f
Fix invalid testcase name ( #4626 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-24 00:40:00 +08:00
dominicshanshan
ca3eaf4070
[nvbug/5028235][fix]pytest bindings tokens logtis comparison. ( #4424 )
...
* fix bug 5028235.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* fix bug 5028235 and update comments.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Update tests/unittest/bindings/test_executor_bindings.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Remove redundant code.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Update based on review comments.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-23 20:41:00 +08:00
Robin Kobus
15a59e57f6
[nvbugs/5301492] ci: waive test_workers_kv_cache_aware_router ( #4617 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 20:14:28 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA ( #4535 )
...
* optimize kv cache reuse workflow for MLA:
write the kv cache first and call the up-projection GEMM only once
relax the contiguity requirements of k/v for setting the paged kv cache
return two contiguous tensors when loading the MLA KV cache (see the sketch after this entry)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* support fp8 kv cache for MLA kv cache reuse
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
---------
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
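The MLA entry above mentions calling the up-projection GEMM only once and returning two contiguous tensors when loading the reused KV cache. The sketch below illustrates that idea with plain PyTorch; load_mla_kv, w_uk, and w_uv are hypothetical names and shapes, not the actual TensorRT-LLM implementation, which additionally handles FP8 dequantization and paged layouts.

```python
import torch


def load_mla_kv(compressed_kv: torch.Tensor, w_uk: torch.Tensor, w_uv: torch.Tensor):
    """compressed_kv: [num_tokens, kv_lora_rank]; w_uk / w_uv: [kv_lora_rank, num_heads * head_dim]."""
    # One up-projection GEMM for both K and V by concatenating the weights once.
    up = compressed_kv @ torch.cat([w_uk, w_uv], dim=-1)  # [num_tokens, 2 * num_heads * head_dim]
    k, v = up.chunk(2, dim=-1)
    # Return two contiguous tensors so downstream attention kernels can consume them directly.
    return k.contiguous(), v.contiguous()


if __name__ == "__main__":
    kv = torch.randn(16, 512)         # 16 reused tokens, kv_lora_rank = 512 (illustrative)
    wk = torch.randn(512, 128 * 64)   # 128 heads, head_dim = 64 (illustrative)
    wv = torch.randn(512, 128 * 64)
    k, v = load_mla_kv(kv, wk, wv)
    print(k.shape, v.shape, k.is_contiguous(), v.is_contiguous())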
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend ( #4530 )
...
* MoE TRTLLM backend for Qwen3
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* add extra moe_backend to test
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* address comments
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* conditionally compile kernels on newer archs
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* missing positional arg
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* Update the routing kernels
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Revise usage of TLLM_LOG_ERROR
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* Add unit test for Qwen3 moe (trtllm_gen backend)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
* improve weight processing speed of moe_backend=TRTLLM; roughly 2x
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* tidy and minor fix
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
* temporarily disable accuracy test that has a known issue
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
---------
Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
Yiqing Yan
3ca05330f9
Waive L0 test ( #4609 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-23 15:54:11 +08:00
Bo Li
9ae705af1b
perf: Add fused q_norm/k_norm/RoPE for Qwen3. ( #4482 )
...
* Add Julien's original kernel.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Get rid of UpdateKVCache functionality.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add kernels.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add torch OP.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update cmake.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Torch OP must use double as argument dtype.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add unittest.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add unittest.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix misaligned access when head_dim=64.
In this case, numElemsPerThread=2 and numVecPerThread=0, but the store code incorrectly performs a vectorized store, so some threads (e.g., lane1) issue stores to addresses that are not aligned to 64 bits (a small worked example follows this entry).
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Remove unroll (compiler can do that).
Cleanup code.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add switch for interleave.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Refactor vectorized load/store.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Implement is_neox. Result not correct yet.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix is_neox=True.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Add q_weight and k_weight.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
---------
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-23 15:31:04 +08:00
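The head_dim=64 fix above hinges on when a per-thread vectorized store is legal. The toy calculation below (illustrative only, not the kernel's actual code) reproduces the arithmetic under the assumption of 32 threads per warp and 4 fp16 elements per 64-bit vector: head_dim=128 gives each thread a whole vector, while head_dim=64 leaves only 2 elements per thread, so the kernel must fall back to scalar stores.

```python
def store_plan(head_dim: int, warp_size: int = 32, elems_per_vec: int = 4):
    """Decide whether each thread can issue whole-vector (64-bit, 4 x fp16) stores."""
    elems_per_thread = head_dim // warp_size              # head_dim=64  -> 2
    vecs_per_thread = elems_per_thread // elems_per_vec   # head_dim=64  -> 0
    use_vectorized_store = vecs_per_thread > 0
    return elems_per_thread, vecs_per_thread, use_vectorized_store


print(store_plan(128))  # (4, 1, True):  every thread owns a full 4-element vector
print(store_plan(64))   # (2, 0, False): must fall back to scalar stores
```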
bhsueh_NV
6527c055cf
chore: fix bug of llama lora test ( #4566 )
...
* fix bug of llama lora test
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* Update test_llm.py
fix bug detected by pre-commit
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-23 14:06:40 +08:00
coldwaterq
1cf0e672e7
fix: [nvbugs/5066257] serialization improvments ( #3869 )
...
* added a restricted pickler and unpickler in a separate serialization function (see the sketch after this entry).
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects, which made debugging harder, and added tests.
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* cleaned up a couple files to reduce conflicts with main.
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* fix unit tests
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* reorder BASE_ZMQ_CLASSES list alphabetically
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* fix tests and move LogitsProcessor registration to base class
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* revert changes to import log of tensorrt_llm._torch.models
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* fix tests and move LogitsProcessor registration to base class
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* additional comments for multiprocess approved list sync
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* add dataclass from tests
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
---------
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-23 13:06:29 +08:00
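The serialization entry above restricts what IPC/ZMQ messages may deserialize by checking against an approved class list. Below is a minimal sketch of that general pattern; APPROVED_CLASSES and restricted_loads are stand-ins, not the actual BASE_ZMQ_CLASSES machinery in TensorRT-LLM.

```python
import io
import pickle

# Stand-in allowlist; the real implementation maintains its own approved class list.
APPROVED_CLASSES = {
    ("builtins", "dict"),
    ("builtins", "list"),
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse to materialize anything that is not explicitly approved.
        if (module, name) not in APPROVED_CLASSES:
            raise pickle.UnpicklingError(f"{module}.{name} is not on the approved list")
        return super().find_class(module, name)


def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


if __name__ == "__main__":
    payload = pickle.dumps({"tokens": [1, 2, 3]})
    print(restricted_loads(payload))  # plain containers pass; unapproved classes raise
```

Spawned child processes need the same allowlist, which is why the comments in the commit explain passing the approved list into them explicitly.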
CarstyYou
ef280e687e
[feat] support fp8 blockscale gemm on sm89 ( #4481 )
...
* [feat] integrate ada blockwise gemm
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [fix] align scale M
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [feat] swizzle mma output
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [test] add ut for sm89
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [delete] remove useless comments
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [chore] codestyle
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [fix] fix review comments
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [chore] fix license
Signed-off-by: CarstyYou <xiy@nvidia.com>
* [chore] fix license
Signed-off-by: CarstyYou <xiy@nvidia.com>
---------
Signed-off-by: CarstyYou <xiy@nvidia.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-05-23 10:39:10 +08:00
Enwei Zhu
d7443b6068
[ https://nvbugspro.nvidia.com/bug/5181262 ] [test] Unwaive Mistral Nemo test ( #4515 )
...
unwaive
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-23 10:14:00 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend ( #4243 )
...
* feat: Enabling dis serving with TRT backend with Python runtime
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing formatting
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing disagg mtp test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Mike Iovine
14fc48ada7
[nvbug/5285881][fix] Fix chunked prefill + overlap scheduler ( #4402 )
...
[fix] Fix chunked prefill + overlap scheduler
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-23 04:38:22 +08:00
Venky
c713eb5799
test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) ( #4446 )
...
ultra
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-05-22 13:07:33 -07:00
xinhe-nv
22c01d5b21
test: [CI] Add failed cases into waives.txt ( #4549 )
...
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* fix test issues
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-22 17:18:53 +08:00
ruodil
1a45890dae
test: waive hanging cases for perf test ( #4562 )
...
waive hanging cases
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
2025-05-22 15:50:05 +08:00
Kaiyu Xie
2898d268f9
feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856 ) ( #4349 )
...
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>
2025-05-22 11:46:06 +08:00
HuiGao-NV
bc9f1dbede
fix[nvbug-5228840]: Remove test cases of feature not supported anymore ( #3972 )
...
* Remove waived cases
* Remove test cases for features that are no longer supported
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-22 11:18:58 +08:00
Aurelien Chartier
f491244c84
feat: add dataset support for benchmark_core_model with LLMAPI ( #4457 )
...
* feat: add dataset support for benchmark_core_model with LLMAPI
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-21 19:18:43 -07:00
Kaiyu Xie
099cd3ce07
chore: Add all_reduce.py benchmark script to test ( #4537 )
...
Add all_reduce.py script to test
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-22 10:13:27 +08:00
Michal Guzek
9033dd987d
[TRTLLM-4932] Add CLI accuracy tests for Phi-4-mini-instruct ( #4415 )
...
Add phi-4-mini CLI acc test
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-05-22 09:56:48 +08:00
Yan Chunwei
4798d088d9
chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs ( #3823 )
...
* partition LlmArgs
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* update backend
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-22 09:40:56 +08:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL ( #4125 )
...
* agentConnection
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
recv
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
agentState
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
NIXL interfaces
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
update cmakelists
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
nixl improve
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
remove cppzmq
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
fix
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
transferAgent remove register
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
work for cache Test
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
reduce sleep time
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
fix test
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
integrate
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
nixl env
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
fix rebase error
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
cpp test
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
stash for send metaData
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
loadRemoteMD after fetchRemoteMD
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
workaround for mixed gen and context
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
test_env
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
avoid port conflict in test
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* format
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* use std::string
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* typo
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* fix transferAgentTest
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
---------
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Aurelien Chartier
1681e9fd1e
chore: remove extra PYTHONPATH ( #4453 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-21 17:38:01 -07:00
Dom Brown
1cffa99792
test: Split test_simple into mpi_utils and cache transceiver tests for DGX ( #4451 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-22 04:26:21 +08:00
Zongfei Jing
dbaddb3a29
Adding two-shot allreduce kernel and mnnvl multicasting buffer ( #4216 )
...
* Adding two-shot allreduce kernel and mnnvl multicasting buffer
Signed-off-by: Shiyu Li <shili@nvidia.com>
Adding comments
Signed-off-by: Shiyu Li <shili@nvidia.com>
Add unittest of the twoshot kernel.
Signed-off-by: Shiyu Li <shili@nvidia.com>
Update dispatch logic
Signed-off-by: Shiyu Li <shili@nvidia.com>
Use cpu barrier instead of GPU at init
Signed-off-by: Shiyu Li <shili@nvidia.com>
Merge dispatch logic fix
Signed-off-by: Shiyu Li <shili@nvidia.com>
Update the kernel to use GPU-managed buffer
Signed-off-by: Shiyu Li <shili@nvidia.com>
* Refine
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Clean code
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Fix compile error
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Fix issue
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Clean up
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Simplify AllReduce interface
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Rename
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Fix warning
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Tidy code
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Rename
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Fix compile error
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Refine
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Skip ut for no_fusion
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Refine
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
---------
Signed-off-by: Shiyu Li <shili@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Shiyu Li <shili@nvidia.com>
2025-05-22 03:42:36 +08:00
Venky
0a8461d54c
test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) ( #4499 )
...
add low concurrency perf tests
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-21 10:46:48 -07:00
xinhe-nv
407ef08662
tests: add qwene fp4 tests into QA test list & update sanity test list ( #4478 )
...
* update sanity test list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update test list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-21 16:52:02 +08:00
ruodil
83f1933f0c
test: add failed case in waive list and fix some test script issue for perf test ( #4527 )
...
add failed case in waive list and fix some test script issue
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
2025-05-21 16:37:25 +08:00
ruodil
3d9a2b5eb7
test: remove enable_overlap_schedule in pytorch config and set enable_chunked_prefill to be true for isl>2048 cases ( #4285 )
...
1. remove enable_overlap_schedule in pytorch config
2. rename model_yaml_config.py to pytorch_model_config.py and set enable_chunked_prefill to be true for cases with isl>2048
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-21 14:26:56 +08:00
QI JUN
15317ece5a
CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite ( #4520 )
...
waive test_fp8_block_scales_4gpus of deepseek v3 lite
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-21 13:19:43 +08:00
xinhe-nv
750f412b8f
tests: add llama 3.3 70b 2 nodes tests ( #4391 )
...
* add llama 3.3 70b 2 nodes tests
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* remove enable_overlap_scheduler parameter
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-21 12:42:45 +08:00
Chuang Zhu
ab5bea957d
unwaive some disagg test ( #4476 )
...
* unwaive some disagg test
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* pytest.mark.skip_less_device(4)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
---------
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-21 11:45:11 +08:00
Ruoqian Guo
db7446fda7
Feat: add deep_gemm swapab Kernel ( #4430 )
...
* feat: add deepgemm_swapab
feat: add fp8_gemm_kernel_swapab
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
feat: set threshold for deepgemm and deepgemmswapab
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
* docs: update README.md
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
* fix: std::runtime_error needs #include <stdexcept>
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
* chores: remove the redundant code
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
* feat: support for dense deep_gemm swapab
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
* chores: remove redundant code
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
---------
Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-05-21 10:48:43 +08:00
QI JUN
2372589689
Chore: waive torch compile test cases of deepseek v3 lite ( #4508 )
...
waive torch compile test cases of deepseek v3 lite
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-21 10:43:31 +08:00
Shi Xiaowei
3d62727303
test: NIXL single process test ( #4486 )
2025-05-21 10:41:46 +08:00
Thor Johnsen
5d438be59a
[TRTLLM-5000][feat] Pytorch implementation of ngram drafter ( #3936 )
...
* v1.5
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
v1.5.4 Add back draft_overhead to spec dec stats
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* v1.5.5: fix CI error
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.6: fix CI error 8196 > 8192
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* Address reviewer concerns
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* Address reviewer concerns
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* precommit run
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
* v2.0: Address reviewer concerns
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v2.1: add fix from wili
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* Revert changes that require use of TypeAlias because that requires python version >= 3.10
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
---------
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-21 10:40:00 +08:00
Yan Chunwei
9199793848
fix: llmapi-launch add add trtllm-bench test with engine building ( #4091 )
...
* add trtllm-bench mgmn test
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-21 10:18:01 +08:00
Zheng Duan
77a0189554
feat: conditional disaggregation in disagg server ( #3974 )
2025-05-21 09:57:46 +08:00
Venky
9a8c3ece22
test(perf): Add remaining Phi-4-mini-instruct perf tests ( #4443 )
...
add remaining 2 phi cpp perf tests
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-21 09:26:12 +08:00
xinhe-nv
19c6e68bec
test: [CI] remove closed bugs ( #4417 )
...
* waives closed bugs
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waives
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-21 09:13:25 +08:00
Rohan Varma
3d940e77f0
[TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug ( #4290 )
...
* add bidirectional support and fix EarlyStopDecoder unsqueeze to be compatible with LogitsStorage
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
* run pre-commit
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
* instead of bidirectional flag use ModelConfig.is_generation
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
* fix unit test to extract logits from correct dim
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
---------
Signed-off-by: Rohan Varma <rohanv@nvidia.com>
2025-05-20 10:15:36 -07:00
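A small sketch of the masking switch described in the commit above: when the model is used as an encoder rather than for generation, a full (bidirectional) attention mask replaces the causal one. The helper below only mirrors the ModelConfig.is_generation idea and is an assumption-labeled illustration, not the actual TensorRT-LLM code.

import torch

def build_attention_mask(seq_len: int, is_generation: bool) -> torch.Tensor:
    # Generation keeps the usual causal (lower-triangular) mask;
    # encoder usage attends over the full sequence in both directions.
    if is_generation:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

print(build_attention_mask(4, is_generation=False))

Switching on the generation flag keeps decoder-only checkpoints causal while letting the same weights serve as a bidirectional encoder.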
Robin Kobus
8564c5a41f
refactor: Unify request order in TRT and PyTorch workflow ( #4096 )
...
* chore: Partition context requests in MicroBatchScheduler
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fixup! chore: Partition context requests in MicroBatchScheduler
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-20 18:49:27 +02:00
Yan Chunwei
174c5188a2
fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu ( #4428 )
...
* add test
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-20 20:16:14 +08:00
tomeras91
7b09cd904d
[TRTLLM-5085][fix] Nemotron H correctness test ( #4444 )
...
* Replace sanity test for nemotron h with a correctness test
* Add prefill+decode reference logprobs from initial implementation + batched forward test
* Add testing that decode matches prefill - compare decode outputs against prefilling all of the decoded tokens
2025-05-20 17:55:25 +08:00
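The decode-vs-prefill check above can be pictured with a small sketch: the token chosen at each decode step should be reproduced when the same prefix is prefilled in one pass. The toy model below is a deterministic stand-in, not Nemotron-H.

import torch

def fake_logits(tokens):
    # Deterministic toy "model": logits depend only on the token prefix.
    g = torch.Generator().manual_seed(sum(tokens) % (2 ** 31))
    return torch.randn(32, generator=g)

prompt, decoded = [1, 2, 3], []
for _ in range(4):                       # token-by-token decode
    decoded.append(int(torch.argmax(fake_logits(prompt + decoded))))

# Prefilling the same prefix must reproduce the last decoded token.
assert int(torch.argmax(fake_logits(prompt + decoded[:-1]))) == decoded[-1]
print(decoded)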
dongxuy04
21aff2e313
feat: large-scale EP(part 2: MoE Load Balancer - core utilities) ( #4384 )
...
* first commit of cpp moe loadbalance code
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add python bindings for moe load balance
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add python wrapper, ut and bug fixes
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add binding for layerId and update binding test
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add host tensor sharing and ut
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
bhsueh_NV
ec4190fb71
infra: Add qwen3 235B tests into QA ( #4483 )
...
* add qwen3 qa test
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* add qwen3 test into qa list
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-20 17:37:09 +08:00
Lucas Liebenwein
de409e8468
[AutoDeploy] HF factory improvements ( #4371 )
...
* [AutoDeploy] HF factory improvements
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* improve monkey-patches and add unit tests
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-19 20:13:43 -07:00
ruodil
b5edf13b33
test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro ( #4282 )
...
* add cases for rtx_pro_6000 and update test filter
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* amend a typo in model llama_v3.1_405b_instruct fp4 and add more cases for rtx pro 6000 and waive_list
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-20 10:58:05 +08:00
Michal Guzek
0a342a42f7
[TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant ( #4362 )
...
* Add CLI TestLlama3_3_70BInstruct acc tests
Signed-off-by: moraxu <mguzek@nvidia.com>
* Add tests to qa lists
Signed-off-by: moraxu <mguzek@nvidia.com>
* Add comment
Signed-off-by: moraxu <mguzek@nvidia.com>
* Fix test names
Signed-off-by: moraxu <mguzek@nvidia.com>
* Update yaml files
Signed-off-by: moraxu <mguzek@nvidia.com>
* Update cli file
Signed-off-by: moraxu <mguzek@nvidia.com>
---------
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-05-20 09:48:14 +08:00
xinhe-nv
402385588d
test: [CI] Add failed cases into waives.txt ( #4429 )
...
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waive id
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-20 09:43:55 +08:00
kanghui0204
6f3922f318
feat: Low Precision Allreduce for PCIe based GPU ( #4344 )
...
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.
Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-20 06:53:46 +08:00
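As a rough illustration of the idea described above (not the actual TensorRT-LLM kernel), the sketch below simulates a low-precision allreduce: each rank's tensor is quantized with a per-tensor scale, the quantized payloads stand in for the data exchanged over PCIe, and the reduction is accumulated after dequantization.

import torch

def lp_allreduce_sim(per_rank_tensors, bits=8):
    # Quantize each rank's contribution to a signed integer grid, then
    # accumulate the dequantized payloads (stand-in for the PCIe exchange).
    qmax = 2 ** (bits - 1) - 1
    out = torch.zeros_like(per_rank_tensors[0])
    for t in per_rank_tensors:
        scale = t.abs().max().clamp(min=1e-8) / qmax
        q = torch.round(t / scale).clamp(-qmax, qmax)
        out += q * scale
    return out

ranks = [torch.randn(1024) for _ in range(4)]
print((lp_allreduce_sim(ranks) - sum(ranks)).abs().max())  # quantization error

Quantizing before the exchange trades a small amount of numerical accuracy for substantially less data moved over the comparatively slow PCIe link.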
Yuxian Qiu
c8e062bfd3
fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. ( #4399 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-19 14:25:36 -07:00
Venky
bb02d86b54
test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) ( #4128 )
...
* changes to run llama-v3.3-nemotron-super-49b
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
* yapf
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
* address review comments pt 1
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
* re-add cpp super tests
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
---------
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-19 12:00:48 -07:00
Perkz Zheng
1c5b0d6a13
[Feat] add chunked-attention kernels on Hopper (for llama4) ( #4291 )
...
* update cubins
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
* add mtp for fmha_v2 MLA kernels and add chunked-attention support for hopper fmha kernels
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
---------
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-05-19 09:57:10 -07:00
Faraz
7656af1b57
[TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) ( #4335 )
...
* add mixtral7x8b fp8 test with fixed cutlass fp8 moe gemm
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
* update cutlass versions
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
* added internal cutlass with fix and docker update
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
* added mixtral to pro 6000
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
---------
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-05-19 08:56:21 -07:00
liji-nv
58e405624a
[ https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 ( #3952 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Iman Tabrizian
c6074c47da
Add llama4 disagg accuracy tests ( #4336 )
...
* Add llama4 disagg accuracy tests
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Make it async and add GSM8K benchmark
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-19 21:55:08 +08:00
Shi Xiaowei
001704cc6a
fix: temp disable the problem test ( #4445 )
...
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-05-19 21:54:32 +08:00
Dom Brown
c45f414bbf
Test: Improve model re-use in C++ DGX tests for CI stability ( #4263 )
...
* Fix padded vocab size for Llama
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Refactor multi GPU llama executor tests, and reuse the built model engines
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Fix test list typo
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Further WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Update test lists and readme
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Try parametrize for asymmetric
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Parametrize + skip unsupported combinations
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
* Update test list
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
* Reduce environment duplicated code
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
2025-05-19 14:20:21 +01:00
Shi Xiaowei
df2798e0c3
feat: NIXL interface integration ( #3934 )
...
NIXL interfaces
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-05-19 18:18:22 +08:00
Zhenhuan Chen
e70a205dab
[TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split ( #4337 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-05-19 17:53:41 +08:00
Kaiyu Xie
a43914619f
fix: wrong argument name enable_overlap_scheduler ( #4433 )
...
Fix wrong argument
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-19 15:02:22 +08:00
Yuxian Qiu
cf6cd940e5
feat: Add pp support for hybrid attn/mamba model ( #4358 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-19 14:47:45 +08:00
Yan Chunwei
5b1c88de8d
chore: cleanup perf_evaluator code ( #3833 )
...
* chore: cleanup perf_evaluator code
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* up
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-19 13:21:36 +08:00
Ivy Zhang
58d2508b89
tests: Add test cases for rcca cases ( #4347 )
...
* add qwen2_0_5_instruct cp4 test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add qwen2.5 fp8 kvcache test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add ds distill qwen cpp runner test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* trial
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
---------
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-19 12:06:43 +08:00
Ivy Zhang
c4a0d768b5
tests: add qa test mentioned in docs ( #4357 )
...
* add nemotron-h and llama_70b cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* trial
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add llm decoder quick_start case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update nemotron-h test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add qwen3 quickstart test
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add trtllm_decoder accuracy test
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove quickstart test for llm_decoder
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix import error
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* nemotronh fp8 trial
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix name
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove nemotronh-fp8
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
---------
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-19 10:06:51 +08:00
Faraz
791c209006
[TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) ( #4363 )
...
* added nemotron 49b fp8 for B40 release
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
* add tests to QA list
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
* pre-commit changes
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
---------
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-05-19 09:30:24 +08:00
Iman Tabrizian
7de90a66bc
Remove vila test ( #4376 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-19 09:02:39 +08:00
Pengyun Lin
039f7e3118
[ https://nvbugspro.nvidia.com/bug/5243740 ][fix] deduce default max_tokens for trtllm-serve ( #4265 )
...
* Deduce default max_tokens for trtllm-serve
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Improve executor_config.max_seq_len assignment in TRT workflow
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Enhance error message
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Add deduced max_tokens test
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
---------
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-19 00:34:40 +08:00
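A hedged sketch of the deduction described above: when a request does not set max_tokens, the server can default it to the room left between the prompt length and max_seq_len. The exact formula and clamping used by trtllm-serve may differ; the helper below only illustrates the assumed idea.

from typing import Optional

def deduce_max_tokens(prompt_len: int, max_seq_len: int,
                      requested: Optional[int] = None) -> int:
    # Remaining room in the sequence after the prompt.
    budget = max_seq_len - prompt_len
    if budget <= 0:
        raise ValueError("prompt already fills max_seq_len")
    return budget if requested is None else min(requested, budget)

print(deduce_max_tokens(prompt_len=100, max_seq_len=4096))  # -> 3996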
Yanchao Lu
0d7269e2a7
[Infra][Docs] - Some clean-up for the CI pipeline and docs ( #4419 )
...
* [Docs] - Some clean-up for the docs
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
* [Infra] - Some clean-up for the CI pipeline
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
---------
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-19 00:07:45 +08:00
shaharmor98
27afcb9928
add changes for fp8, nemotron-nas, API ( #4180 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-18 23:27:25 +08:00
Venky
fb663b637a
Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) ( #4195 )
...
* add ll-nm-nano tests that map to nim requirements
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
* prune some pytorch cases (fp8)
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
* removing pyt backend test changes
- When validating the pytorch tests with the isl/osl/conc/quant settings (as is done for the cpp backend), hangs were observed that need further debugging.
- Therefore, to avoid blocking this PR, those tests are removed.
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
---------
Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-17 22:46:21 +08:00
Yuxian Qiu
cc1bba1686
test: Waive tests for nvbugs/5286795. ( #4409 )
...
* Waive tests for nvbugs/5286795.
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
* Apply suggestions from code review
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
---------
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-17 19:41:05 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible ( #3439 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
hlu1
befb93cbff
[Deepseek] Add accuracy test references for fp8 kvcache ( #4374 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-17 11:23:00 +08:00
Lucas Liebenwein
7c85890ec7
[AutoDeploy] eager pattern matcher new pattern ( #4370 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:44 -04:00
Netanel Haber
9cd8148f28
API Breaking Change + Readability: "decoder"->"sampler" ( #4121 )
...
* *decoder*->*sampler*; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors
* **Breaking Change**, as it changes public interfaces, main changes:
* PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler.
* Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py.
---------
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-05-16 23:52:25 +08:00
Lucas Liebenwein
8e4320ede5
[AutoDeploy] configurable cache resize ( #4372 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 10:07:09 -04:00
Fridah-nv
bce281d592
feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) ( #3638 )
...
* add docstring to summarize current rope support
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: replace call_method, adjust inserting order of cos_sin_cache calculation node
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add unit test for triton rope and ds rope
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* update rope matcher to match DS RoPE, add custom op for reference, add unit test case
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* cache cos[pos_idx].unsqueeze and sin[pos_idxs].unsqueeze
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor doc update
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate pattern matching and optimization for explicit and complex rope + minor updates
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* clean rope impl in repo
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* replace fused_flattened_mla_with_cache's rope impl with torch_apply_rope_with_qk_interleaving, update unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate layout infer and transpose to a new transformation
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* update rope_with_explicit_freqs and rope_with_input_interleaved to expose unsqueeze_dim and support match_rope_layout, add unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* solve merge conflict in transform.py, need to fix optimize_rope with cuda graph capture
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor clean up after rebase
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
* fix pre-commit
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* support map to bnsd layout and infer unsqueeze_dim from op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix cos/sin not the same across prompts in the same batch issue when mapping to flashinfer op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix for unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix custom op input/output node ordering issue for DeepSeek V3 rope
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* clean code
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move flattening of cos_sin_cache to the graph, update flashinfer op docstring and test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* debug transform unit test failure
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
---------
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-05-16 09:55:32 -04:00
liji-nv
fb437ed709
[CI] waive accuracy/test_cli_flow.py::TestTinyLlama1_1BChat::test_pp4 ( #4397 )
...
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-16 20:18:07 +08:00
Nikita Korobov
fa3879629e
feat: TRT-LLM Gen integration for BMM and MoE refactoring ( #4280 )
...
- Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator.
- Refactors TRT-LLM Gen MoE runner to call the BMM interface
- The accuracy is verified for DeepSeek R1 FP4
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-16 13:31:53 +02:00
Emma Qiao
27bdd0c82d
[TRTLLM-4886][infra]Try another timeout opt to exit test thread directly instead of gracefully ( #4341 )
...
* Try another timeout opt to kill test thread
Signed-off-by: qqiao <qqiao@nvidia.com>
* Return true when try to delete non-existing result file
Signed-off-by: qqiao <qqiao@nvidia.com>
* quick test for the result file
Signed-off-by: qqiao <qqiao@nvidia.com>
* Change back the global timeout setting
Signed-off-by: qqiao <qqiao@nvidia.com>
* Try to kill test in internal pytest
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-16 17:56:40 +08:00
Daniel Cámpora
df19430629
chore: Mass Integration 0.19 ( #4255 )
...
* fix: Fix/fused moe 0.19 (#3799 )
* fix bug of stream init
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bug
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix: Add pre-download of checkpoint before benchmark. (#3772 )
* Add pre-download of checkpoint before benchmark.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Add missing remote code flag.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Move from_pretrained to throughput benchmark.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Move download and use snapshot_download.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Removed trusted flag.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Fix benchmark command in iteration log test.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---------
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* [https://nvbugspro.nvidia.com/bug/5241495 ][fix] CUDA Graph padding with overlap scheduler (#3839 )
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fuse
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* TRTLLM-4875 feat: Add version switcher to doc (#3871 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
* waive a test (#3897 )
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939 )
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
* fix: remote mpi session abort (#3884 )
* fix remote mpi session
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* skip fp8 gemm for pre-hopper (#3931 )
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* [https://nvbugspro.nvidia.com/bug/5247148 ][fix] Attention DP with overlap scheduler (#3975 )
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* update multigpu list
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix namings
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* Doc: Fix H200 DeepSeek R1 perf doc (#4006 )
* fix doc
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
* update perf number
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
---------
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
* Fix the perf regression caused by insufficient cache warmup. (#4042 )
Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* doc: Update 0.19.0 release notes (#3976 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
* Optimize the AutoTuner cache access code to reduce host code overhead. (#4060 )
The NVFP4 Linear op is very sensitive to the host overhead.
This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Update switcher (#4098 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
* doc: update release notes (#4108 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
* docs:update 0.19 doc. (#4120 )
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
* docs:add torch flow supported model list. (#4129 )
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
* doc: Release V0.19 Perf Overview Update (#4166 )
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
* Fix readme of autodeploy.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update tensorrt_llm/_torch/pyexecutor/llm_request.py
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Revert mgmn worker node.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Change to disable_overlap_scheduler.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>
2025-05-16 10:53:25 +02:00
HuiGao-NV
d5578b37fc
Change the method to calculate kv memory size in tests ( #4332 )
...
* Change the method to calculate kv memory size in tests
* Set larger peak memory size for the llama case
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-16 15:35:40 +08:00
xinhe-nv
500b43e90c
test: [CI] remove closed bugs ( #4345 )
...
update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-16 13:47:42 +08:00
Stanley Sun
11aa50d1ea
test: add kv cache aware test cases to qa test list ( #4257 )
...
add kv cache_aware test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-05-16 12:47:01 +08:00
WeiHaocheng
54d28718c7
feat: support benchmark on scaffolding ( #3328 ) ( #4286 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-16 12:28:49 +08:00
QI JUN
c4cd403af9
[CI] waive test_chunked_prefill test cases ( #4380 )
...
waive test_chunked_prefill
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-16 10:27:20 +08:00
Suyog Gupta
b0f7522c82
[AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile ( #4240 )
...
* add a torch-compile backend
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* readme changes
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* plumb torch-compile through build_and_run_ad.py
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* plumb torch-compile through build_and_run_ad.py
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* plumb torch-compile through build_and_run_ad.py
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* add torch-cudagraph backend
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* update readme
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* update readme
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* further enhanced compiler backends
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* further enhance readme
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* better specified defaults in simple_config.py
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* fix typo in simple_config.py
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* updated deepseek-v3 support
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* revert accidental deletion in AD Readme
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 08:38:15 +08:00
Yechan Kim
c6e2111f4e
feat: enhance trtllm serve multimodal ( #3757 )
...
* feat: enhance trtllm serve multimodal
1. made the load_image and load_video asynchronous
2. add image_encoded input support to be compatible with genai-perf
3. support text-only input on multimodal models (currently, Qwen2-VL & Qwen2.5-VL)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* add test
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* fix bandit
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* trimming utils
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* trimming for test
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* genai perf command fix
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* command fix
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* refactor chat_utils
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* stress test genai-perf command
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
---------
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-05-15 16:16:31 -07:00
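The asynchronous loading mentioned in point 1 above can be sketched as follows; the helper names are hypothetical and only illustrate fetching the media inputs of one request concurrently instead of sequentially.

import asyncio

async def load_image(url: str) -> bytes:
    # Stand-in for network / disk I/O and decoding.
    await asyncio.sleep(0.01)
    return url.encode()

async def load_media(urls):
    # All images/videos of a request are fetched concurrently.
    return await asyncio.gather(*(load_image(u) for u in urls))

print(asyncio.run(load_media(["a.png", "b.png", "c.mp4"])))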
Iman Tabrizian
4c7191af67
Move Triton backend to TRT-LLM main ( #3549 )
...
* Move TRT-LLM backend repo to TRT-LLM repo
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Address review comments
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* debug ci
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Update triton backend
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Fixes after update
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-16 07:15:23 +08:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism ( #4034 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
Venky
adb0839a33
test(perf): Add Phi-4-mini-instruct to perf tests ( #4267 )
...
* add phi-4-mini-instruct
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
* trim tests
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
---------
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-05-15 21:27:03 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. ( #4244 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
Yanchao Lu
5ce1102a02
Revert "[test] add qa test mentioned in docs" ( #4355 )
...
Revert "[test] add qa test mentioned in docs (#4248 )"
This reverts commit b0ce1371ee .
2025-05-15 18:47:30 +08:00
Stanley Sun
9d3e05486b
test: add qa test list for rtx5090 and rtx_pro_6000 ( #4254 )
...
* add test list for rtx5090 and rtx_pro_6000
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* add 2gpu llama70b test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* remove duplicate and invalid test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
* add 2gpus test cases
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
---------
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-05-15 17:57:31 +08:00
zhhuang-nv
d6b741ddfe
[fix] test_no_kv_cache_reuse for overlap_scheduler ( #4350 )
...
fix test_no_kv_cache_reuse for overlap_scheduler
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-15 16:43:53 +08:00
xinhe-nv
14bfb5e0d6
test: FIX test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus ( #4283 )
...
* update test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* skip llava-v1.6-mistral-7b-hf-vision-trtllm on L40S
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-15 15:57:44 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA ( #3571 )
...
* support kv cache reuse for MLA
load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* add CI test
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* use GPTJ style RoPE for MLA
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix rebase error and some docs
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix kv_lens
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* tiny fix
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix torch compile
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix: use normal device memory instead of pinned memory for unit test
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
* fix L0 tests
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* fix torch compile after rebase
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
* resolve comments again
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
---------
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
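A small sketch of the reuse path outlined above: the cached compressed_kv and k_pe are loaded and up-projected into per-head K/V, giving the 192/128 head sizes consumed by the context kernel. All dimensions and weights below are placeholder assumptions based on typical DeepSeek-style MLA configurations, not the TensorRT-LLM kernels.

import torch

tokens, heads, kv_lora_rank, rope_dim, head_dim = 8, 4, 512, 64, 128
compressed_kv = torch.randn(tokens, kv_lora_rank)    # cached and reusable
k_pe = torch.randn(tokens, rope_dim)                 # cached rotary part, shared across heads
w_uk = torch.randn(kv_lora_rank, heads * head_dim)   # up-projection weights (placeholders)
w_uv = torch.randn(kv_lora_rank, heads * head_dim)

k_nope = (compressed_kv @ w_uk).view(tokens, heads, head_dim)
v = (compressed_kv @ w_uv).view(tokens, heads, head_dim)                        # 128 per head
k = torch.cat([k_nope, k_pe.unsqueeze(1).expand(-1, heads, -1)], dim=-1)        # 192 per head
print(k.shape, v.shape)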
Kaiyu Xie
b4e5df0ee0
Breaking change: perf: Enable scheduling overlap by default ( #4174 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-15 14:27:36 +08:00
dominicshanshan
404fbe9b32
[ https://nvbugs/5277113 ][fix]genai-perf API change stress test ( #4300 )
...
* fix bug 5277113.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* fix bug 5277113 and 5278517.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-05-15 14:12:34 +08:00
Ivy Zhang
b0ce1371ee
[test] add qa test mentioned in docs ( #4248 )
...
* add nemotron-h and llama_70b cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* trial
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add llm decoder quick_start case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update nemotron-h test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add qwen3 quickstart test
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add trtllm_decoder accuracy test
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove quickstart test for llm_decoder
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
---------
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-15 13:37:11 +08:00
hlu1
3ea42e7519
[test] Reorganize TestDeepSeekR1::test_nvfp4_8gpus ( #4346 )
...
Reorganize TestDeepSeekR1::test_nvfp4_8gpus
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-15 13:09:13 +08:00
Zeyu WANG
2681b26e48
[TRTLLM-2795] feat: Add yarn support for other models in trt-flow ( #3840 )
...
Add yarn support for general models (e.g. llama, qwen) other than deepseek in trt-flow.
Signed-off-by: Zeyu Wang <zeyuw@nvidia.com>
2025-05-15 11:03:57 +08:00
Mike Iovine
f9adac3dea
[feat] Enable chunked context for flashinfer ( #4132 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-15 10:59:38 +08:00
QI JUN
498ce8a056
Revert "feat: Low Precision Allreduce for PCIe based GPU" ( #4340 )
...
Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851 )"
This reverts commit 5e634dd1bd .
2025-05-15 09:52:39 +08:00
hlu1
7fb0af9320
[fix] Remove stale cublas heuristics ( #4326 )
...
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>
2025-05-14 17:35:51 -07:00
Robin Kobus
d31fefde2c
[TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow ( #4092 )
...
* chore: Remove GptSession/V1 from TRT workflow
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove stateful decoders
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession buffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession utils
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession kernels
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove V1 GPT models from tests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSessionBenchmark from scripts and docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove gptSession IO classes
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from test lists
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove GptSession from docs
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove useless encoder test
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove mActualBatchSize from DecoderState
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove static batching from ExecutorTest
- Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter.
- Adjusted related test functions to reflect the changes in parameter lists.
- Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-14 23:10:04 +02:00
sugunav14
7c828d767f
feat: [AutoDeploy] DSV3 mla attn ref op ( #4272 )
...
* raw ref op + new patch untested
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
* Added mla attn ref op and unit tests for attn + module patches
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
* update stray changes in deepseek.py
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
* Updated stale documentation
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
* removed stray update in sdpa return shapes
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
---------
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-05-15 01:58:20 +08:00
Faraz
42de79d49e
test: Added tests for Llama3.1-70B-BF16 on SM120 ( #4198 )
...
* Added tests for Llama3.1-70B-BF16 on SM120
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
* solve conflicts add more tests
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
---------
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-05-14 11:57:49 -04:00
Yanchao Lu
504f4bf779
[Infra] - Update the upstream PyTorch dependency to 2.7.0 ( #4235 )
...
[Infra][TRTLLM-4941] - Update the upstream PyTorch dependency to 2.7.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-14 22:28:13 +08:00
Kaiyu Xie
6c45586c51
chore: Remove deprecated Python runtime benchmark ( #4171 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-14 18:41:05 +08:00
HuiGao-NV
f4059c6e2e
Add test case for kv memory estimation ( #4158 )
...
* Add test case for kv memory estimation
* Dump running log into file and parse kv cache memory size from file
* Set bigger peak memory size for mixed precision case and test_ptp_quickstart_advanced_eagle3 case
* Revert change to usage of fraction
* use context manager to guard temp files
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-14 18:39:25 +08:00
xinhe-nv
f2bfe2f84f
test: [CI] remove closed bugs ( #4207 )
...
update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-14 17:59:05 +08:00
DylanChen-NV
206f82115d
[bug/5247505] fix: CP accuracy on Blackwell ( #4188 )
...
* fix xqa params for cp
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* add test
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* add test
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* try adding B200 multi gpu test
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* add accuracy tests for cp
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
---------
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-05-14 17:40:50 +08:00
Anurag Mukkara
b15f57763d
tests: PyTorch multimodal using keyword match ( #4215 )
...
* keyword accuracy check for pytorch multimodal
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Change keywords for some prompts
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Delete full text answers
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
* Cleanup debug code
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
---------
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-05-14 17:18:43 +08:00
kanghui0204
5e634dd1bd
feat: Low Precision Allreduce for PCIe based GPU ( #3851 )
...
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.
Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-14 16:45:43 +08:00
Yiqing Yan
a66a02a75a
[Infra] Waive L0 test ( #4295 )
...
Waive L0 test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-14 16:38:33 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper ( #4123 )
...
Support DeepSeek-R1 W4A8 on Hopper
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
Zongfei Jing
bb17649517
test: Add UT for moe trtllmgen ( #4258 )
...
* Add ut for moe trtllmgen
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
* Update tests/unittest/_torch/modeling/test_modeling_deepseek.py
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
---------
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-14 15:22:58 +08:00
bhsueh_NV
1a9298bc66
CI: add fp8/fp4 ci on Qwen3-30B-A3B ( #4266 )
...
add fp8/fp4 ci on Qwen3-30B-A3B
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-14 14:38:04 +08:00
brb-nv
8280c3d4f2
feat: Support Gemma3-1b-it in Pytorch workflow ( #3999 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 14:02:44 +08:00
Yi Zhang
86ae506b9d
[fix] Enable pp tests ( #3978 )
...
Fix misrebase issue
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-14 10:51:20 +08:00
Fridah-nv
21dbd163a7
[TRTLLM-5188] fix: [AutoDeploy] unwaive AD build test ( #4273 )
...
* unwaive small build test
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
* unwaive mutigpu/integration tests
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
* fix for torch.compile+flashinfer attention
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
---------
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
2025-05-14 10:40:12 +08:00
brb-nv
1ef117688c
test: Validate FP8 and LoRA for Gemma3 ( #3670 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-13 17:28:02 -07:00
Iman Tabrizian
f408de2d99
Waive disagg kv cache load balancer test ( #4276 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-14 06:03:24 +08:00
brb-nv
cd5b3d21a0
feat: Support Mistral Small 3.1 24B VLM in TRT workflow ( #4183 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-14 03:47:22 +08:00
Yiqing Yan
290649b6aa
[Infra] Waive L0 test ( #4269 )
...
Waive L0 test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-13 23:06:13 +08:00
Yiqing Yan
bfa16a63d4
[Infra] Waive L0 test ( #4268 )
...
Waive L0 test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-13 22:43:17 +08:00
dominicshanshan
44d6adfb68
Waive stress test. ( #4262 )
...
* Waive stress test.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-13 21:01:57 +08:00
Enwei Zhu
8f68d56cc1
[ https://nvbugs/5220763 ] [test] Unwaive Mixtral FP8 TP2 test ( #4252 )
...
unwaive
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-13 15:55:33 +08:00
Yiqing Yan
fda8b0277a
[Infra][TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04 ( #4049 )
...
* [TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix review
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* update images
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* Update jenkins/L0_Test.groovy
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* update image name
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-13 14:59:12 +08:00
ruodil
d555fe2530
test: fix for perf test script issue ( #4230 )
...
fix for perf test script issue
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-13 10:29:20 +08:00
xinhe-nv
0cebc16139
test: [CI] Add failed cases into waives.txt ( #4205 )
...
waive tests
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-13 10:22:42 +08:00
xinhe-nv
7ebae4dcaa
test: [CI] Add failed cases into waives.txt ( #4203 )
...
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waives
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-13 10:08:02 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend ( #4156 )
...
* feat: Add per-request stats support with PyT backend
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Adding unit test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing stats unit test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing test with overlap
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. ( #4047 )
...
* feat: Add heuristic for GroupRMSNorm kernel selection.
Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
(better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
(better block scheduling in large batch scenarios)
The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.
The python operator group_rms_norm will use the heuristic by default.
Users can also explicitly pick the base or large-batch kernel.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Address the comments.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
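As a rough illustration of the selection heuristic described above, the sketch below scores the two kernels with a logistic regression over a few features; the features, weights, and threshold are placeholders, not the trained Compute Capability 9.x/10.x models.

import math

def choose_group_rms_norm_kernel(batch_size, warps_base, warps_large_batch,
                                 weights=(0.004, -0.05, 0.05), bias=-1.0):
    # Features stand in for batch size and the warps each variant would allocate.
    features = (batch_size, warps_base, warps_large_batch)
    z = bias + sum(w * x for w, x in zip(weights, features))
    p_large_batch = 1.0 / (1.0 + math.exp(-z))          # logistic regression score
    return "GroupRMSNormLargeBatch" if p_large_batch > 0.5 else "GroupRMSNormBaseKernel"

print(choose_group_rms_norm_kernel(batch_size=512, warps_base=16, warps_large_batch=8))

In this picture the large-batch variant wins once the batch is big enough that block scheduling, rather than per-block occupancy, dominates, which is what the trained model is meant to capture.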
Enwei Zhu
035d915fea
[TRTLLM-5081] [test] Align parametrize_with_ids to the pytest behavior ( #4090 )
...
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* normalize mtp_nextn
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* update test_durations
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-13 07:41:51 +08:00
wili
eba3623a54
Feat: Variable-Beam-Width-Search (VBWS) part4 ( #3979 )
...
* feat/vbws-part4-v1.8: rebase
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* feat/vbws-part4-v1.9: fix incorrect output when using short output length
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.9.1: remove useless variables
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.9.2:fix incorrect output when using short output length
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.9.3: rebase
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.9.4: rebase
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
* v1.9.5: remove API change
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
---------
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-12 22:32:29 +02:00
Enwei Zhu
c31ca1688c
[ https://nvbugs/5214229 ] [fix] Unwaive lm_head quantization case ( #4222 )
...
unwaive
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 20:23:06 +08:00
Zheng Duan
c9e2a963e0
feat: add kv cache aware router ( #3831 )
...
* kv cache aware router
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* add tests
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* router config
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* eviction test
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
add test
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* eviction detect in worker test
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* move worker tests to single gpu
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* reduce memory fraction
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* fix partial block
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
---------
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-12 07:23:57 -04:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding ( #4066 )
...
* finish
Signed-off-by: Ubospica <ubospica@gmail.com>
* update
Signed-off-by: Ubospica <ubospica@gmail.com>
* update
Signed-off-by: Ubospica <ubospica@gmail.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* exc overlap scheduler
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* add test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix api ref
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Yechan Kim
3e9bda3a09
[feat] Support HyperCLOVAX-SEED-Text language part ( #3902 )
...
* feat: support HyperCLOVAX-SEED-Text language part
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* add Pytorch flow and remove test file
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* revert summarize
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* fix summarize
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* remove from pytorch example
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
---------
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-05-12 16:05:14 +08:00
Zhenhuan Chen
9212e9a740
[TRTLLM-4911] feat(scaffolding): make sampling_params only settable by controller ( #4151 )
...
feat(scaffolding): make sampling_params only settable by controller
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-05-12 15:29:09 +08:00
Ivy Zhang
ee92edf2b4
[ https://nvbugspro.nvidia.com/bug/5270564 ][test] skip pre-hopper for llama4 ( #4211 )
...
skip pre-hopper for llama4
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-12 15:27:15 +08:00
ruodil
9c03a7ab74
test: add llama_3.2_1B model and fix for test lora script issue ( #4139 )
...
* test: add llama_v3.1_8b_fp8 model, llama_v3.1_405b model and llama_nemotron_49b model in perf test, and modify original llama models dtype from float16 to bfloat16 according to README.md
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* add llama_3.2_1B model and fix for lora script issue
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
2025-05-12 14:51:59 +08:00
xinhe-nv
849d9c343c
tests: https://nvbugs/5219534 remove failed tests from test list ( #4113 )
...
remove unsupported tests
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-12 14:13:40 +08:00
Yiqing Yan
3c54e84e47
[Infra] Waive L0 test ( #4212 )
...
Waive L0 test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-12 11:37:49 +08:00
QI JUN
f021afa241
[CI] waive two multi-gpu test cases ( #4206 )
...
waive two multi-gpu test cases
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-12 08:04:48 +08:00
Enwei Zhu
7db368c72c
test: Remove CNN Dailymail tasks in favor of GSM8K ( #4187 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-10 09:02:07 +08:00
Dom Brown
2d0f93a054
Refactor: Restructure C++ tests for better modularisation of non-shared code ( #4027 )
...
* Refactor: Restructure C++ tests for better modularisation of non-shared code
Start cleanup of pytest code for C++ tests
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Clean up names and remove references to test_cpp.py
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
WIP
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Move multi-GPU code
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Update doc and try un-waiving
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Update multi GPU file check
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Address minor multi-GPU setup bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-09 19:16:51 +01:00
Mike Iovine
4b8ba7ad61
[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue ( #4069 )
...
[fix] Fix llama 4 test lists
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-09 22:45:14 +08:00
Tracin
446f62bbab
chore: Deprecate evaltool ( #4173 )
...
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-09 20:31:53 +08:00
ruodil
bf5b2a2e0a
test: amend regex match for perf throughput ( #4186 )
...
amend regex match for perf throughput
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
2025-05-09 17:33:25 +08:00
WeiHaocheng
0f01826dde
feat: support task collection to collect information ( #3328 ) ( #3824 )
...
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-09 17:09:01 +08:00
xinhe-nv
9082411a50
test: [CI] Add failed cases into waives.txt ( #4165 )
...
waive oom tests
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-09 16:56:30 +08:00
ruodil
5ce5b81281
test: amend default pytorch extra-llm-api-config.yml in perf test ( #4176 )
...
* amend default pytorch extra-llm-api-config.yml
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* add print info to separate cases in output log
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
2025-05-09 16:46:48 +08:00
xinhe-nv
1d26a3fd7c
test: skip tests on b200 ( #3913 )
...
* skip tests on b200
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* skip phi-3-128k
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-09 14:51:55 +08:00
Fanrong Li
77f8e43592
[fix] Fix relaxed acceptance to support enabling it in context phase ( #4126 )
...
* fix relaxed acceptance to support enable this feature in context phase.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
* fix sample_and_accept_draft_tokens unit test.
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
---------
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 14:11:14 +08:00
Bo Li
e3cf3fd15f
test: Add fp8kv to DS-v3-lite integration tests. ( #3950 )
...
* Add fp8 kv cache tests to DSV3-Lite integration tests.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Refactor. Make fp8kv parallel to attention_dp, overlap_scheduler and cuda_graph.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update gsm8k.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update CI list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update TestDeepSeekR1.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Fix test list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Need quant_config besides pytorch_config.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update waive list (bug 5239087).
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update waive list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Correct test name.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
* Update waive list.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
---------
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-09 13:35:04 +08:00
Ivy Zhang
c91d03fa0a
test: move mistral / mixtral test cases in QA test list into the new accuracy test suite ( #3440 )
...
* add mistral-7b-v0.1 torch flow test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* rearrange mistral
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* rearrange mixtral case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove api function test
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* move mistral nemo cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* move mixtral cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update threshold
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix failure
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix name
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix failure cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update list
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update threshold
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove awq llmapi test
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* adjust threshold
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix ci
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix partial comments
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix path
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update thres
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove duplicate test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix ci
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
---------
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-09 13:32:02 +08:00
Ivy Zhang
c2d4c2adb6
[ https://nvbugspro.nvidia.com/bug/5260676 ]test: skip fp8 quantization case for pre-ada ( #4095 )
...
skip pre ada
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-09 13:30:16 +08:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support ( #3804 )
...
Add Piecewise CUDA Graph Support
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00
Stanley Sun
fb31f91e15
test: add qwen3 and disaggregated serving accuracy tests to qa test list ( #4083 )
...
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-05-09 11:03:02 +08:00
Yukun He
5b61486d87
chore: Clean up the legacy DeepseekAllreduceFusionOp. ( #4081 )
...
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 10:20:41 +08:00
pcastonguay
836c142e1b
[feat] Allow overriding cli args with yaml file in trtllm-serve ( #4164 )
...
feat: Allow overriding cli args with yaml file in trtllm-serve
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-08 21:19:05 -04:00
Mike Iovine
9afe510367
[fix] Fix llama4 + eagle3 ( #3998 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-08 19:20:27 -04:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main ( #4086 )
...
* feat: TRT-LLM Gen FP8 MoE Llama4
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* feat: TRT-LLM Gen llama4 MoE Top1 routing
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
* feat: add per tensor FP8 TRT-LLM Gen GEMMs
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
* Update
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Update
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Add guard for routingIndicesClusterKernel
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Guard sm90+ for routingkernels
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
* Guard sm90+ for routingkernels
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
---------
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
shaharmor98
7d94c9561f
feat: support multi lora adapters and TP ( #3885 )
...
* support multi lora, tp
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-08 23:45:45 +08:00
Yuan Tong
5b93273156
feat: adopt new logprob definition in PyTorch flow ( #4057 )
...
feat: align logprob definition of PyTorch flow
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
2025-05-08 20:16:40 +08:00
Enwei Zhu
74df12bbaa
[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtllm-eval ( #3946 )
...
* fix formula
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* update doc
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* 1st version
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* polish
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-08 19:35:23 +08:00
Ivy Zhang
7666bec7c4
[TRTQA-2861][test]: add nemotron and llama4 cases into qa test ( #4053 )
...
* add MMLU, GPQADiamond check for llama-4 models
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add nemotron cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* add online quant test cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* remove trt flow cases
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update threshold
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* adjust parallelism strategy
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix fail
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* update sanity list
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* fix comment
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
* skip nemotron-h test case
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
---------
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-08 18:10:41 +08:00
xinhe-nv
4468158be4
test: [CI] remove closed bugs ( #4046 )
...
update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-08 18:04:43 +08:00
Yiqing Yan
ce8832e80f
[Infra] Waive L0 flaky test ( #4148 )
...
Waive L0 test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-08 17:23:45 +08:00
yuanjingx87
6e1d2a1320
feat: Add Slurm support and enable RTX Pro 6000 testing pipeline in CI ( #4019 )
...
* Add slurm support with RTXPro6000 PostMerge Tests
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
* remove H100 post merge test from testing
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
---------
Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
2025-05-08 15:15:36 +08:00
Enwei Zhu
dae6781494
test: Waive disagg accuracy test ( #4124 )
...
* waive
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* waive
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-08 13:39:07 +08:00
Enwei Zhu
19eeef1ddd
test: Waive test_llm cases ( #4136 )
...
* waive
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* Apply suggestions from code review
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-08 11:53:16 +08:00
Ivy Zhang
d7c51c953b
test: add INTEGRATION_TEST env var to speed up integration test ( #3618 )
...
add INTEGRATION_TEST env var
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-05-08 10:44:50 +08:00
ruodil
4d0e462723
tests: skip writing prepare_dataset output to logs, and add llama_v3.1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models ( #3864 )
...
* tests: skip writing prepare_dataset output to logs
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
* test: add llama_v3.1_8b_fp8 model, llama_v3.1_405b model and llama_nemotron_49b model in perf test, and modify original llama models dtype from float16 to bfloat16 according to README.md
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
---------
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-07 13:56:35 +08:00
Yan Chunwei
0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) ( #3732 )
...
* beam_width and max_new_token
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* remove beam_width
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* remove min_length
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* remove return_num_sequences
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
Enwei Zhu
c28b90984f
[TRTLLM-3925, https://nvbugs/5245262 ] [fix] Normalize LLM.generate API ( #3985 )
...
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-07 11:06:23 +08:00
Venky
62fea1e885
test(perf): Add Llama-3.1-Nemotron-8B-v1 to perf tests ( #3822 )
...
* **Model:** Llama-3.1-Nemotron-Nano-8B-v1
* **Precision:** float16
* **Environment:**
* GPUs: 1 H100 PCIe
* Driver: 570.86.15
* **Test String:** `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:128,128`
* **Request Throughput:** 81.86 req/sec
* **Total Token Throughput:** 20956.44 tokens/sec
* **Average Request Latency:** 5895.24 ms
* **Test String:** `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:2000,2000`
* **Request Throughput:** 1.45 req/sec
* **Total Token Throughput:** 5783.92 tokens/sec
* **Average Request Latency:** 211541.08 ms
* **Test String:** `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:128,128`
* **Request Throughput:** 52.75 req/sec
* **Total Token Throughput:** 13505.00 tokens/sec
* **Average Request Latency:** 5705.50 ms
* **Test String:** `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:2000,2000`
* **Request Throughput:** 1.41 req/sec
* **Total Token Throughput:** 5630.76 tokens/sec
* **Average Request Latency:** 217139.59 ms
Signed-off-by: Venky Ganesh <gvenkatarama@nvidia.com>
2025-05-06 17:17:55 -07:00
Erin
cba1793cda
cleanup logprob params ( #4039 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-07 00:50:16 +08:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime ( #3625 )
...
* disable overlap in encoder
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: invokeGatherBatch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: overlap same batch
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: add enableTrtOverlap to ExecutorConfig
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* disable overlap for beam search and spec decode
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* skip overlap tests with beam search or speculative decoding
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* enable overlap in GptChunkedLongContextTests
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Enable overlap in gptManagerBenchmark
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Improve early exit
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use OptionalRef for newOutputTokens tensor
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* feat: Add overlap scheduling support to TRTLLMDecoder
- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* fix: allNewTokens in PP
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
dominicshanshan
3ac6637005
fix: trtllm-serve hang in stress test and ds v3 stress parameter update ( #3836 )
...
* Remove stdout pipe for genai-perf and make the stress time a public parameter.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* Update llmRequest based on comment.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
* launch process function refactor.
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
---------
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-05-06 16:52:30 +08:00
Jinyang Yuan
eeb6c0c83f
[fix] Loosen the thresholds of test_attention_mla ( #4074 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-06 11:31:09 +08:00
Suyog Gupta
ac2ab9ba36
[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy ( #4024 )
...
* reuse batch_indices, positions across layers
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* fix flashinfer unit tests
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* simplify call to get_batch_indices_positions
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
* fix call to get_batch_indices_positions
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
---------
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
2025-05-06 10:46:36 +08:00
bhsueh_NV
5c0f554b9e
doc: update qwen3 document ( #4073 )
...
* update qwen3 document
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* remove useless codes
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-06 08:42:51 +08:00
bhsueh_NV
e053cb651b
Fix: fix bug of qwen3 moe ( #4058 )
...
* fix bug of qwen3 moe
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* update threshold
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-06 08:20:15 +08:00
pansicheng
e84dc6b3c7
feat: add deepseek-r1 reasoning parser to trtllm-serve ( #3354 )
...
* add deepseek-r1 reasoning parser
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
* fix test
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
---------
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-06 08:13:04 +08:00
Iman Tabrizian
85867d76dd
test: Add disaggregated serving accuracy tests ( #4036 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-05 08:56:59 -07:00
Yanchao Lu
5ee38ad92a
[Test]: Clean up stale waives ( #4062 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-05 22:13:12 +08:00
Yanchao Lu
ddfb0fe4e2
[Test]: Waive unsupported tests ( #4059 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-05 20:51:49 +08:00
Yiqing Yan
b5c2327aa0
Waive L0 tests ( #4051 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-05-05 12:53:21 +08:00
Yukun He
aa38e28cfa
fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and pp are enabled. ( #3988 )
...
* Fix AllReduce kernel hang issue when both tp and pp are enabled.
Allocate one workspace for each pp rank to avoid potential race.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* update waive list
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-05 11:33:25 +08:00
Yan Chunwei
bc0cf41592
chore: refactor llmapi e2e tests ( #3803 )
...
* refactor llmapi e2e tests
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-05 07:37:24 +08:00
Emma Qiao
2692daad2e
infra: Remove the WAR for incomplete test items ( #3313 )
...
* Remove the WAR for incomplete test items
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Complete test item manually
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test definition file
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Complete test name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix some other test names
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Update name for waived case name, too
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix name for multi-gpu tests
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix another test name
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix typo
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix test name after rebase
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix other qa tests
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
* Fix tests name after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix name after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Correct test names in waive.txt
Signed-off-by: qqiao <qqiao@nvidia.com>
* Add new test_durations file
Signed-off-by: qqiao <qqiao@nvidia.com>
* Fix names after rebase
Signed-off-by: qqiao <qqiao@nvidia.com>
* Update test duration to latest
Signed-off-by: qqiao <qqiao@nvidia.com>
---------
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-04 11:31:59 +08:00
Robin Kobus
403370af62
refactor: Move ModelSpec to core library ( #3980 )
...
* refactor: Move ModelSpec from tests to core library
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Move ModelSpec from runtime to a separate dir
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Use new bindings path and clean up
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Updated licenses
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* chore: Remove script_dir from path
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-04 01:39:09 +08:00
qixiang-99
bf4f7ad744
feat: add Pytorch support of Vision Encoder for multimodal models ( #3791 )
...
* feat: Add rename_weights_with_regex function for dynamic weight key renaming
Introduced a new utility function to rename weight keys in a dictionary based on regex pattern matching. This allows for flexible mapping of keys from Hugging Face naming conventions to TRT-LLM naming conventions, enhancing model compatibility and usability.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Implement SiglipVisionModel and related components
Added the SiglipVisionModel along with its associated classes, including SiglipAttention, SiglipEncoderLayer, and SiglipEncoder.
Additionally, a new test suite for the SiglipVisionModel has been created to ensure compatibility with Hugging Face outputs.
Currently SiglipVisionModel supports batch sizes larger than one. Also, input and output shapes match HF's for compatibility.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Add CLIPVisionModel and associated components
Introduced the CLIPVisionModel along with its related classes, including CLIPAttention, CLIPEncoderLayer, CLIPEncoder, and CLIPVisionTransformer. This implementation aligns with Hugging Face's CLIP architecture, ensuring compatibility in input and output shapes.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Enhance CLIPVisionModel with attention metadata preparation and unit tests
Updated the CLIPVisionModel to include a method for preparing attention metadata, simplifying the model's usage. Additionally, added a comprehensive unit test suite for the CLIPVisionModel, ensuring compatibility with Hugging Face outputs and validating model performance across various scenarios.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Refactor SiglipVisionModel with attention metadata preparation and update unit tests
Enhanced the SiglipVisionModel by adding a method to prepare attention metadata, streamlining its usage. Updated unit tests to validate model performance and compatibility with Hugging Face outputs, including adjustments to the configuration and test scenarios.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Remove unused rotary_emb parameter from CLIP and Siglip attention classes
Eliminated the rotary_emb parameter from the CLIPAttention and SiglipAttention classes to streamline the code. Updated unit tests to reflect changes in the model configurations, including clarifications in the default configurations sourced from Hugging Face.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Integrate CLIPVisionModel into LlavaNextInputProcessor and enhance weight loading
Added CLIPVisionModel to the LlavaNextInputProcessor for improved vision processing. Updated the model loading mechanism to ensure compatibility with the new vision model and added attention metadata preparation. Removed debug print statements from weight renaming function for cleaner code.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Remove unused max_position_embeddings from CLIPAttention and update Siglip classes to use CLIP components
Removed the unused max_position_embeddings variable from the CLIPAttention class. Updated the Siglip classes to utilize CLIP components, specifically replacing SiglipEncoder and SiglipAttention with their CLIP counterparts, streamlining the codebase and enhancing consistency across models.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Consolidate weight loading logic into a shared implementation
Refactored the weight loading process across CLIP and Siglip models by using a new utility function, _load_weights_impl, to streamline the loading mechanism. This change enhances code maintainability and reduces redundancy in weight handling, ensuring consistent behavior across different model architectures.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* refactor: Simplify output handling in CLIP and Siglip models by removing output_hidden_states parameter
Removed the output_hidden_states parameter from the CLIPEncoder and SiglipVisionTransformer classes, streamlining the output handling process. Updated the corresponding unit tests to reflect these changes and ensure compatibility with the new output structure.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* feat: Enhance LlavaNextInputProcessor with dynamic model loading and memory optimization
Updated the LlavaNextInputProcessor to support dynamic model loading from local paths or Hugging Face, improving memory efficiency by partially loading the model components. Integrated the LlavaNextMultiModalProjector and adjusted weight loading to ensure compatibility with the new architecture.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
---------
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-03 05:13:47 +08:00
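The first bullet in the entry above describes a utility that renames Hugging Face weight keys to TRT-LLM naming via regex pattern matching. A minimal sketch of that idea, assuming a simple pattern-to-replacement mapping; the function name, signature, and patterns below are illustrative only and do not reproduce the actual TRT-LLM helper:
```python
# Illustrative sketch of regex-based weight-key renaming (not the real utility).
import re
from typing import Dict

def rename_weights_with_regex_sketch(pattern_map: Dict[str, str],
                                     weights: Dict[str, object]) -> Dict[str, object]:
    """Rewrite each key by applying every regex substitution in pattern_map."""
    renamed = {}
    for key, value in weights.items():
        new_key = key
        for pattern, replacement in pattern_map.items():
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = value
    return renamed

# Hypothetical HF -> TRT-LLM mapping and a sample key:
mapping = {r"^vision_tower\.vision_model\.": "vision_model.",
           r"\.self_attn\.": ".attention."}
hf_weights = {"vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": None}
print(rename_weights_with_regex_sketch(mapping, hf_weights))
```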
Mike Iovine
906cddffb0
[infra] Improve llama4 parallelism test coverage ( #3821 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-02 16:15:04 -04:00
bhsueh_NV
561ee44737
add ci and doc for qwen3 ( #4022 )
...
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-05-02 14:13:38 +08:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. ( #3438 )
...
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
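As a rough reference for the grouped semantics sketched in the entry above, the pure-PyTorch function below normalizes each input independently over its last dimension while sharing the batch dimension, per the stated constraint. It is a reference sketch only: the name, signature, and optional per-tensor weights are assumptions, and it reflects none of the warp-group scheduling done by the real CUDA kernels.
```python
import torch

def group_rms_norm_reference(inputs, weights=None, eps=1e-6):
    """Reference-only grouped RMSNorm: each tensor is normalized over its last dim."""
    batch = inputs[0].shape[0]
    assert all(t.shape[0] == batch for t in inputs), "all inputs must share the batch dim"
    outputs = []
    for i, x in enumerate(inputs):
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        y = (x.float() * torch.rsqrt(variance + eps)).to(x.dtype)
        if weights is not None:
            y = y * weights[i]  # assumed optional per-tensor scale
        outputs.append(y)
    return outputs

# Usage mirroring the snippet in the commit message:
a = torch.randn(4, 4096, dtype=torch.bfloat16)
b = torch.randn(4, 1024, dtype=torch.bfloat16)
out_a, out_b = group_rms_norm_reference([a, b])
```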
Lucas Liebenwein
be916b19e0
feat: [AutoDeploy] unfusing attention for native support ( #3668 )
...
* [AutoDeploy] unfused streamlined attention + caching
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* improved unit testing
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* reviewer feedback
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* some updates to attn_mask handling
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* updated manual benchmarking and cudagraph capture
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-02 09:06:49 +08:00
Yukun He
a1645c922b
Fallback to NCCL for various patterns when input size is large. ( #4009 )
...
When the input size is larger than the max workspace size, we fall back to NCCL plus the corresponding pre/post processing functions to preserve the functionality of AllReduce.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 15:17:16 -07:00
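A hedged sketch of the fallback policy described in the entry above: when the reduction input no longer fits the custom kernel's workspace, run a plain NCCL all-reduce and wrap it with the pattern's pre/post processing. Every name here (max_workspace_bytes, fused_allreduce, pre_fn, post_fn) is a hypothetical stand-in for the dispatch that actually lives inside the AllReduce op.
```python
import torch
import torch.distributed as dist

def allreduce_with_fallback(tensor: torch.Tensor,
                            max_workspace_bytes: int,
                            fused_allreduce,
                            pre_fn=None,
                            post_fn=None) -> torch.Tensor:
    size_bytes = tensor.numel() * tensor.element_size()
    if size_bytes <= max_workspace_bytes:
        # Small enough for the fused, pattern-specific custom kernel.
        return fused_allreduce(tensor)
    # Too large for the workspace: keep the pattern's semantics by running the
    # pre-processing, a plain NCCL all-reduce, then the post-processing.
    if pre_fn is not None:
        tensor = pre_fn(tensor)
    dist.all_reduce(tensor)  # NCCL backend, in-place
    if post_fn is not None:
        tensor = post_fn(tensor)
    return tensor
```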
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend ( #3145 )
...
* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-01 14:15:30 -07:00
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI ( #3388 )
...
* support return logprob in llmapi
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
update and add test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
stability test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* revert removal of old flag
Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
2025-05-01 12:47:14 -04:00
xinhe-nv
009d5e9fa3
test: [CI] Add failed cases into waives.txt ( #3943 )
...
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* waive test_llm_commandr_v01_single_gpu_summary for GH200
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-01 23:43:11 +08:00
nv-guomingz
dc344b6a4f
fix: https://nvbugs/5246733 ( #3989 )
...
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-05-01 22:52:31 +08:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS ( #3865 )
...
* add relaxed acceptance for DS R1
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* clean and update docs
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* fix
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* Modified based on review
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
* fix mtp manager issue
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
---------
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest ( #3855 )
...
* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unit tests and mm_embedding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for prompt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Yukun He
9cc5922a0b
Clean up allreduce op in Deepseek V3 model. ( #3829 )
...
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 07:56:36 +08:00
Dom Brown
b40f351b7a
[TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests ( #3206 )
...
* Squash of dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Add timer + waive test with suspected GptSession bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Respond to reviewer comments
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
2025-05-01 05:31:08 +08:00
Erin
941e82faa6
waive test_tinyllama_guided_decoding ( #3997 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-04-30 12:46:32 -07:00
Chuang Zhu
1ada3c9800
unwaive disagg tests ( #3925 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-30 16:44:00 +08:00
yuxianq
f568cbb671
chore: Remove duplicated get_sm_version. ( #3935 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-30 11:43:53 +08:00
xinhe-nv
a31afcf3a9
update waive list ( #3890 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-30 11:07:48 +08:00
QI JUN
99929e724b
ci: skip pipeline parallelism test of pytorch flow ( #3947 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-30 01:00:16 +08:00
Pamela Peng
c8649ce3aa
skip blackwell tests for sm120 ( #3815 )
...
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-29 09:53:35 -07:00
xiweny
68a19a33d4
TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 ( #3770 )
...
* upgrade cutlass to 3.9
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
* update internal cutlass kernels
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
* fix file
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
* remove unnecessary change
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
* update hash
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
---------
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
2025-04-29 11:19:11 -04:00
Dom Brown
8709fe8b53
chore: bump version to 0.19.0 ( #3598 ) ( #3841 )
...
test: add test cases for 0.19 release (#3608 )
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642 )
fix: nvbugs/5187237: fix deterministic mode crash (#3448 )
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647 )
tests: change qa perf test to trtllm-bench (#3619 )
fix: FP8 quantized lm_head (NvBug 5214229) (#3567 )
infra: Add PR approval protection for the release branch (#3634 )
fix: nvbugs/5231298: pytorch allreduce issue (#3673 )
Fix: nvbugs/5222698 variable not defined (#3630 )
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685 )
test:restore fp8 kv cache testing for L0 (#3671 )
doc: Update DeepSeek perf docs (#3693 )
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664 )
fix: update test_user_buffers_mm_add_prologue atol (#3711 )
Fix: cherry-pick hmac encryption from main branch (#3635 )
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649 )
---------
Un-waive DS-V3-Lite tests. (#3621 )
fix: FP8 kv accuracy (#3675 )
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622 )
unwaive multi-node test (#3721 )
chore : Split more tests out of gpt tests (#3524 ) (#3674 )
doc:add torch examples link into torch backend documentation (#3749 )
test: Get Eagle tests working (#3593 ) (#3722 )
Waive L0 test (#3756 )
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656 )
Update ds v3 parameters in stress test. (#3676 )
waive gemma on L20 (#3766 )
https://nvbugs/5141291 : Fix convert.py script for Qwen model. (#3758 )
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688 )
remove benchmark test list (#3643 )
skip disagg deepseek test if sm!=90 (#3720 )
test: skip failed cases on B200 (#3710 )
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718 )
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add know issue to deepseek doc. (#3800 )
Fix ModelOpt Mixtral AWQ OOM (#3714 ) (#3761 )
Waive L0 tests (#3826 )
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793 )
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652 )
min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.
---------
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797 )
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-29 16:57:22 +08:00
QI JUN
c381380ecc
increase H100 CI nodes for PyTorch only pipelines ( #3927 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-29 10:58:43 +08:00
qixiang-99
f370dd0e32
refactor(test): remove random context sequence lengths and set seed for reproducibility in attention tests ( #3919 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-04-29 10:08:04 +08:00
Jinyang Yuan
dafc28fb85
fix: Fix FMHA-based MLA in the generation phase and add MLA unit test ( #3863 )
2025-04-29 09:09:43 +08:00
Erin
0577ea0155
waive test_attention_no_cache ( #3921 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-04-28 13:57:01 -07:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues ( #3686 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00
xiweny
f84dd8f815
test: add deepseek v3 & r1 cases ( #3528 )
...
* test: add deepseek v3 & r1 cases
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-04-28 23:37:26 +08:00
Zhenhuan Chen
ad15e45f07
[TRTLLM-4638 ][feat] add best of n support with reward model in scaffolding ( #3807 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-04-28 17:15:33 +08:00
xinhe-nv
82a8e43557
test: [CI] Add failed cases into waives.txt ( #3867 )
...
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waives
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-04-28 14:32:48 +08:00
xinhe-nv
e20b67e9fd
update waives & tests ( #3887 )
...
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-28 14:29:35 +08:00
Yanchao Lu
068c72ebf8
Test: waive intermittent test hang ( #3894 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-04-28 08:53:20 +08:00
Iman Tabrizian
74cc9e26ff
infra: install Triton in the base image ( #3759 )
...
* infra: install Triton in the base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* install Triton from the base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* update base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* Address review comments
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* update base image
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
* waive test
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
---------
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-04-28 07:36:30 +08:00
Yan Chunwei
ad4226d946
fix: trtllm-bench build trt engine on slurm ( #3825 )
...
* add submit_sync to RemoteMpiSessionClient
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
add barrier
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
fix comment
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
disable test
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-27 22:26:23 +08:00
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager ( #3798 )
...
* cacheTransceiver buffer manager
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* fix args
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* cpp kvCacheManager
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
* format
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
---------
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00
milesial
362a8272f8
feat: llama4 input processor ( #3383 )
...
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-25 16:47:14 -07:00
Dom Brown
7ff9fd345c
Test: Split C++ unit tests for CI granularity ( #3868 )
...
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-25 13:30:58 -07:00
qixiang-99
ecd621fb0a
feat: Add head size 72 support for QKV Preprocessing kernel ( #3743 )
...
* refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow
- Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels.
- Added support for head size 72 in unfused attention kernels(QKVPreprocessing).
- Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72).
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
* update: Waive head_dim=72 test cases and enhance test representation
- Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues.
- Introduced a custom __repr__ method in the Scenario class for pytest substring match.
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
---------
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-04-25 11:07:40 -07:00
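The second bullet above mentions a custom __repr__ on the Scenario class so that pytest substring filters can match parametrized cases. A small sketch of that pattern, assuming the test ids are derived from repr (for example via ids=repr); the class, fields, and test below are illustrative and not the actual suite.
```python
from dataclasses import dataclass

import pytest

@dataclass(frozen=True)
class Scenario:
    head_dim: int
    num_heads: int

    def __repr__(self) -> str:
        # Short, stable id so filters like `pytest -k head_dim72` can match.
        return f"head_dim{self.head_dim}_num_heads{self.num_heads}"

@pytest.mark.parametrize(
    "scenario",
    [Scenario(head_dim=72, num_heads=8), Scenario(head_dim=128, num_heads=8)],
    ids=repr,  # use the custom __repr__ as the parametrize id
)
def test_scenario_id_is_filterable(scenario):
    assert "head_dim" in repr(scenario)
```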
Yiqing Yan
238fefc659
[infra] Waive L0 tests ( #3853 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-04-25 17:32:21 +08:00
dongxuy04
16535991b2
feat: Add MNNVL MoE A2A support ( #3504 )
...
* add MNNVL memory mapping support
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add more MPI environment for trtllm-llmapi-launch
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MoE communication and prepare kernels
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add MNNVL AlltoAll support for DeepSeekV3
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* add output dump for throughput benchmark
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* support dynamic kernel launch grid
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
* address review comments #2
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
---------
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-04-25 17:29:08 +08:00
Yuan Tong
57944206ba
feat: return logits in PyTorch flow ( #3221 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:56:03 -07:00
QI JUN
991939a0f4
chore: increase A30 for cpp test ( #3811 )
...
* increase A30 for cpp test
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* enable parallel run test for gpt_executor
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* clean
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* decrease freeGpuMemoryFraction of cpp tests
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* fix
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:34:39 -07:00
Enwei Zhu
777c40e5fa
[ https://nvbugspro.nvidia.com/bug/5238599 ][fix] Normalize example path in accuracy tests
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-24 10:09:59 +08:00
xinhe-nv
476d7003f8
test: [CI] Add failed cases into waives.txt ( #3777 )
...
* update waive list
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
* update waives.txt
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
---------
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-24 09:36:05 +08:00
Mike Iovine
bc5fe7800d
[chore] Fix KV cache block reuse flag name in quickstart_advanced ( #3781 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-24 06:02:47 +08:00
Kaiyu Xie
dfbcb543ce
doc: fix path after examples migration ( #3814 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-24 02:36:45 +08:00
QI JUN
635dcdcb9e
chore: reorganize some unit tests of PyTorch ( #3780 )
...
* reorganize some unit tests of PyTorch
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
* fix ci
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
---------
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-23 11:19:10 -07:00