Commit Graph

1220 Commits

Author SHA1 Message Date
Jinyang Yuan
20d0649f19
[feat] Support XQA-based MLA on SM120 (#4858)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
2025-06-06 22:32:49 +08:00
Fanrong Li
75d020cf07
fix: fix cuda graph padding for spec decoding (#4853)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-06-06 22:21:42 +08:00
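For context on the fix above: CUDA graph execution requires batches to be padded up to one of the pre-captured graph sizes, and with speculative decoding the draft tokens must be counted before choosing the padded size. A minimal sketch of that selection logic, with all names hypothetical rather than taken from the repository:

```python
import bisect

def pick_padded_batch_size(batch_size: int, captured_sizes: list[int]) -> int | None:
    """Return the smallest captured CUDA graph batch size that can hold
    `batch_size`, or None if the batch must fall back to eager execution.
    `captured_sizes` must be sorted ascending, e.g. [1, 2, 4, 8, 16]."""
    idx = bisect.bisect_left(captured_sizes, batch_size)
    return captured_sizes[idx] if idx < len(captured_sizes) else None

# With speculative decoding, each request contributes 1 + num_draft_tokens
# sequences, so the padded size must be chosen after that expansion:
num_requests, num_draft_tokens = 3, 2
effective = num_requests * (1 + num_draft_tokens)            # 9 sequences
print(pick_padded_batch_size(effective, [1, 2, 4, 8, 16]))   # 16
```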
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
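The memoization above avoids recomputing an identical shuffle index for every expert weight of the same shape. A sketch of the pattern, assuming hypothetical names and an illustrative permutation (the real kernel layout differs):

```python
from functools import lru_cache

import torch

@lru_cache(maxsize=None)
def get_shuffle_index(rows: int, epilogue_tile: int) -> torch.Tensor:
    """Build the row-permutation index once per (rows, epilogue_tile) pair;
    later experts with the same weight shape reuse the cached tensor.
    Assumes rows is divisible by epilogue_tile."""
    blocks = torch.arange(rows).view(-1, epilogue_tile)
    return blocks.t().reshape(-1)  # interleave tile rows (illustrative only)

def shuffle_weight(weight: torch.Tensor, epilogue_tile: int = 16) -> torch.Tensor:
    # Indexing with the cached permutation replaces a per-weight recompute.
    return weight[get_shuffle_index(weight.shape[0], epilogue_tile)]
```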
QI JUN
1b963c17c0
CI: waive test_llm_multi_node_with_postproc (#4977)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-06 14:19:56 +08:00
xinhe-nv
564472168e
test: [CI] Add failed cases into waives.txt (#4966)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-06 10:30:15 +08:00
juney-nvidia
a761cc2f8d
doc: refinement based on Julien's feedback (#4967)
Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-06-06 08:56:14 +08:00
QI JUN
ec50684d80
Revert "fix a bug of global cuda graph dummy request" (#4970) 2025-06-06 08:54:45 +08:00
juney-nvidia
37ac564190
doc: expose Large-scale EP design and implementation tech blog in the main… (#4960)
Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-06-05 22:51:25 +08:00
Yiteng Niu
d2c311c9d3
infra: update jnlp version in container image (#4944)
2025-06-05 22:36:10 +08:00
Kaiyu Xie
5a5427f86e
blog: Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP) (#4958)
Signed-off-by: juney-nvidia <143764042+juney-nvidia@users.noreply.github.com>
Co-authored-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-06-05 22:24:04 +08:00
qsang-nv
180b91f957
update fmha_v2 (#4895)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-05 22:14:28 +08:00
dongjiyingdjy
51652b9b2b
feat: add PositionEmbeddingType=0 to xqa support (#4934)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-06-05 21:50:42 +08:00
QI JUN
bfa877a22e
Fix: fix autodeploy (#4957)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 21:06:55 +08:00
QI JUN
154f7cc40a
fix a bug of global cuda graph dummy request (#4894)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 19:47:40 +08:00
Yiqing Yan
7e921c78b5
Waive L0 tests (#4953)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-05 19:36:48 +08:00
Shunkangz
3eae58ca36
Add disaggregated unittest (#4899)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-05 19:14:31 +08:00
ixlmar
a1526356aa
[TRTLLM-5630] restore free_gpu_memory_fraction=0.9 in tests (#4859)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-05 10:46:29 +01:00
QI JUN
b8c5e3892b
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 17:43:30 +08:00
QI JUN
d5a8079eb6
Revert "[infra] Unwaive unittests/_torch" (#4950) 2025-06-05 17:21:07 +08:00
Lucas Liebenwein
743fb0a159
[AutoDeploy] _AutoDeployLlmArgs as primary config object (#4891)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 17:20:55 +08:00
QI JUN
91e8d43d66
CI: waive test_llm_get_queued_stats (#4945)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 16:44:56 +08:00
ixlmar
6437756da8
fix: handle OOMs during KV cache estimation (#4690)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-05 10:02:26 +02:00
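The OOM-handling fix suggests a catch-and-back-off pattern around the memory probe used for KV cache sizing. A self-contained sketch of that pattern, not the estimator itself; function and parameter names are hypothetical:

```python
import torch

def probe_kv_cache_fraction(fraction: float = 0.9, min_fraction: float = 0.5) -> float:
    """Try to reserve a share of free GPU memory, shrinking the share on OOM
    instead of letting the estimation phase crash."""
    while fraction >= min_fraction:
        try:
            free_bytes, _ = torch.cuda.mem_get_info()
            probe = torch.empty(int(free_bytes * fraction),
                                dtype=torch.uint8, device="cuda")
            del probe
            torch.cuda.empty_cache()
            return fraction
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            fraction -= 0.05  # back off and retry with a smaller share
    raise RuntimeError("could not reserve memory for KV cache estimation")
```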
xinhe-nv
1c3091c63b
tests: [TRTQA-2906] add benchmark serving tests (#4901)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-06-05 14:33:03 +08:00
Netanel Haber
ddbaa5ef80
Only pass fast_build=true to non-pytorch backend (#4920)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-05 13:30:17 +08:00
Yiqing Yan
9ceef983c0
Waive L0 tests (#4927)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-06-05 11:09:01 +08:00
xinhe-nv
50a74a1daa
tests: fix 5273697 (#4685)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-06-05 10:39:21 +08:00
Shiyu Li
b0d287c9b7
[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-06-04 18:26:13 -07:00
Mike Iovine
8433091630
[infra] Unwaive unittests/_torch (#4919)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-05 08:49:37 +08:00
Lucas Liebenwein
f9d45e03a4
[AutoDeploy] deprecate CI post-merge tests and keep them for local testing (#4892)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 08:27:17 +08:00
Yan Chunwei
8e0d96fcc6
fix: LLM invalid arg in a test (#4922)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-05 08:00:32 +08:00
Yuxian Qiu
6b3242654e
fix: Fix broken vanilla moe since FusedMoE refactor. (#4897)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-05 03:56:41 +08:00
Yi Zhang
1fca654bfd
tests: Update gb200 test case (#4754)
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-04 18:49:20 +08:00
ixlmar
2bbb6b5976
chore: introduce KvCacheCreator (#4581)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-04 11:03:17 +02:00
Xianjie Qiao
325ccaae3d
Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors. (#4827)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-06-04 16:36:07 +08:00
Zheng Duan
dd2191c5b3
fix: correct the order of llm request state (#4781)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 14:45:13 +08:00
Joosung Yoon
4954780649
Fix: draft target README and set exclude_input_in_output to False (#4882)
Signed-off-by: Joosung Yoon <joosungy@nvidia.com>
2025-06-03 23:45:02 -07:00
Zhanrui Sun
35e87b99f3
chore: bump version to 0.21.0rc1 (#4896)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-06-04 14:31:18 +08:00
tomeras91
8d31e16877
[TRTLLM-4923][feat] Paged mamba cache (#4822)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-04 09:27:08 +03:00
Omer Ullman Argov
e71de2a13e
chore: Mass integration of release/0.20. (#4871)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-06-04 14:12:27 +08:00
Yan Chunwei
ac20159d32
fix: build_config in TorchLlmArgs and avoid invalid args (#4600)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-04 13:17:29 +08:00
QI JUN
e2eea80c1d
Chore: refine comments on the prepare-inputs method of the model engine (#4837)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-04 12:14:13 +08:00
Yukun He
5fa6fbd989
feat: Enhance AutoTuner inference path and code readability (#4466)
Fix AutoTuner warmup request generation.
* The warmup phase previously created a single request, which is not enough to cover max_num_tokens. Revise it to issue a batch of requests that covers max_num_tokens, eliminating potential fallback cases (see the sketch after this entry).

Refactor the AutoTuner API and reduce host overhead.
* Refine the (min, opt, max) values of the optimization profile setup for get_valid_tactics so that canImplement is defined correctly.
* Refine the cache key assembly process to reduce host overhead and simplify the API.
* Fix lru_cache usage to reduce host overhead.
* Make the tuning config a one-time object initialized in the tunable runner to reduce host overhead.

Improve tuning config readability.
* Use a dataclass to define the tuning config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-06-04 10:53:11 +08:00
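Two of the changes listed above, the dataclass-based tuning config and a warmup phase that covers max_num_tokens with a batch of requests rather than a single one, can be illustrated with a short sketch. All names are hypothetical stand-ins, not the AutoTuner's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TuningConfig:
    """Dataclass-style tuning config (illustrative fields only)."""
    max_num_tokens: int
    candidate_tile_sizes: tuple[int, ...] = (64, 128, 256)

def warmup_batch_sizes(cfg: TuningConfig, step: int = 256) -> list[int]:
    """Generate a batch of warmup sizes spanning up to max_num_tokens, so
    tuning sees every shape bucket and avoids fallback at inference time."""
    sizes = list(range(step, cfg.max_num_tokens + 1, step))
    if not sizes or sizes[-1] != cfg.max_num_tokens:
        sizes.append(cfg.max_num_tokens)
    return sizes

cfg = TuningConfig(max_num_tokens=1024)
print(warmup_batch_sizes(cfg))  # [256, 512, 768, 1024]
```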
Zheng Duan
ded694b1aa
feat: cache reuse support (selective cache transfer) in mla cache formatter (#4749)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 09:56:31 +08:00
Shi Xiaowei
b13f8c9cba
Fix: NVBug 5302895 (#4835)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-04 09:31:39 +08:00
Shunkangz
c835f06371
Refactor the first token response in PD (#4692)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
2025-06-04 09:11:23 +08:00
ChristinaZ
d64af85e8c
Replace memset with data initialization within kernels (#4851)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
2025-06-04 08:56:46 +08:00
Mike Iovine
73389d6531
[fix] Fix llama 4 long context (#4809)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-06-04 07:48:08 +08:00
Perkz Zheng
a089aa3225
[https://nvbugspro.nvidia.com/bug/5300080] Fix a bug in setting attention_chunk_size, and enable chunked attention in the generation phase by default (#4693)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-03 19:02:57 -04:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
rakib-hasan
d0eb47d33a
[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation (#4506)
* refactoring the multimodal input prep
* adding out-of-tree override option (see the sketch after this entry)
* adding exceptional case for llava-next
* fixing typo
* addressing review comments, adding placement option, handling tokenizer variations
* addressing pytest-asyncio behavior change

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-06-03 12:02:07 -07:00
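The "out-of-tree override option" in the bullets above implies a registry whose entries external code can replace. One plausible shape for such a mechanism, with every name a hypothetical illustration rather than the repository's API:

```python
from typing import Callable, Dict

# Hypothetical registry mapping a model type to its multimodal input builder.
_INPUT_BUILDERS: Dict[str, Callable] = {}

def register_input_builder(model_type: str, override: bool = False):
    """Register an input-preparation function; out-of-tree packages can pass
    override=True to replace a built-in builder."""
    def decorator(fn: Callable) -> Callable:
        if model_type in _INPUT_BUILDERS and not override:
            raise ValueError(f"builder for {model_type!r} already registered")
        _INPUT_BUILDERS[model_type] = fn
        return fn
    return decorator

@register_input_builder("llava-next")
def build_llava_next_inputs(prompt, images):
    # llava-next gets a dedicated builder for its variable image-patch count.
    return {"prompt": prompt, "images": images}
```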