TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
Yi Zhang	9b616db13b	test: Add fixture to skip tests based on MPI world size (#5028 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-06-16 11:25:01 +08:00
ruodil	2848e012ae	test: add llama4 models for perf test (#5187 ) Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-06-16 11:24:35 +08:00
ruodil	3d22f27063	test: add more cases for llama_v3.3/3.1 70b fp8 and set enable_attention_dp to false to non-deepseek models (#5155 ) Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>	2025-06-16 11:23:20 +08:00
Enwei Zhu	babdd9ce06	test: Add json_mode_eval for guided decoding evaluation (#5179 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-16 10:03:55 +08:00
Yilin Fan	7a5e0fd300	[fix] Fix Llama4 min-latency import error (#5209 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-16 10:03:07 +08:00
Yan Chunwei	c84e41fd9d	fix: build_config in TorchLlmArgs and avoid arbitrary args (#4972 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-06-15 17:51:56 -07:00
amitz-nv	109c426077	Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130 )	2025-06-15 18:54:04 +03:00
Fanrong Li	39bba63758	[TRTLLM-4983] feat: enable overlap scheduler between draft forwards (#4802 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-15 23:09:16 +08:00
qsang-nv	5a01ba5260	use cu for fmha_v2 (#4694 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-15 18:40:44 +08:00
Omer Ullman Argov	4eade3ae33	[fix][test] Speedup Nemotron NAS unittests (#5202 ) Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>	2025-06-15 11:26:03 +03:00
Fanrong Li	159ffc584e	fix: fix cuda graph max batch size for spec decoding cases. (#5076 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-15 14:57:28 +08:00
Kaiyu Xie	dce1dcc4f9	feat: Support post_proc for bench (#5122 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-15 13:02:38 +08:00
Enwei Zhu	63bc62ddf4	feat: Enable EPLB to existing MoE models (#5203 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-15 11:48:06 +08:00
Yuan Tong	6bce7337a9	perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-06-15 07:45:02 +08:00
ixlmar	e055af1bc9	chore: improve disagg test failure detection (#4738 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-06-15 01:28:26 +08:00
Aurelien Chartier	1389f5a4d3	feat: Add support for fp8 rowwise quantization (#4876 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>	2025-06-14 06:37:48 -07:00
2ez4bz	dc52b67492	linting(python): Enable ruff on more files (wave 1/N) (#5140 ) Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>	2025-06-14 19:19:34 +08:00
Tailing Yuan	0b60da2c45	feat: large-scale EP(part 7: DeepEP integration) (#4792 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-14 19:12:38 +08:00
Robin Kobus	443b2eb51f	refactor: Speculative decoding buffers (#5091 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-14 11:39:32 +02:00
yunruis	b99c5ce8c1	Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560 ) Signed-off-by: yunruis <yunruis@nvidia.com> Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com> Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com> Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-14 17:36:22 +08:00
nv-guomingz	3b7b5a5ad5	refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3(torch_compile_config) (#5032 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-14 14:23:13 +08:00
dongxuy04	97657bfda2	optimize memset before alltoall communication (#5188 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-14 10:49:47 +08:00
Aurelien Chartier	82e280f6f3	feat: add multi-node support for Triton with pytorch backend (#5172 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-13 13:27:58 -07:00
Enwei Zhu	5f2785fb90	fix: Fix waive list (#5205 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-13 23:33:23 +08:00
Yilin Fan	06342ffb4d	[feat] Implement model-agnostic one-engine eagle3 (#4778 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-13 08:11:41 -07:00
Mike Iovine	25aa3881d7	[nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len (#4879 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-06-13 11:06:36 -04:00
Perkz Zheng	3d87770e15	[https://nvbugspro.nvidia.com/bug/5295470 ] support headDim 256 for blackwell fmha kernels (#5164 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-13 23:01:01 +08:00
QI JUN	952f33dcad	CI: move all test cases of TensorRT backend into post merge (#5186 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-13 20:48:48 +08:00
Chuang Zhu	8e9937081d	ucxx only use ucp_feature_tag to aviod some issuse on some platform (#4994 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-13 19:14:25 +08:00
yunruis	e5be3a95b3	fix: fix license bug (#5200 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-13 18:58:15 +08:00
yunruis	e96d6863d8	add doc for open-sourced cutlass kernels (#5194 ) Signed-off-by: yunruis	2025-06-13 18:51:27 +08:00
brb-nv	089be8912a	feat: Basic skeleton for Gemma3 VLM (#5108 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-06-13 17:27:04 +08:00
xinhe-nv	30d9d0fa71	test: [CI] Add failed cases into waives.txt (#5178 ) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-06-13 16:38:51 +08:00
nv-guomingz	b959618579	refactor [BREAKING CHANGE]:: remove the redundant use_kv_cache field from PytorchConfig (#5031 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-06-13 16:34:24 +08:00
yunruis	30c5b4183a	refactoring: port customized kernels with public cutlass version (#5027 ) Signed-off-by: yunruis Merge this to unblock others since the full CI has been run through	2025-06-13 16:19:31 +08:00
Yao Yao	12e075eb70	[nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime (#5133 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-06-13 15:53:29 +08:00
Matthias Jouanneaux	514baf1287	[fix] Fix comment to pass guardwords check (#5191 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>	2025-06-13 15:49:59 +08:00
Zheng Duan	4d0a5ad384	chore: gracefully exit disagg process in tests; better startup and logging (#5109 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-13 14:03:55 +08:00
Ivy Zhang	28cd536bd6	[test] Update timeout params in QA test list (#5124 ) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>	2025-06-13 13:40:03 +08:00
Iman Tabrizian	01bd4c00b4	Add two MTP disaggregated test (#4546 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-06-13 12:17:45 +08:00
Daniel Cámpora	dec326ba7d	[fix] Reenable test return logits (#5160 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-13 06:07:22 +02:00
Yibin Li	b79eb34bfe	[fix]: Fall back to HMAC to Avoid IPC Serialization Churn (#5074 ) Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-06-13 11:37:50 +08:00
xinhe-nv	d9be419f45	tests: update tests for b200 (#5180 ) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-06-13 11:25:33 +08:00
ruodil	fa582cbe9a	test: add more cases for rtx_pro_6000_se and add option kv_cache_dtype in perf test (#5083 ) Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>	2025-06-13 11:09:15 +08:00
zhhuang-nv	a891013e3c	[feat] Optimize KV Cache Reuse for MLA (#4869 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-06-13 11:03:05 +08:00
Yuxian Qiu	4ae46b6714	fix: [nvbugs/5324229] Fix broken WInt4AFP8FusedMoEMethod since FusedMoE refactor. (#4930 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-06-13 10:21:32 +08:00
Fanrong Li	38a907aaca	[TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance (#5119 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-13 08:58:44 +08:00
Matthias Jouanneaux	a0b6c635b1	[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-06-13 06:00:02 +08:00
Xiaodong (Vincent) Huang	cc2a1344be	None: fix OOM because of unnecessary mha workspace (#5056 ) Signed-off-by: Vincent Huang <vincenth@nvidia.com>	2025-06-12 21:56:05 +02:00

1363 Commits