TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-25 05:02:59 +08:00

Author	SHA1	Message	Date
chenfeiz0326	5cd8c0f6cc	[None][test] Add perf-sweep scripts (#6738 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-14 14:04:47 +08:00
Tao Li @ NVIDIA	345d3d3524	[None][doc] update moe support matrix for DS R1 (#6883 ) Signed-off-by: taoli <litaotju@users.noreply.github.com> Co-authored-by: taoli <litaotju@users.noreply.github.com>	2025-08-14 13:55:11 +08:00
NVJiangShao	a700646132	[None][fix] Add FP4 all2all unitest and fix a bug for module WideEPMoE (#6784 ) Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>	2025-08-14 13:35:37 +08:00
Yan Chunwei	0132c1db84	[https://nvbugs/5427043 ][fix] request length exceeds max_num_tokens (#6821 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-08-14 13:31:12 +08:00
Bo Deng	d8acca495b	[TRTLLM-6675][infra] Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6623 (#6735 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-08-14 04:36:38 +00:00
jmydurant	4200fa46d1	[None][feat] Add support for Hopper MLA chunked prefill (#6655 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-08-14 10:39:26 +08:00
Zhenhua Wang	868c5d166e	[None][chore] fix markdown format for the deployment guide (#6879 ) Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>	2025-08-13 22:19:11 -04:00
Izzy Putterman	ef53de8eef	[None][feat] Add test for speculative rejection sampler (2-model) (#6542 ) Signed-off-by: Izzy Putterman <iputterman@nvidia.com>	2025-08-13 22:09:35 -04:00
Linda	eb4ed18a63	[None][fix] max_num_sequences argument in nanobind (#6862 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-08-13 19:16:17 -04:00
Mike Iovine	7cba883932	[https://nvbugs/5410399 ][chore] Unwaive mtp llmapi test (#6833 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-08-13 17:38:45 -04:00
Perkz Zheng	58f7783ea4	[https://nvbugs/5394685 ][fix] the bug with spec-decoding + SWA && an accuracy issue related to 2CTA MLA (#6834 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-13 13:55:56 -07:00
Tin-Yin Lai	6c52bb07ff	[https://nvbugs/5302040 ][feat] Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527 ) Signed-off-by: tinyinl <tinyinl@nvidia.com>	2025-08-13 11:19:13 -07:00
danielafrimi	bda42f8c3a	[None][feat] Support running heterogeneous model execution for Nemotron-H (#6866 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-08-13 19:51:19 +03:00
Emma Qiao	c7e6145409	[None][infra] Waive failed cases on main (#6863 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-08-13 09:50:14 -04:00
Anthony Chang	2198587b35	[https://nvbugs/5378031 ] [feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-08-13 21:24:40 +08:00
Zhenhua Wang	8416d7fea8	[https://nvbugs/5412885 ][doc] Add the workaround doc for H200 OOM (#6853 ) Signed-off-by: Zhenhua Wang <4936589+zhenhuaw-me@users.noreply.github.com>	2025-08-13 19:51:38 +08:00
Perkz Zheng	0fad6029f7	[TRTLLM-7093][fix] the perf regression to cvt_fp4 kernels (#6851 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-13 19:13:40 +08:00
Shi Xiaowei	fe7dda834d	[TRTLLM-7030][fix] Refactor the example doc of dist-serving (#6766 ) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-08-13 17:39:27 +08:00
Yukun He	bc5f766e0e	[TRTLLM-4501][feat] AutoTuner tuning config refactor and valid tactic generalization. (#6545 ) * Generalize the definition of tactics so that users can implement more customizable tactic types, making the configurations clearer for each kernel run. * Allow the user not to specify the `gen_tuning_buckets` or the `map_to_tuning_buckets` function. * Other code refactoring. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-08-13 16:25:22 +08:00
Void	1d80df0955	[None][feat] DeepEP LL combine FP4 (#6822 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-08-13 04:20:21 -04:00
Zhou Yuxin	50e5e725e9	[https://nvbugs/5412456 ][fix] Fix an illegal instruction was encountered (#6776 ) Signed-off-by: Zhou Yuxin <yuxinz@nvidia.com>	2025-08-13 15:45:59 +08:00
Aurelien Chartier	2e0081b53e	[#6530 ][fix] Fix script when using calibration tensors from modelopt (#6803 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-08-12 20:41:10 -07:00
Mike Iovine	f68e03e646	[https://nvbugs/5452167 ][fix] Fix ngram padding issue (#6837 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-08-13 11:23:16 +08:00
Yechan Kim	12102e2d48	[TRTLLM-6772][feat] Multimodal benchmark_serving support (#6622 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-08-12 19:34:02 -07:00
Fanrong Li	1bbc0e323b	[None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc (#6811 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-08-13 10:27:57 +08:00
Kaiyu Xie	47806f09d9	feat: Support custom repo_dir for SLURM script (#6546 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: xxi <xxi@nvidia.com>	2025-08-12 22:06:59 -04:00
rakib-hasan	2923eb88a1	[None][fix] Refactoring input prep to allow out-of-tree models (#6497 ) Signed-off-by: Rakib Hasan <rhasan@nvidia.com>	2025-08-12 20:29:10 -04:00
dongxuy04	bd9a6dd9ab	[TRTLLM-7008][fix] fix wideEP weights loading and args (#6789 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-08-12 19:14:20 -04:00
Robin Kobus	45c7518032	[None][refactor] Simplify decoder state initialization (#6559 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-12 21:44:41 +02:00
Robin Kobus	dd11e08d26	[#6187 ][feat] add LayerNorm module (#6625 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-12 21:43:30 +02:00
nvchenghaoz	81f0ded1c4	[None][feat] Add GPT OSS support for AutoDeploy (#6641 ) Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>	2025-08-12 14:03:22 -04:00
Jhao-Ting Chen	a060e12041	[https://nvbugs/5438869 ][fix] Set nvfp4 expert w1 w3 weight scale to the same value if they're not (#6656 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-08-12 20:47:10 +08:00
xinhe-nv	e35fca4272	[TRTQA-2920][chore] improve hang tests (#6781 ) Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>	2025-08-12 18:26:51 +08:00
QI JUN	8845e0f065	[None][fix] fix ci (#6814 )	2025-08-12 02:21:50 -07:00
Shunkangz	ab0d768acf	[None][fix] Fix attention dp log (#6570 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-08-12 04:53:09 -04:00
Liao Lanyu	f7c13a4aa7	[TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp (#6745 ) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>	2025-08-12 16:45:16 +08:00
Sergey Klevtsov	27fc35175e	[None][feat] CUTLASS MoE FC2+Finalize fusion (#3294 ) Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>	2025-08-12 15:56:48 +08:00
Fridah-nv	0dc4b4e699	[#4403 ][autodeploy] Refactor: Move more transformations to new inf optimizer, Add quantization_source to factory interface (#6760 ) Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>	2025-08-11 22:02:46 -07:00
Enwei Zhu	7c686ba8de	[TRTLLM-2285][feat] Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-08-12 09:30:06 +08:00
Ziyi Xiong	b4fcd5f592	[https://nvbugs/5441438 ][fix] Set correct draft length for the cuda graph dummy request (#6701 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-08-12 09:28:47 +08:00
Jinyang Yuan	ead89a0e40	[None][perf] Improve the performance of online EPLB on Hopper by better overlapping (#6624 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-08-12 09:25:13 +08:00
Chang Liu	be9dd4713c	[https://nvbugs/5385987 ][fix] Fix Qwen2 quantization issue by pinning transformers version (#6673 ) Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>	2025-08-11 17:16:49 -07:00
Aurelien Chartier	56bfc3a6d2	[None][chore] Find LLM_ROOT and LLM_BACKEND_ROOT dynamically (#6763 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-08-11 15:18:19 -07:00
rakib-hasan	7ab8112450	[None][fix] Refactoring to avoid circular import when importing torch models (#6720 ) Signed-off-by: Rakib Hasan <rhasan@nvidia.com>	2025-08-11 18:00:42 -04:00
Venky	c9fe07ede6	[TRTLLM-6812][feat] Add standardized GitHub issue templates and disable blank issues (#6494 ) Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>	2025-08-11 13:08:48 -04:00
Zhenhua Wang	7e33ed6d61	[None][chore] always try-catch when clear build folder in build_wheel.py (#6748 ) Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>	2025-08-11 14:02:17 +02:00
Emma Qiao	5145e9d40e	[None][infra] Unwaive an updated case to test (#6791 ) Signed-off-by: qqiao <qqiao@nvidia.com>	2025-08-11 06:47:33 -04:00
Liao Lanyu	a2e9153cb0	[None][doc] Add K2 tool calling examples (#6667 ) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>	2025-08-11 16:25:41 +08:00
bhsueh_NV	83dbc6c75d	[TRTLLM-5532][feat] store the block of context request into kv cache (#6683 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-08-11 16:14:52 +08:00
Martin Marciniszyn Mehringer	9a8195ef88	fix: Ensure that Python stub generation works against libnvidia-ml stubs (#6188 ) Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>	2025-08-11 09:18:17 +02:00

2318 Commits