Commit Graph

222 Commits

Author · SHA1 · Message · Date
Xiwen Yu
5e7aa76bb4
Merge branch 'user/sm103_trtllmgen' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-06 00:49:23 +08:00
Robin Kobus
a95d9616ba
[#6186][feat] Introduce QKNormRoPEAttention module (#6830)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-09-05 14:04:41 +02:00
Xiwen Yu
2c3f4cbeee
Merge remote-tracking branch 'origin/main' into feat/b300_cu13
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 15:53:43 +08:00
Xiwen Yu
22219bc37e
Add B300 & GB300 CI
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-05 15:29:50 +08:00
Jin Li
2189a2f3ff
[https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… (#7441)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-09-05 10:56:21 +08:00
sychen52
98a1bffb7c
[OMNIML-2336][feat] Add NVFP4 x FP8 (#6809)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-04 09:03:38 -07:00
tomeras91
9c8d2161d0
[None][doc] fix example in docstring (#7410)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-09-02 11:59:49 +03:00
Xiwen Yu
62a78973a8
Merge remote-tracking branch 'origin/main' into user/xiweny/merge_0901
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-02 10:12:30 +08:00
Xiwen Yu
38ef850552
Merge remote-tracking branch 'gitlab/main' into user/xiweny/merge_0901
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-01 11:46:44 +08:00
Aurelien Chartier
93e623b455
[https://nvbugs/5449155][fix] Fix DeepSeek R1 weight loading for TP16 (#6913)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-01 11:02:31 +08:00
Jiagan Cheng
8d5a7ea5b3
[https://nvbugs/5443053][fix] Disable finalize fusion when LoRA is used
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
2025-08-31 18:28:09 -07:00
Tian Zheng
e257cb3533
[None][feat] Support NVFP4 KV Cache (#6244)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 09:24:52 +08:00
Zongfei Jing
a7ed26dd8b
[TRTLLM-6747][feat] Merge add sparse exp and shared exp into local reduction (#7369)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-31 21:20:00 -04:00
Fanrong Li
37a1bd810f
[https://nvbugs/5481385][fix] Fix max_seq_len in cuda graph warmup and intermediate_size in fused_moe_deepgemm (#7345)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-08-29 17:00:43 +08:00
Nikita Korobov
a419b77fb5
[None][fix] mxfp4 padding bug for TRT-LLM and CUTLASS MoE backends (#7214)
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-08-28 10:08:05 -07:00
Zongfei Jing
53163bf1df
[TRTLLM-6876][feat] Add low precision all2all for mnnvl (#7155)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-28 18:26:16 +08:00
Yukun He
bed5bc9f2e
[None][chore] Wrap the swiglu into custom op to avoid redundant device copy. (#7021)
A redundant D2D copy is observed when enabling torch.compile for the Llama model, caused by the swiglu Triton kernel, which adds performance overhead. Wrapping the swiglu op in a custom op avoids this overhead.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-27 13:02:10 +08:00
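
A minimal sketch of the pattern described in the commit above, assuming PyTorch 2.4+ and its torch.library.custom_op API; the demo::swiglu name and the op body are illustrative, not the actual TensorRT-LLM implementation:

    import torch
    import torch.nn.functional as F

    # Registering swiglu as a custom op makes torch.compile treat it as one
    # opaque call, so no extra device-to-device copy is inserted around it.
    @torch.library.custom_op("demo::swiglu", mutates_args=())
    def swiglu(x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: split the last dim in half, gate one half with SiLU.
        gate, up = x.chunk(2, dim=-1)
        return F.silu(gate) * up

    @swiglu.register_fake
    def _(x: torch.Tensor) -> torch.Tensor:
        # Fake (meta) kernel: shape/dtype inference so tracing can proceed
        # without running the real kernel.
        return x.new_empty(x.shape[:-1] + (x.shape[-1] // 2,))
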
Fanrong Li
e12868bc00
[None][fix] Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-08-27 10:35:38 +08:00
Jin Li
028235404b
[TRTLLM-6633][feat] Padding for piecewise cudagraph (#6750)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-26 18:31:33 -04:00
Void
040f4c70d3
[None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-08-27 00:13:13 +08:00
Bo Li
bf1b958f1a
[TRTLLM-7319][perf] Fuse slicing into MoE. (#6728)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-25 16:52:30 -04:00
Yukun He
9c5b464fe0
[None][feat] Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time. (#7113)
Because deep_gemm.fp8_gemm_nt triggers many JIT compilations during the inference phase, these shapes need to be swept ahead of time. The AutoTuner framework is applied to achieve this, and it retains the potential to tune the swap_ab flag.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-08-25 10:48:31 +08:00
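
A rough sketch of the ahead-of-time shape sweep, under the assumption that running each distinct M (token count) once populates deep_gemm's JIT cache; run_fp8_gemm is a hypothetical stand-in for the tuned GEMM call, not the AutoTuner API:

    import torch

    def warmup_fp8_gemm(run_fp8_gemm, n: int, k: int, max_m: int = 8192):
        # Sweep power-of-two M shapes; each distinct M may trigger a JIT
        # build, which is cached, so inference later hits compiled kernels.
        m = 1
        while m <= max_m:
            a = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)
            b = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)
            run_fp8_gemm(a, b)
            m *= 2
        torch.cuda.synchronize()
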
dongxuy04
19a0ea363b
[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP (#6973)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: Dongxu Yang <dongxuy@nvidia.com>
Co-authored-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-08-24 08:15:29 -04:00
tomeras91
c232ba8157
[TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H (#6334)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-08-22 12:15:20 -04:00
ChristinaZ
c7269ea93a
[https://nvbugs/5392414][fix] Add customized default routing method (#6818)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-08-21 16:58:41 +08:00
Robin Kobus
b95cab2a7c
[None][ci] move unittests to sub-directories (#6635)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-20 05:42:22 -04:00
Jinyang Yuan
0e30fe4372
[None][fix] Fix assertion errors of quantization when using online EPLB (#6922)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-19 11:28:36 -07:00
zhhuang-nv
7e135d2ea7
[None][feat] Use Separate QKV Input Layout for Context MLA (#6538)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-08-19 22:04:48 +08:00
Yi Zhang
a15af879ec
[None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic (#6615)
Signed-off-by: yizhang-nv <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-08-19 09:58:44 +08:00
Yuening Li
1f8ae2b2db
[TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>
2025-08-15 17:15:49 -04:00
dongfengy
0ad0b967bb
[None][fix] Make TP work for Triton MOE (in addition to the EP we are using) (#6722)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-15 16:58:42 -04:00
liji-nv
18ccd053d3
[https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… (#6858)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-08-15 11:14:20 -04:00
qianbiao
5c2f0fd03d
[None][feat] Add Tencent HunYuanMoEV1 model support (#5521)
Signed-off-by: sorenwu <sorenwu@tencent.com>
Co-authored-by: sorenwu <sorenwu@tencent.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-08-15 06:56:44 +08:00
Bo Li
26f413ad90
[https://nvbugs/5450262][fix] Fix unsupported alltoall use case (#6882)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-08-14 17:46:54 -04:00
Anthony Chang
2198587b35
[https://nvbugs/5378031][feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-08-13 21:24:40 +08:00
Void
1d80df0955
[None][feat] DeepEP LL combine FP4 (#6822)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-08-13 04:20:21 -04:00
Fanrong Li
1bbc0e323b
[None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc (#6811)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-08-13 10:27:57 +08:00
dongxuy04
bd9a6dd9ab
[TRTLLM-7008][fix] Fix WideEP weight loading and args (#6789)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-08-12 19:14:20 -04:00
Robin Kobus
dd11e08d26
[#6187][feat] add LayerNorm module (#6625)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:43:30 +02:00
Jhao-Ting Chen
a060e12041
[https://nvbugs/5438869][fix] Set nvfp4 expert w1 and w3 weight scales to the same value if they differ (#6656)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2025-08-12 20:47:10 +08:00
Sergey Klevtsov
27fc35175e
[None][feat] CUTLASS MoE FC2+Finalize fusion (#3294)
Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2025-08-12 15:56:48 +08:00
Jinyang Yuan
ead89a0e40
[None][perf] Improve the performance of online EPLB on Hopper by better overlapping (#6624)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-08-12 09:25:13 +08:00
rakib-hasan
7ab8112450
[None][fix] Refactoring to avoid circular import when importing torch models (#6720)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-08-11 18:00:42 -04:00
shaharmor98
14b36e07d7
[TRTLLM-6174][feat] Enable FP32 Mamba SSM cache (#6574)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-08-10 16:27:51 -04:00
dongfengy
d06675071e
[None][fix] WAR GPT OSS on H20 with Triton MOE (#6721)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
2025-08-08 19:47:09 -04:00
JunyiXu-nv
5f45227a93
[https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend (#6690)
Signed-off-by: Junyi Xu <junyix@nvidia.com>
2025-08-08 17:48:23 +08:00
Li Min
d913955952
[TRTLLM-6898][feat] Make fused_moe_cute_dsl work on Blackwell (#6616)
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-08-08 15:03:48 +08:00
NVJiangShao
2f2f5cc72c
[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE (#6231)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-08-08 11:13:42 +08:00
hlu1
8207d5fd39
[None][feat] Add model gpt-oss (#6645)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-07 03:04:18 -04:00
Zongfei Jing
0ff8df95b7
[https://nvbugs/5433581][fix] DeepGEMM installation on SBSA (#6588)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-08-06 16:44:21 +08:00