TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-22 19:52:38 +08:00

Author	SHA1	Message	Date
JadoTu	6bbb43f2b9	[None][feat] Add qwen3-next nvfp4 support (#8526 ) Signed-off-by: jiant <107457950+JadoTu@users.noreply.github.com>	2025-11-06 09:45:44 +08:00
danielafrimi	2b58dba0f6	[https://nvbugs/5524714 ][fix] Fix TP sharding of fused-QKV weight scales in W4A16 AWQ (#8432 ) Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com> Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-11-04 16:42:31 +08:00
Simeng Liu	834a780655	[https://nvbugs/5599086 ][fix] Fix FP8 Linear module for spark (#8707 ) Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-10-29 13:58:19 -07:00
Cheng Hang	15c293a90b	[None][feat] Enable nvfp4 cuda core for sm120 (#8620 ) Signed-off-by: Cheng Hang <chang@nvidia.com>	2025-10-29 12:39:03 +08:00
Simeng Liu	2b27810198	[https://nvbugs/5494718 ][fix] Fix Single GPU Multi-node issue and OOM on DGX Spark (#8514 ) Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-10-24 19:09:07 -07:00
Shijie	928247a3f9	[https://nvbugs/5451205 ][feat] Add cuBLASLt NVFP4 GEMM backend support (#7943 ) Signed-off-by: Shijie Wang <jaywan@nvidia.com>	2025-10-23 15:55:10 +08:00
sychen52	6a6124dcb5	[OMNIML-2336][feat] w4a8 nvfp4 fp8 exports scale factor properly (#8180 ) Signed-off-by: Shiyang Chen <shiychen@nvidia.com> Co-authored-by: Shiyang Chen <shiychen@omniml-a6.nvidia.com>	2025-10-15 13:41:27 +08:00
Yuxian Qiu	3450fe9944	[None][fix] Fix dummy load format for key models. (#7993 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-10-14 11:18:39 +08:00
Yuxian Qiu	48fda86c56	[None][fix] Fix dummy load format for DeepSeek. (#7874 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-09-24 23:03:16 +08:00
Stefan Niebler	a55251bf75	[None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod (#7732 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-09-18 10:30:50 +02:00
Li Min	14e455da3e	[None][fix] Fix CI issue for dsl pkg install (#7784 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-09-18 13:58:20 +08:00
Li Min	b278d06481	[TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op (#7632 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-09-16 14:25:26 +08:00
xiweny	c076a02b38	[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices (#7568 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Signed-off-by: Daniel Stokes <dastokes@nvidia.com> Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> Signed-off-by: Xiwen Yu <xiweny@nvidia.com> Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com> Co-authored-by: Daniel Stokes <dastokes@nvidia.com> Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com> Co-authored-by: Jiagan Cheng <jiaganc@nvidia.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-09-16 09:56:18 +08:00
Dom Brown	fc9d426589	[https://nvbugs/5505402 ] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-09-10 18:30:48 +01:00
sychen52	98a1bffb7c	[OMNIML-2336][feat] Add NVFP4 x FP8 (#6809 ) Signed-off-by: Shiyang Chen <shiychen@nvidia.com>	2025-09-04 09:03:38 -07:00
Aurelien Chartier	93e623b455	[https://nvbugs/5449155 ][fix] Fix DeepSeek R1 weight loading for TP16 (#6913 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>	2025-09-01 11:02:31 +08:00
Tian Zheng	e257cb3533	[None][feat] Support NVFP4 KV Cache (#6244 ) Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-09-01 09:24:52 +08:00
Yukun He	9c5b464fe0	[None][feat] Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time. (#7113 ) Because deep_gemm.gp8_gemm_nt will trigger many JIT processes during the inference phase, we need to sweep these shapes ahead of time. Apply the AutoTuner framework to achieve this and retain the potential capability to tune the swap_ab flag. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-08-25 10:48:31 +08:00
Anthony Chang	2198587b35	[https://nvbugs/5378031 ] [feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-08-13 21:24:40 +08:00
JunyiXu-nv	5f45227a93	[https://nvbugs/5437106 ][fix] Fix llama4 scout TRTLLM attn_backend (#6690 ) Signed-off-by: Junyi Xu <junyix@nvidia.com>	2025-08-08 17:48:23 +08:00
Li Min	d913955952	[TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell (#6616 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-08-08 15:03:48 +08:00
hlu1	8207d5fd39	[None] [feat] Add model gpt-oss (#6645 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-08-07 03:04:18 -04:00
Zongfei Jing	0ff8df95b7	[https://nvbugs/5433581 ][fix] DeepGEMM installation on SBSA (#6588 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-08-06 16:44:21 +08:00
liji-nv	1daa8c3232	[https://nvbugs/5340941 ][https://nvbugs/5375785 ] - fix: Wrap attentio… (#6355 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-08-01 07:38:06 -04:00
Zongfei Jing	7bb0a78631	Deepseek R1 FP8 Support on Blackwell (#6486 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-08-01 10:26:28 +08:00
Yuening Li	e8c068b4b1	[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow (#5850 ) Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com> Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>	2025-07-21 15:17:35 +08:00
danielafrimi	5300a99bd8	W4A8 GEMM (#6005 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-07-20 17:34:57 +03:00
Aurelien Chartier	812243bdd6	feat: add support for Modelopt fp8_pb_wo quantization scheme (#6106 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-07-18 10:35:12 +08:00
DylanChen-NV	74dca0aa7b	[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029 ) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-09 23:16:42 +08:00
DylanChen-NV	5ca2b9bb15	[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615 ) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-07 18:04:57 +08:00
Rashid Kaleem	2b0c87e613	[ModelLoad] Concurrent load model (#5291 ) Signed-off-by: Rashid K <rkaleem@nvidia.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>	2025-07-03 22:18:04 +08:00
Aurelien Chartier	fa95e402a5	feat: add LLmArgs option to force using dynamic quantization (#5346 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-07-01 12:16:09 -07:00
liji-nv	c345f5876c	[feat] Support torch compile for attention dp (#5086 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-01 13:48:52 -04:00
danielafrimi	7a617ad1fe	feat: W4A16 GEMM (#4232 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-07-01 10:36:05 +03:00
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
HuiGao-NV	43192379af	Use backend to replace macro to control enablement of MNNVL all reduce (#4635 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-12 11:22:49 +08:00
hlu1	320195dc0d	[Architecture] Refactor FusedMoE (#4790 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-06-03 14:02:19 +08:00
hlu1	3093c747b7	[Architecture] Redesign Linear module (#4721 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-29 16:05:46 -07:00
shaharmor98	27afcb9928	add changes for fp8, nemotron-nas, API (#4180 ) Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-18 23:27:25 +08:00
Tracin	46c5a56444	Support dynamic per-tensor FP8 (#4250 ) * Support dynamic per-tensor FP8 Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> * Update test cases. Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> --------- Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-05-16 13:33:58 +08:00
yuxianq	4f8afe4cc6	feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-16 04:16:53 +08:00
yuxianq	0e87fcc228	refactor: use x is None instead of x == None. (#4244 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-15 20:00:04 +08:00
shaharmor98	7d94c9561f	feat: support multi lora adapters and TP (#3885 ) * support multi lora, tp Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-08 23:45:45 +08:00
bhsueh_NV	129bf19980	model: support Qwen3 (#4010 ) * add qwen3 dense model pytorch backend support, initial commit solve the results error issue add qwen3 moe model pytorch backend support reformat the code * perf - use flash_infer rmsnorm for qwen3 * feat - support qwen3 moe rmsnorm * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. -- Forgot to update all modifications. * fix bugs of running qwen3 public models and fp8 models Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs due to rebase Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs captured by pre-commi Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of attention Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Co-authored-by: Keddy Jin <jin.gq@aliyun.com> Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com> Co-authored-by: shao <shao@nvidia.com>	2025-05-01 23:12:41 +08:00
hlu1	d2f312b8e4	Fix fp8 kvcache (#3877 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-04-29 10:31:10 +08:00
hlu1	cd2bcdc1a9	Fix create_weights in attention (#3692 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-24 07:30:00 +08:00
hlu1	31624b079a	feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387 ) * Add TRT-LLM Gen MOE to Deepseek fix fused moe rebase bug. Fix atol in test_fp4_gemm_quantize.py fix fused moe rebase bug. Fix FusedMoe. Disable 2nd routing kernel preexit Bump routing reduction to fp32 Disable PDL for fc1 [DEBUG] Lift token limit to 16k [Bugfix] Token limit to 16k + fp32 routing + tanh Make fp8 tileN 8 Fix FP8 MoE + Remove redundent temp output for FP4 [FP8-only] Avoid wasting CTAs for activation kernel fix: unblock FP8 weightloading with trtllm-gen Remove max_token limit for trtllm-gen path perf: avoid type-conversion and fill_ from aten Minor fix Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix rebase issues Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix compile issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * CI clean Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Hao Lu <haolu@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-21 10:01:33 +08:00
hlu1	b6bae33453	Clean up linear.py, mlp.py, gated_mlp.py (#3553 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-16 12:21:44 -07:00
QI JUN	d167cbd5bb	refactor: remove ParallelConfig in tensorrt_llm._torch.distributed module (#3370 ) * remove tensorrt_llm._torch.distributed.ParallelConfig Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix embedding test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix comments Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * polish Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * rebase Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-11 15:34:20 -07:00
danielafrimi	47f5cf6c0d	lora_tests (#3201 ) LoRA tests and layers Signed-off-by: Ubuntu <dafrimi@nvidia.com> Co-authored-by: Ubuntu <dafrimi@nvidia.com>	2025-04-09 18:06:52 +03:00

1 2

61 Commits