TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-23 20:23:08 +08:00

Author	SHA1	Message	Date
liji-nv	58e405624a	[https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 (#3952 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-05-19 22:12:25 +08:00
yuxianq	a1daa22970	doc: Add docstring for Attention and MLA module. (#4354 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-05-16 09:37:04 +08:00
zhhuang-nv	97bc680cd8	feat: support kv cache reuse for MLA (#3571 ) * support kv cache reuse for MLA load compressed_kv and k_pe and do up-projection use 192/128 head size MLA context kernel support Blackwell and Hopper now Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * add CI test Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2 Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * use GPTJ style RoPE for MLA Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix rebase error and some docs Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix kv_lens Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * tiny fix Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: use normal device memory instead of pinned memory for unit test Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * fix L0 tests Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile after rebase Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments again Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com> Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-05-15 15:22:21 +08:00
brb-nv	8280c3d4f2	feat: Support Gemma3-1b-it in Pytorch workflow (#3999 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-05-14 14:02:44 +08:00
yuxianq	b35f9a67f9	refactor: Allow models to override apply_qk_norm. (#4078 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-12 19:38:24 +08:00
Mike Iovine	4b8ba7ad61	[fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue (#4069 ) [fix] Fix llama 4 test lists Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-09 22:45:14 +08:00
shaharmor98	7d94c9561f	feat: support multi lora adapters and TP (#3885 ) * support multi lora, tp Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-08 23:45:45 +08:00
hlu1	26a2679217	[Deepseek] Refactor Deepseek Decoder layer (#4016 ) Refactor Deepseek Decoder layer Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-08 01:43:10 +08:00
yuxianq	017701343e	fix: apply rope twice in Qwen3. (#4040 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-05 15:12:45 +08:00
bhsueh_NV	129bf19980	model: support Qwen3 (#4010 ) * add qwen3 dense model pytorch backend support, initial commit solve the results error issue add qwen3 moe model pytorch backend support reformat the code * perf - use flash_infer rmsnorm for qwen3 * feat - support qwen3 moe rmsnorm * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. * Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. -- Forgot to update all modifications. * fix bugs of running qwen3 public models and fp8 models Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs due to rebase Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bugs captured by pre-commi Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of attention Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Co-authored-by: Keddy Jin <jin.gq@aliyun.com> Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com> Co-authored-by: shao <shao@nvidia.com>	2025-05-01 23:12:41 +08:00
yuxianq	0f8ec693b2	fix: get head_dim from model’s config. (#3916 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-29 23:04:29 +08:00
Junhong Liu	06e76020d7	feat: parallel q_b_proj and concat (#3917 ) * add parallel_q_b_proj_and_concat Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> * code cleanup Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> * one gemm/concat and then split the latent_cache and pass them separately to context/gen Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com> --------- Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>	2025-04-29 22:07:05 +08:00
hlu1	cd2bcdc1a9	Fix create_weights in attention (#3692 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-24 07:30:00 +08:00
shaharmor98	49262a62a5	add passing E2E LoRA flow (#3788 ) add passing E2E LoRA flow (#3788) Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-23 18:38:06 +03:00
shaharmor98	5fff8f0935	Add running E2E LoRA flow (#3648 ) * add passing E2E LoRA flow Signed-off-by: Shahar Mor <smor@nvidia.com> * add experimental feature Signed-off-by: Shahar Mor <smor@nvidia.com> * fix llma_args definition Signed-off-by: Shahar Mor <smor@nvidia.com> * decreased manually size of max loras to address OOM Signed-off-by: Shahar Mor <smor@nvidia.com> --------- Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-23 11:19:41 +08:00
hlu1	31624b079a	feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387 ) * Add TRT-LLM Gen MOE to Deepseek fix fused moe rebase bug. Fix atol in test_fp4_gemm_quantize.py fix fused moe rebase bug. Fix FusedMoe. Disable 2nd routing kernel preexit Bump routing reduction to fp32 Disable PDL for fc1 [DEBUG] Lift token limit to 16k [Bugfix] Token limit to 16k + fp32 routing + tanh Make fp8 tileN 8 Fix FP8 MoE + Remove redundent temp output for FP4 [FP8-only] Avoid wasting CTAs for activation kernel fix: unblock FP8 weightloading with trtllm-gen Remove max_token limit for trtllm-gen path perf: avoid type-conversion and fill_ from aten Minor fix Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix rebase issues Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix compile issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * CI clean Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Hao Lu <haolu@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-21 10:01:33 +08:00
yuxianq	5346f53250	feat: Introduce feature properties for attention backend. (#3659 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-19 12:37:27 +08:00
Chang Liu	b8818b45be	fix: llama4: address couple of issues in llama4 attention module (#3491 ) * fix attn module for llama4 * Address comments * Rebase to accommodate latest attn refactor and refactor l4attn * Remove aux_stream from classic attn * Use RMSNorm for L2Norm * Update tensorrt_llm/_torch/models/modeling_llama.py Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Signed-off-by: Chang Liu <lc9114@gmail.com> * Add typing informations for _attn_qkv * Remove redundant comment * Simplify llama4 DecoderLayer logic --------- Signed-off-by: Chang Liu <lc9114@gmail.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-18 01:54:59 +00:00
yuxianq	b9b1c1368c	feat: Support unfused rope in MLA. (#3610 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-17 16:50:49 +08:00
hlu1	b6bae33453	Clean up linear.py, mlp.py, gated_mlp.py (#3553 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-16 12:21:44 -07:00
yuxianq	fd8ded2b2b	feat: Support cos_sin_cache in all cases. (#3517 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-16 13:48:44 +08:00
hlu1	4855431d3d	[Deepseek] Redesign multi-stream API (#3459 ) Signed-off-by: Hao Lu <haolu@nvidia.com>	2025-04-11 23:00:25 -07:00
QI JUN	d167cbd5bb	refactor: remove ParallelConfig in tensorrt_llm._torch.distributed module (#3370 ) * remove tensorrt_llm._torch.distributed.ParallelConfig Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix embedding test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix comments Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * polish Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * rebase Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-04-11 15:34:20 -07:00
hlu1	fbcf954d9c	[MLA] Deallocate tensors after use (#3286 ) Signed-off-by: Hao Lu <haolu@nvidia.com>	2025-04-09 21:36:07 -07:00
danielafrimi	47f5cf6c0d	lora_tests (#3201 ) LoRA tests and layers Signed-off-by: Ubuntu <dafrimi@nvidia.com> Co-authored-by: Ubuntu <dafrimi@nvidia.com>	2025-04-09 18:06:52 +03:00
Mike Iovine	5bdf997963	Add Llama 4 (#3302 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-04-09 03:35:21 +08:00
yuxianq	7225bd8b91	chore: Refine attention backend interface. (#3271 ) Refine attention backend interface. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-09 02:34:53 +08:00
Bo Li	515dd0d78f	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 ) * fp8 kv + bf16 ctx MLA + fp8 gen MLA Use BF16 for context MLA. mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together. Allow mSM==90 for mFP8GenerationMLA==true. For FMHA, dataTypeKv should be FP8. For FP8 MLA generation, the output is still in BF16. Refine debug info for FMHA kernel metadata. Use inputType, outputType, SM together to hash kernel list. Add FP8 MLA generation FMHA kernel. Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel. Separate the implementation of fused_multihead_attention_v2.h to CPP and print some debug info if checkIfKernelExist fails. Refine debug info in fused_multihead_attention_v2.cpp Correct FP8 MLA metadata. New kernel provided by Yuxin, which outputs BF16. smem size is not set correctly, which will lead to illegal mem access. Yuxin fixed the error in FMHA MLA kernel: previously the BF16 isn't correctly written: some parts are repeatedly written, while some others are untouched. There are two bmm1 scales that should be set correctly. New kernel generated by Yuxin. Modificatiosn to common/attentionOp for FP8 MLA on Hopper using FMHA. Not necessary. If mFP8GenerationMLA, is_fp8_out is false, so mFP8ContextFMHA is false. Skip a check in fmhaDispatcher. Modifications in fmhaRunner: - Debug dump. - if (!isFP8GenerationMLA) skips a lot of flag setting. - TMA descriptor modification for qo (by Yuxin). Cleanup debug output. Clean up o tma descriptor modifications. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Resolve conflicts. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Apply the patch of FP8 FlashMLA and resolve conflicts. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compilation error. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compile error. Signed-off-by: Bo Li <bobboli0202@gmail.com> * pick blackwell support Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * Add copyright notice to fused_multihead_attention_v2.cpp. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Add license. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Add missing license. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Exclude building flashMLA kernels under sm90. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Revert "Exclude building flashMLA kernels under sm90." This reverts commit `f0c859d459`. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Use macro to skip compiling FlashMLA for non sm90 targets. Signed-off-by: Bo Li <bobboli0202@gmail.com> --------- Signed-off-by: Bo Li <bobboli0202@gmail.com> Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: Dylan Chen <ziqingc@nvidia.com> Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-07 15:14:13 +08:00
Kaiyu Xie	2631f21089	Update (#2978 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-03-23 16:39:35 +08:00
Kaiyu Xie	3aa6b11d13	Update TensorRT-LLM (#2936 ) * Update TensorRT-LLM --------- Co-authored-by: changcui <cuichang147@gmail.com>	2025-03-18 21:25:19 +08:00
Kaiyu Xie	77d7fe1eb2	Update TensorRT-LLM (#2849 ) * Update TensorRT-LLM --------- Co-authored-by: aotman <chenhangatm@gmail.com>	2025-03-04 18:44:00 +08:00
Kaiyu Xie	2ea17cdad2	Update TensorRT-LLM (#2792 ) * Update TensorRT-LLM --------- Co-authored-by: jlee <jungmoolee@clika.io>	2025-02-18 21:27:39 +08:00
Kaiyu Xie	e88da961c5	Update TensorRT-LLM (#2783 )	2025-02-13 18:40:22 +08:00
Dan Blanaru	16d2467ea8	Update TensorRT-LLM (#2755 ) * Update TensorRT-LLM --------- Co-authored-by: Denis Kayshev <topenkoff@gmail.com> Co-authored-by: akhoroshev <arthoroshev@gmail.com> Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com> Update	2025-02-11 03:01:00 +00:00

34 Commits