TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00


Bo Li 515dd0d78f feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00

* fp8 kv + bf16 ctx MLA + fp8 gen MLA
  Use BF16 for context MLA.
  mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together.
  Allow mSM==90 for mFP8GenerationMLA==true.
  For FMHA, dataTypeKv should be FP8.
  For FP8 MLA generation, the output is still in BF16.
  Refine debug info for FMHA kernel metadata.
  Use inputType, outputType, SM together to hash kernel list.
  Add FP8 MLA generation FMHA kernel.
  Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel.
  Separate the implementation of fused_multihead_attention_v2.h to CPP and print some debug info if checkIfKernelExist fails.
  Refine debug info in fused_multihead_attention_v2.cpp.
  Correct FP8 MLA metadata.
  New kernel provided by Yuxin, which outputs BF16.
  smem size is not set correctly, which will lead to illegal mem access.
  Yuxin fixed the error in the FMHA MLA kernel: previously the BF16 output was not written correctly; some parts were written repeatedly, while others were left untouched.
  There are two bmm1 scales that should be set correctly.
  New kernel generated by Yuxin.
  Modifications to common/attentionOp for FP8 MLA on Hopper using FMHA.
  Not necessary. If mFP8GenerationMLA, is_fp8_out is false, so mFP8ContextFMHA is false.
  Skip a check in fmhaDispatcher.
  Modifications in fmhaRunner:
  - Debug dump.
  - if (!isFP8GenerationMLA) skips a lot of flag setting.
  - TMA descriptor modification for qo (by Yuxin).
  Cleanup debug output.
  Clean up o tma descriptor modifications.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Resolve conflicts.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Apply the patch of FP8 FlashMLA and resolve conflicts.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compilation error.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compile error.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* pick blackwell support
  Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
* Add copyright notice to fused_multihead_attention_v2.cpp.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Add license.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Add missing license.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Exclude building flashMLA kernels under sm90.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Revert "Exclude building flashMLA kernels under sm90."
  This reverts commit `f0c859d459`.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Use macro to skip compiling FlashMLA for non sm90 targets.
  Signed-off-by: Bo Li <bobboli0202@gmail.com>
---------
Signed-off-by: Bo Li <bobboli0202@gmail.com>
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: Dylan Chen <ziqingc@nvidia.com>
Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
beamSearchKernels	feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133 )	2025-04-02 12:31:28 +08:00
contextFusedMultiHeadAttention	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00
cutlass_kernels	[feat] open source fp8_blockscale_gemm (#3071 )	2025-04-02 12:12:52 +08:00
decoderMaskedMultiheadAttention	chore: remove usernames from comments (#3291 )	2025-04-05 13:44:28 +08:00
flashMLA	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00
fusedLayernormKernels	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
internal_cutlass_kernels	chore: update internal cutlass library base #2981 and #3165 . (#3308 )	2025-04-07 13:53:02 +08:00
lora	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
selectiveScan	Update TensorRT-LLM (#2820 )	2025-02-25 21:21:49 +08:00
speculativeDecoding	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
trtllmGenKernels	feat: no-cache attention in PyTorch workflow (#3085 )	2025-04-05 01:54:32 +08:00
unfusedAttentionKernels	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
userbuffers	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
weightOnlyBatchedGemv	feat: Update cutlass (#2981 )	2025-03-26 22:36:27 +08:00
allReduceFusionKernels.cu	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
allReduceFusionKernels.h	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
attentionMask.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
attentionMask.h	Update TensorRT-LLM (#2363 )	2024-10-22 20:27:35 +08:00
banBadWords.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
banBadWords.h	Update TensorRT-LLM (#2008 )	2024-07-23 23:05:09 +08:00
banRepeatNgram.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
banRepeatNgram.h	Update TensorRT-LLM (#1598 )	2024-05-14 16:43:41 +08:00
beamSearchKernels.cu	feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133 )	2025-04-02 12:31:28 +08:00
beamSearchKernels.h	feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133 )	2025-04-02 12:31:28 +08:00
buildRelativeAttentionBiasKernel.cu	Update TensorRT-LLM (#1763 )	2024-06-11 16:59:02 +08:00
buildRelativeAttentionBiasKernel.h	Update TensorRT-LLM (#1763 )	2024-06-11 16:59:02 +08:00
CMakeLists.txt	Update TensorRT-LLM (#2936 )	2025-03-18 21:25:19 +08:00
cumsumLastDim.cu	open source 7f370deb0090d885d7518c2b146399ba3933c004 (#2273 )	2024-09-30 13:51:19 +02:00
cumsumLastDim.h	Update TensorRT-LLM (#1725 )	2024-06-04 20:26:32 +08:00
customAllReduceKernels.cu	Update (#2978 )	2025-03-23 16:39:35 +08:00
customAllReduceKernels.h	perf: Add optimizations for deepseek in min latency mode (#3093 )	2025-04-02 09:05:24 +08:00
decoderMaskedMultiheadAttention.cu	Update TensorRT-LLM (#2502 )	2024-11-26 16:51:34 +08:00
decoderMaskedMultiheadAttention.h	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
decoderMaskedMultiheadAttentionUtils.h	Update TensorRT-LLM (#2363 )	2024-10-22 20:27:35 +08:00
decodingCommon.cu	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
decodingKernels.cu	refactor: Improve decoder finalize function (#3077 )	2025-03-28 14:33:59 +08:00
decodingKernels.h	refactor: Improve decoder finalize function (#3077 )	2025-03-28 14:33:59 +08:00
delayStream.cu	Update (#2978 )	2025-03-23 16:39:35 +08:00
delayStream.h	Update (#2978 )	2025-03-23 16:39:35 +08:00
doraScaling.cu	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
doraScaling.h	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
fmhaDispatcher.cpp	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00
fmhaDispatcher.h	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
gptKernels.cu	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
gptKernels.h	Update TensorRT-LLM (#2792 )	2025-02-18 21:27:39 +08:00
groupGemm.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
groupGemm.h	Update TensorRT-LLM (#2562 )	2024-12-11 00:31:05 -08:00
kvCachePartialCopy.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
kvCacheUtils.h	Update TensorRT-LLM (#2582 )	2024-12-16 21:50:47 -08:00
layernormKernels.cu	Update TensorRT-LLM (#1274 )	2024-03-12 18:15:52 +08:00
layernormKernels.h	Update TensorRT-LLM (#1274 )	2024-03-12 18:15:52 +08:00
logitsBitmask.cu	bitmask v3 (#3009 )	2025-03-26 15:21:29 +08:00
logitsBitmask.h	Update TensorRT-LLM (#2532 )	2024-12-04 21:16:56 +08:00
lookupKernels.cu	Update TensorRT-LLM (#1639 )	2024-05-21 17:51:02 +08:00
lookupKernels.h	Update TensorRT-LLM (#1639 )	2024-05-21 17:51:02 +08:00
lruKernel.cu	Update TensorRT-LLM (#1688 )	2024-05-28 20:07:49 +08:00
lruKernel.h	Update TensorRT-LLM (#1688 )	2024-05-28 20:07:49 +08:00
mambaConv1dKernels.cu	Update TensorRT-LLM (#2562 )	2024-12-11 00:31:05 -08:00
mambaConv1dKernels.h	Update TensorRT-LLM (#1954 )	2024-07-16 15:30:25 +08:00
mlaKernels.cu	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00
mlaKernels.h	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00
multiHeadAttentionCommon.h	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 )	2025-04-07 15:14:13 +08:00
noAuxTcKernels.cu	Update (#2978 )	2025-03-23 16:39:35 +08:00
noAuxTcKernels.h	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
penaltyKernels.cu	Update TensorRT-LLM (#2849 )	2025-03-04 18:44:00 +08:00
penaltyKernels.h	Update TensorRT-LLM (#2502 )	2024-11-26 16:51:34 +08:00
penaltyTypes.h	Update TensorRT-LLM (#1554 )	2024-05-07 23:34:28 +08:00
preQuantScaleKernel.cu	open source 3706e7395b9b58994412617992727c8ff2d14c9f (#2010 )	2024-07-24 05:48:06 +08:00
preQuantScaleKernel.h	Update TensorRT-LLM (#1274 )	2024-03-12 18:15:52 +08:00
qserveGemm.h	Update TensorRT-LLM (#2436 )	2024-11-12 15:27:49 +08:00
qserveGemmPerChannel.cu	Update TensorRT-LLM (#2532 )	2024-12-04 21:16:56 +08:00
qserveGemmPerGroup.cu	Update TensorRT-LLM (#2502 )	2024-11-26 16:51:34 +08:00
quantization.cu	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
quantization.cuh	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
quantization.h	update FP4 quantize layout (#3045 )	2025-04-03 13:13:54 -04:00
rmsnormKernels.cu	Update TensorRT-LLM (#2436 )	2024-11-12 15:27:49 +08:00
rmsnormKernels.h	Update TensorRT-LLM (#2436 )	2024-11-12 15:27:49 +08:00
sageAttentionKernels.cu	Update TensorRT-LLM (#2849 )	2025-03-04 18:44:00 +08:00
sageAttentionKernels.h	Update TensorRT-LLM (#2849 )	2025-03-04 18:44:00 +08:00
samplingAirTopPKernels.cu	Update TensorRT-LLM (#2783 )	2025-02-13 18:40:22 +08:00
samplingTopKKernels.cu	Update TensorRT-LLM (#2849 )	2025-03-04 18:44:00 +08:00
samplingTopKKernels.h	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
samplingTopPKernels.cu	chore: remove usernames from comments (#3291 )	2025-04-05 13:44:28 +08:00
samplingTopPKernels.h	Update TensorRT-LLM (#2755 )	2025-02-11 03:01:00 +00:00
splitkGroupGemm.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
splitkGroupGemm.h	Update TensorRT-LLM (#2792 )	2025-02-18 21:27:39 +08:00
stopCriteriaKernels.cu	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
stopCriteriaKernels.h	open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297 )	2024-10-08 12:19:19 +02:00
topkLastDim.cu	Update TensorRT-LLM (#2436 )	2024-11-12 15:27:49 +08:00
topkLastDim.h	Update TensorRT-LLM (#2436 )	2024-11-12 15:27:49 +08:00
unfusedAttentionKernels.cu	fix: fix for cp > kvHeadNum (#3002 )	2025-03-26 12:39:02 +08:00
unfusedAttentionKernels.h	fix: fix for cp > kvHeadNum (#3002 )	2025-03-26 12:39:02 +08:00
xqaDispatcher.cpp	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
xqaDispatcher.h	Update TensorRT-LLM (#2783 )	2025-02-13 18:40:22 +08:00