TensorRT-LLM/cpp/tensorrt_llm/kernels
Latest commit 4d711be8f4 by Perkz Zheng, 2025-05-26 09:06:33 +08:00
Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564)

* move cubins to LFS
* update cubins
* add sliding-window-attention generation-phase kernels on Blackwell
* address comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
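The headline feature, sliding-window attention (SWA), caps how far back a token can attend: in the generation (decode) phase, the single new query attends only to the most recent `window` entries of the KV cache rather than the full sequence. The shipped Blackwell kernels are prebuilt cubins, so the sketch below is purely conceptual (all names, such as `swaDecodeAttention`, are hypothetical, and the paged/cyclic cache layout is omitted); it shows that the windowing itself reduces to clamping the start of the visible key range before the usual score/softmax/value steps:

```cuda
// Conceptual sketch of generation-phase sliding-window attention, one head
// per block and one new query token (NOT the shipped cubin-based kernel).
// Launch: swaDecodeAttention<<<numHeads, 128, visible * sizeof(float)>>>(...)
// with visible = min(seqLen, window).
#include <cuda_runtime.h>
#include <math.h>

__global__ void swaDecodeAttention(
    float const* q,      // [numHeads, headDim]         query for the new token
    float const* kCache, // [numHeads, seqLen, headDim] cached keys
    float const* vCache, // [numHeads, seqLen, headDim] cached values
    float* out,          // [numHeads, headDim]         attention output
    int seqLen, int headDim, int window)
{
    int const head = blockIdx.x;
    // Sliding window: only keys at positions [start, seqLen) are visible.
    int const start = max(0, seqLen - window);
    int const visible = seqLen - start;

    extern __shared__ float scores[]; // one score per visible key

    // Scaled dot products q . k_t for the visible positions.
    float const scale = rsqrtf(static_cast<float>(headDim));
    for (int t = start + threadIdx.x; t < seqLen; t += blockDim.x)
    {
        float s = 0.f;
        for (int d = 0; d < headDim; ++d)
        {
            s += q[head * headDim + d] * kCache[(head * seqLen + t) * headDim + d];
        }
        scores[t - start] = s * scale;
    }
    __syncthreads();

    // Softmax over the window (one thread, for clarity rather than speed).
    if (threadIdx.x == 0)
    {
        float m = -INFINITY;
        for (int i = 0; i < visible; ++i) m = fmaxf(m, scores[i]);
        float z = 0.f;
        for (int i = 0; i < visible; ++i)
        {
            scores[i] = expf(scores[i] - m);
            z += scores[i];
        }
        for (int i = 0; i < visible; ++i) scores[i] /= z;
    }
    __syncthreads();

    // Weighted sum of the visible values.
    for (int d = threadIdx.x; d < headDim; d += blockDim.x)
    {
        float acc = 0.f;
        for (int t = start; t < seqLen; ++t)
        {
            acc += scores[t - start] * vCache[(head * seqLen + t) * headDim + d];
        }
        out[head * headDim + d] = acc;
    }
}
```

Beyond correctness of the mask, the practical win is bandwidth: bounding `start` also bounds how much of the KV cache a decode step has to read at all.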
Name | Last commit | Last updated
beamSearchKernels/ | Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) | 2025-05-12 22:32:29 +02:00
communicationKernels/ | Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216) | 2025-05-22 03:42:36 +08:00
contextFusedMultiHeadAttention/ | Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564) | 2025-05-26 09:06:33 +08:00
cutlass_kernels/ | [feat] support fp8 blockscale gemm on sm89 (#4481) | 2025-05-23 10:39:10 +08:00
decoderMaskedMultiheadAttention/ | Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564) | 2025-05-26 09:06:33 +08:00
flashMLA/
fusedLayernormKernels/
groupRmsNormKernels/ | feat: Add heuristic for GroupRMSNorm kernel selection. (#4047) | 2025-05-13 08:52:53 +08:00
internal_cutlass_kernels/ | perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146) | 2025-05-21 10:00:36 +08:00
lora/
moeLoadBalance/ | feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384) | 2025-05-20 17:53:48 +08:00
selectiveScan/
speculativeDecoding/ | fix: Eagle decoding in TRT flow (#4229) | 2025-05-14 16:10:49 +02:00
trtllmGenKernels/ | Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564) | 2025-05-26 09:06:33 +08:00
unfusedAttentionKernels/ | feat: add CGA reduction fmha kernels on Blackwell. (#3763) | 2025-04-29 10:43:54 +08:00
userbuffers/
weightOnlyBatchedGemv/
attentionMask.cu
attentionMask.h
banBadWords.cu
banBadWords.h
banRepeatNgram.cu
banRepeatNgram.h
beamSearchKernels.cu | Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) | 2025-05-12 22:32:29 +02:00
beamSearchKernels.h | Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) | 2025-05-12 22:32:29 +02:00
buildRelativeAttentionBiasKernel.cu
buildRelativeAttentionBiasKernel.h
CMakeLists.txt | infra: open source fmha v2 kernels (#4185) | 2025-05-15 10:56:34 +08:00
cumsumLastDim.cu
cumsumLastDim.h
customAllReduceKernels.cu | chore: bump version to 0.19.0 (#3598) (#3841) | 2025-04-29 16:57:22 +08:00
customAllReduceKernels.h | feat: Low Precision Allreduce for PCIe based GPU (#4344) | 2025-05-20 06:53:46 +08:00
decoderMaskedMultiheadAttention.cu
decoderMaskedMultiheadAttention.h | fix: [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled (#4101) | 2025-05-13 14:58:54 +08:00
decoderMaskedMultiheadAttentionUtils.h
decodingCommon.cu
decodingKernels.cu | Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979) | 2025-05-12 22:32:29 +02:00
decodingKernels.h
delayStream.cu
delayStream.h
doraScaling.cu
doraScaling.h
fmhaDispatcher.cpp | Feat: add chunked-attention kernels on Blackwell (#4394) | 2025-05-21 10:16:46 +08:00
fmhaDispatcher.h
fusedQKNormRopeKernel.cu | perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482) | 2025-05-23 15:31:04 +08:00
fusedQKNormRopeKernel.h | perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482) | 2025-05-23 15:31:04 +08:00
gptKernels.cu
gptKernels.h | feat: add CGA reduction fmha kernels on Blackwell. (#3763) | 2025-04-29 10:43:54 +08:00
groupGemm.cu
groupGemm.h
kvCachePartialCopy.cu
kvCacheUtils.h
layernormKernels.cu
layernormKernels.h
logitsBitmask.cu
logitsBitmask.h
lookupKernels.cu
lookupKernels.h
lruKernel.cu
lruKernel.h
mambaConv1dKernels.cu
mambaConv1dKernels.h
mlaKernels.cu | [TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535) | 2025-05-23 19:47:50 +08:00
mlaKernels.h | [TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535) | 2025-05-23 19:47:50 +08:00
moeCommKernels.cu | feat: Add MNNVL MoE A2A support (#3504) | 2025-04-25 17:29:08 +08:00
moeCommKernels.h | feat: Add MNNVL MoE A2A support (#3504) | 2025-04-25 17:29:08 +08:00
multiHeadAttentionCommon.h
noAuxTcKernels.cu
noAuxTcKernels.h
penaltyKernels.cu
penaltyKernels.h
penaltyTypes.h
preQuantScaleKernel.cu | [TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123) | 2025-05-14 15:48:07 +08:00
preQuantScaleKernel.h | [TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123) | 2025-05-14 15:48:07 +08:00
qserveGemm.h
qserveGemmPerChannel.cu
qserveGemmPerGroup.cu
quantization.cu
quantization.cuh
quantization.h
recoverFromRingAtten.cu | Support RingAttention in the BertAttention plugin and the DiT model (#3661) | 2025-05-09 08:06:54 +08:00
recoverFromRingAtten.h | Support RingAttention in the BertAttention plugin and the DiT model (#3661) | 2025-05-09 08:06:54 +08:00
rmsnormKernels.cu
rmsnormKernels.h
sageAttentionKernels.cu
sageAttentionKernels.h
samplingAirTopPKernels.cu
samplingTopKKernels.cu
samplingTopKKernels.h
samplingTopPKernels.cu
samplingTopPKernels.h
splitkGroupGemm.cu
splitkGroupGemm.h
stopCriteriaKernels.cu
stopCriteriaKernels.h
topkLastDim.cu
topkLastDim.h
unfusedAttentionKernels.cu
unfusedAttentionKernels.h
xqaDispatcher.cpp | Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564) | 2025-05-26 09:06:33 +08:00
xqaDispatcher.h
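A second recurring theme in the listing is rotary position embedding: the fusedQKNormRopeKernel entries above fuse q/k RMSNorm with RoPE for Qwen3. As a refresher on the rotary part alone, here is a standalone sketch (the `applyRope` kernel name and signature are hypothetical; this uses the interleaved-pair convention, and real implementations differ on pairing layout and fuse the norms in):

```cuda
// Hypothetical standalone RoPE kernel, rotary part only. One thread handles
// one (token, dimension-pair): rotate (x[2i], x[2i+1]) by the angle
// pos * thetaBase^(-2i/headDim).
#include <cuda_runtime.h>
#include <math.h>

__global__ void applyRope(
    float* x,             // [numTokens, headDim] q or k vectors, rotated in place
    int const* positions, // [numTokens] absolute position of each token
    int numTokens, int headDim, float thetaBase /* typically 10000.f */)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    int const pairsPerToken = headDim / 2;
    if (idx >= numTokens * pairsPerToken) return;

    int const token = idx / pairsPerToken;
    int const pair = idx % pairsPerToken;

    // Rotation angle for this dimension pair at this token's position.
    float const freq = powf(thetaBase, -2.f * pair / headDim);
    float const angle = positions[token] * freq;
    float const c = cosf(angle), s = sinf(angle);

    // Standard 2-D rotation of the (even, odd) component pair.
    float* v = x + token * headDim + 2 * pair;
    float const x0 = v[0], x1 = v[1];
    v[0] = x0 * c - x1 * s;
    v[1] = x0 * s + x1 * c;
}
```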