TensorRT-LLM/cpp/tensorrt_llm/kernels
qixiang-99 0d4d50a745
feat: no-cache attention in PyTorch workflow (#3085)
* init trtllm attn no cache

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix the seq_len issue and attention metadata preparation for the Qwen reward model test

fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove unnecessary debug logs and clean up commented code

refactor: update max_seq_len documentation and remove max_seq_len from the decoder model constructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: update the calculate_ref_result function to accept tensor inputs and a mask type, and enhance test_attention_no_cache to support FULL and CAUSAL masks

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
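
For context, a minimal C++ sketch of what such a reference computation checks. The actual calculate_ref_result helper is a PyTorch function in the test; the names RefMaskType and reference_attention below are hypothetical, and the sketch assumes a single head with aligned query/key positions. A CAUSAL mask restricts query i to keys 0..i, while FULL attends to every key.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical mask selector mirroring the FULL/CAUSAL cases exercised by the test.
enum class RefMaskType { FULL, CAUSAL };

// Single-head reference: q is [seqQ x headDim], k and v are [seqK x headDim] (row-major).
// Returns the attention output [seqQ x headDim]. No KV cache is involved.
std::vector<float> reference_attention(std::vector<float> const& q, std::vector<float> const& k,
    std::vector<float> const& v, int seqQ, int seqK, int headDim, RefMaskType mask)
{
    std::vector<float> out(seqQ * headDim, 0.f);
    float const scale = 1.f / std::sqrt(static_cast<float>(headDim));
    for (int i = 0; i < seqQ; ++i)
    {
        // CAUSAL: query i attends to keys 0..i only; FULL: all seqK keys are visible.
        int const validKeys = (mask == RefMaskType::CAUSAL) ? std::min(i + 1, seqK) : seqK;
        std::vector<float> scores(validKeys);
        float maxScore = -INFINITY;
        for (int j = 0; j < validKeys; ++j)
        {
            float dot = 0.f;
            for (int d = 0; d < headDim; ++d)
            {
                dot += q[i * headDim + d] * k[j * headDim + d];
            }
            scores[j] = dot * scale;
            maxScore = std::max(maxScore, scores[j]);
        }
        // Numerically stable softmax over the unmasked positions only.
        float denom = 0.f;
        for (int j = 0; j < validKeys; ++j)
        {
            scores[j] = std::exp(scores[j] - maxScore);
            denom += scores[j];
        }
        for (int j = 0; j < validKeys; ++j)
        {
            float const w = scores[j] / denom;
            for (int d = 0; d < headDim; ++d)
            {
                out[i * headDim + d] += w * v[j * headDim + d];
            }
        }
    }
    return out;
}
```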

* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment

Remove debug code.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
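
A minimal sketch of the encapsulation this describes, with illustrative member names (not the actual TensorRT-LLM fields): call sites ask useKVCache() instead of reading a KV-cache pointer or flag directly, so the no-cache path is decided in one place.

```cpp
#include <cstdint>

// Illustrative parameter struct; the real attention params differ.
struct AttentionParamsSketch
{
    // Null when running attention without a KV cache (the no-cache path added by this PR).
    void* kvCacheBlockOffsets = nullptr;
    int32_t tokensPerBlock = 0;

    // Single point of truth for "is a KV cache in use?", replacing direct member access.
    bool useKVCache() const
    {
        return kvCacheBlockOffsets != nullptr;
    }
};

// Callers then branch on params.useKVCache() rather than checking the raw pointer or flag.
```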

* refactor: Resolve comments for Python code

Simplify no-cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata.

Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* docs: Add is_dummy_attention field to attention metadata for simulation operations

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes

Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix rebase format issue

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: extend attention mask type handling in MHARunnerFixedParams

Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
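
A hedged sketch of the kind of mapping this fix extends. Only the enum names ContextAttentionMaskType and AttentionMaskType and the three added values come from the commit message; the remaining enumerators, the target choices, and the helper below are illustrative, not the actual TensorRT-LLM definitions.

```cpp
#include <stdexcept>

// Illustrative stand-ins for the two enums named in the commit message.
enum class AttentionMaskTypeSketch { PADDING, CAUSAL, BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE, CUSTOM_MASK };
enum class ContextAttentionMaskTypeSketch { PADDING, CAUSAL, CUSTOM_MASK };

// Before the fix, only a subset of AttentionMaskType values had a defined mapping; the
// bidirectional and block-sparse variants now map explicitly instead of falling through.
ContextAttentionMaskTypeSketch toContextMaskType(AttentionMaskTypeSketch type)
{
    switch (type)
    {
    case AttentionMaskTypeSketch::PADDING: return ContextAttentionMaskTypeSketch::PADDING;
    case AttentionMaskTypeSketch::CAUSAL: return ContextAttentionMaskTypeSketch::CAUSAL;
    // Newly covered types; the targets here are this sketch's choices, not the PR's.
    case AttentionMaskTypeSketch::BIDIRECTIONAL:
    case AttentionMaskTypeSketch::BIDIRECTIONALGLM: return ContextAttentionMaskTypeSketch::PADDING;
    case AttentionMaskTypeSketch::BLOCKSPARSE:
    case AttentionMaskTypeSketch::CUSTOM_MASK: return ContextAttentionMaskTypeSketch::CUSTOM_MASK;
    }
    throw std::invalid_argument("Unsupported attention mask type");
}
```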

* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams

Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
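
A minimal sketch of the switch-based setter this describes. Only the struct name TllmGenFmhaRunnerParams and the method name setAttentionMaskType come from the commit message; the enums and members below are illustrative, assumed for the sketch, and the point is the explicit error path for invalid types.

```cpp
#include <stdexcept>

// Illustrative enums; the real runner and plugin mask enums differ.
enum class MaskTypeSketch { PADDING, CAUSAL, SLIDING_WINDOW_CAUSAL, CUSTOM_MASK };
enum class RunnerMaskTypeSketch { DENSE, CAUSAL, SLIDING_WINDOW, CUSTOM };

struct TllmGenFmhaRunnerParamsSketch
{
    RunnerMaskTypeSketch mMaskType = RunnerMaskTypeSketch::DENSE;

    // Switch-case mapping with an explicit error path for mask types the runner cannot handle.
    void setAttentionMaskType(MaskTypeSketch type)
    {
        switch (type)
        {
        case MaskTypeSketch::PADDING: mMaskType = RunnerMaskTypeSketch::DENSE; break;
        case MaskTypeSketch::CAUSAL: mMaskType = RunnerMaskTypeSketch::CAUSAL; break;
        case MaskTypeSketch::SLIDING_WINDOW_CAUSAL: mMaskType = RunnerMaskTypeSketch::SLIDING_WINDOW; break;
        case MaskTypeSketch::CUSTOM_MASK: mMaskType = RunnerMaskTypeSketch::CUSTOM; break;
        default: throw std::invalid_argument("Invalid attention mask type for the FMHA runner");
        }
    }
};
```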

---------

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
2025-04-05 01:54:32 +08:00
beamSearchKernels feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133) 2025-04-02 12:31:28 +08:00
contextFusedMultiHeadAttention feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
cutlass_kernels [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
decoderMaskedMultiheadAttention fix: segfault in cudaDriverWrapper (#3017) 2025-04-02 08:55:19 +02:00
flashMLA Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
fusedLayernormKernels update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
internal_cutlass_kernels update internal_cutlass version.txt to d03df7b27 (#3279) 2025-04-04 15:50:03 +08:00
lora Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
selectiveScan Update TensorRT-LLM (#2820) 2025-02-25 21:21:49 +08:00
speculativeDecoding Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
trtllmGenKernels feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
unfusedAttentionKernels Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
userbuffers update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
weightOnlyBatchedGemv feat: Update cutlass (#2981) 2025-03-26 22:36:27 +08:00
allReduceFusionKernels.cu update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
allReduceFusionKernels.h update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
attentionMask.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
attentionMask.h Update TensorRT-LLM (#2363) 2024-10-22 20:27:35 +08:00
banBadWords.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banBadWords.h Update TensorRT-LLM (#2008) 2024-07-23 23:05:09 +08:00
banRepeatNgram.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
banRepeatNgram.h Update TensorRT-LLM (#1598) 2024-05-14 16:43:41 +08:00
beamSearchKernels.cu feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133) 2025-04-02 12:31:28 +08:00
beamSearchKernels.h feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133) 2025-04-02 12:31:28 +08:00
buildRelativeAttentionBiasKernel.cu Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
buildRelativeAttentionBiasKernel.h Update TensorRT-LLM (#1763) 2024-06-11 16:59:02 +08:00
CMakeLists.txt Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
cumsumLastDim.cu open source 7f370deb0090d885d7518c2b146399ba3933c004 (#2273) 2024-09-30 13:51:19 +02:00
cumsumLastDim.h Update TensorRT-LLM (#1725) 2024-06-04 20:26:32 +08:00
customAllReduceKernels.cu Update (#2978) 2025-03-23 16:39:35 +08:00
customAllReduceKernels.h perf: Add optimizations for deepseek in min latency mode (#3093) 2025-04-02 09:05:24 +08:00
decoderMaskedMultiheadAttention.cu Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
decoderMaskedMultiheadAttention.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decoderMaskedMultiheadAttentionUtils.h Update TensorRT-LLM (#2363) 2024-10-22 20:27:35 +08:00
decodingCommon.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
decodingKernels.cu refactor: Improve decoder finalize function (#3077) 2025-03-28 14:33:59 +08:00
decodingKernels.h refactor: Improve decoder finalize function (#3077) 2025-03-28 14:33:59 +08:00
delayStream.cu Update (#2978) 2025-03-23 16:39:35 +08:00
delayStream.h Update (#2978) 2025-03-23 16:39:35 +08:00
doraScaling.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
doraScaling.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
fmhaDispatcher.cpp feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
fmhaDispatcher.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
gptKernels.cu Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
gptKernels.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
groupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
groupGemm.h Update TensorRT-LLM (#2562) 2024-12-11 00:31:05 -08:00
kvCachePartialCopy.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
kvCacheUtils.h Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
layernormKernels.cu Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
layernormKernels.h Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
logitsBitmask.cu bitmask v3 (#3009) 2025-03-26 15:21:29 +08:00
logitsBitmask.h Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
lookupKernels.cu Update TensorRT-LLM (#1639) 2024-05-21 17:51:02 +08:00
lookupKernels.h Update TensorRT-LLM (#1639) 2024-05-21 17:51:02 +08:00
lruKernel.cu Update TensorRT-LLM (#1688) 2024-05-28 20:07:49 +08:00
lruKernel.h Update TensorRT-LLM (#1688) 2024-05-28 20:07:49 +08:00
mambaConv1dKernels.cu Update TensorRT-LLM (#2562) 2024-12-11 00:31:05 -08:00
mambaConv1dKernels.h Update TensorRT-LLM (#1954) 2024-07-16 15:30:25 +08:00
mlaKernels.cu Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
mlaKernels.h Update TensorRT-LLM (#2936) 2025-03-18 21:25:19 +08:00
multiHeadAttentionCommon.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
noAuxTcKernels.cu Update (#2978) 2025-03-23 16:39:35 +08:00
noAuxTcKernels.h Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
penaltyKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
penaltyKernels.h Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
penaltyTypes.h Update TensorRT-LLM (#1554) 2024-05-07 23:34:28 +08:00
preQuantScaleKernel.cu open source 3706e7395b9b58994412617992727c8ff2d14c9f (#2010) 2024-07-24 05:48:06 +08:00
preQuantScaleKernel.h Update TensorRT-LLM (#1274) 2024-03-12 18:15:52 +08:00
qserveGemm.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
qserveGemmPerChannel.cu Update TensorRT-LLM (#2532) 2024-12-04 21:16:56 +08:00
qserveGemmPerGroup.cu Update TensorRT-LLM (#2502) 2024-11-26 16:51:34 +08:00
quantization.cu update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
quantization.cuh update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
quantization.h update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
rmsnormKernels.cu Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
rmsnormKernels.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
sageAttentionKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
sageAttentionKernels.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingAirTopPKernels.cu Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
samplingTopKKernels.cu Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
samplingTopKKernels.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
samplingTopPKernels.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
samplingTopPKernels.h Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
splitkGroupGemm.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
splitkGroupGemm.h Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
stopCriteriaKernels.cu Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
stopCriteriaKernels.h open source 4dbf696ae9b74a26829d120b67ab8443d70c8e58 (#2297) 2024-10-08 12:19:19 +02:00
topkLastDim.cu Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
topkLastDim.h Update TensorRT-LLM (#2436) 2024-11-12 15:27:49 +08:00
unfusedAttentionKernels.cu fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
unfusedAttentionKernels.h fix: fix for cp > kvHeadNum (#3002) 2025-03-26 12:39:02 +08:00
xqaDispatcher.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
xqaDispatcher.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00