TensorRT-LLMs/cpp/tensorrt_llm/thop
qixiang-99 0d4d50a745
feat: no-cache attention in PyTorch workflow (#3085)
* init trtllm attn no cache

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix the seq_len issue and attn metadata prepare for qwen reward model test

fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove unnecessary debug logs and clean up commented code

refactor: update max_seq_len documentation and remove max_seq_len from the decoder model constructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
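
For readers unfamiliar with the two mask types named above, the following is a minimal sketch of a reference attention computation without a KV cache. It is not the repository's calculate_ref_result: the function name reference_attention, the single-head list-of-lists layout, and the string values "FULL" and "CAUSAL" are assumptions made for illustration only. With a FULL mask every token attends to every token; with a CAUSAL mask token i attends only to tokens 0..i.

```python
import math


def reference_attention(q, k, v, mask_type="CAUSAL"):
    """Hypothetical single-head reference attention (illustration only).

    q, k, v are lists of seq_len vectors (lists of floats).
    mask_type is "FULL" (bidirectional) or "CAUSAL" (lower triangular).
    """
    if mask_type not in ("FULL", "CAUSAL"):
        raise ValueError(f"unsupported mask type: {mask_type}")

    seq_len = len(q)
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)

    out = []
    for i in range(seq_len):
        # Key positions that query i is allowed to see.
        visible = range(i + 1) if mask_type == "CAUSAL" else range(seq_len)

        # Scaled dot-product scores against the visible keys.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in visible]

        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)

        # Weighted sum of the visible value vectors.
        row = [
            sum(w * v[j][d] for w, j in zip(weights, visible)) / total
            for d in range(len(v[0]))
        ]
        out.append(row)
    return out


q = k = v = [[1.0, 0.0], [0.0, 1.0]]
causal = reference_attention(q, k, v, "CAUSAL")
full = reference_attention(q, k, v, "FULL")
```

In this toy input the first CAUSAL output row equals v[0], because the first token sees only itself, while the first FULL output row mixes in v[1]. The last row is the same under both masks, since the last token sees the whole sequence either way.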

* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment

Remove debug code.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: Resolve comments for Python code

Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata

Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* docs: Add is_dummy_attention field to attention metadata for simulation operations

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes

Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix rebase format issue

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: extend attention mask type handling in MHARunnerFixedParams

Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams

Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

---------

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
2025-04-05 01:54:32 +08:00
..
allgatherOp.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
allreduceOp.cpp None - Add one-shot version for UB AR NORM FP16/BF16 (#2995) 2025-03-31 11:16:03 +08:00
attentionOp.cpp feat: no-cache attention in PyTorch workflow (#3085) 2025-04-05 01:54:32 +08:00
CMakeLists.txt Update (#2978) 2025-03-23 16:39:35 +08:00
convertSpecDecodingMaskToPackedMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
cublasScaledMM.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
cutlassScaledMM.cpp Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
deepseekAllreduceFusionOp.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
dynamicDecodeOp.cpp v1.2 (#3082) 2025-03-26 23:31:29 +08:00
dynamicDecodeOp.h Update TensorRT-LLM (#2783) 2025-02-13 18:40:22 +08:00
fmhaPackMaskOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
fp4BatchedQuantize.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
fp4Gemm.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
fp4Op.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
fp4Quantize.cpp update FP4 quantize layout (#3045) 2025-04-03 13:13:54 -04:00
fp8BlockScaleMoe.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
fp8BlockScalingGemm.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
fp8Op.cpp Update TensorRT-LLM (#2755) 2025-02-11 03:01:00 +00:00
fp8Quantize.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
gatherTreeOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
logitsBitmaskOp.cpp Update (#2978) 2025-03-23 16:39:35 +08:00
mambaConv1dOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
moeOp.cpp [feat] open source fp8_blockscale_gemm (#3071) 2025-04-02 12:12:52 +08:00
mtpOp.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
ncclCommunicatorOp.cpp Update TensorRT-LLM (#941) 2024-01-23 23:22:35 +08:00
ncclCommunicatorOp.h Update TensorRT-LLM (#787) 2024-01-02 17:54:32 +08:00
noAuxTcOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
parallelDecodeKVCacheUpdateOp.cpp Update TensorRT-LLM (#2582) 2024-12-16 21:50:47 -08:00
redrafterCurandOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
reducescatterOp.cpp Update TensorRT-LLM (#2792) 2025-02-18 21:27:39 +08:00
relativeAttentionBiasOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
selectiveScanOp.cpp Update TensorRT-LLM (#2873) 2025-03-11 21:13:42 +08:00
thUtils.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
thUtils.h Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00
userbuffersFinalizeOp.cpp Update (#2978) 2025-03-23 16:39:35 +08:00
weightOnlyQuantOp.cpp Update TensorRT-LLM (#2849) 2025-03-04 18:44:00 +08:00