TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Jinyang Yuan	dafc28fb85	fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863 )	2025-04-29 09:09:43 +08:00
qixiang-99	ecd621fb0a	feat: Add head size 72 support for QKV Preprocessing kernel (#3743 ) * refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow - Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels. - Added support for head size 72 in unfused attention kernels(QKVPreprocessing). - Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72). Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * update: Waive head_dim=72 test cases and enhance test representation - Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues. - Introduced a custom __repr__ method in the Scenario class for pytest substring match. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-04-25 11:07:40 -07:00
Pamela Peng	6cdfc54883	feat: Add FP8 support for SM 120 (#3248 ) * Allow FP8 on SM120 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix sm121 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix pre-commit Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * review update Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> --------- Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-04-14 16:05:41 -07:00
Bo Li	515dd0d78f	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 ) * fp8 kv + bf16 ctx MLA + fp8 gen MLA Use BF16 for context MLA. mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together. Allow mSM==90 for mFP8GenerationMLA==true. For FMHA, dataTypeKv should be FP8. For FP8 MLA generation, the output is still in BF16. Refine debug info for FMHA kernel metadata. Use inputType, outputType, SM together to hash kernel list. Add FP8 MLA generation FMHA kernel. Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel. Separate the implementation of fused_multihead_attention_v2.h to CPP and print some debug info if checkIfKernelExist fails. Refine debug info in fused_multihead_attention_v2.cpp Correct FP8 MLA metadata. New kernel provided by Yuxin, which outputs BF16. smem size is not set correctly, which will lead to illegal mem access. Yuxin fixed the error in FMHA MLA kernel: previously the BF16 isn't correctly written: some parts are repeatedly written, while some others are untouched. There are two bmm1 scales that should be set correctly. New kernel generated by Yuxin. Modificatiosn to common/attentionOp for FP8 MLA on Hopper using FMHA. Not necessary. If mFP8GenerationMLA, is_fp8_out is false, so mFP8ContextFMHA is false. Skip a check in fmhaDispatcher. Modifications in fmhaRunner: - Debug dump. - if (!isFP8GenerationMLA) skips a lot of flag setting. - TMA descriptor modification for qo (by Yuxin). Cleanup debug output. Clean up o tma descriptor modifications. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Resolve conflicts. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Apply the patch of FP8 FlashMLA and resolve conflicts. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compilation error. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compile error. Signed-off-by: Bo Li <bobboli0202@gmail.com> * pick blackwell support Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * Add copyright notice to fused_multihead_attention_v2.cpp. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Add license. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Add missing license. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Exclude building flashMLA kernels under sm90. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Revert "Exclude building flashMLA kernels under sm90." This reverts commit `f0c859d459`. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Use macro to skip compiling FlashMLA for non sm90 targets. Signed-off-by: Bo Li <bobboli0202@gmail.com> --------- Signed-off-by: Bo Li <bobboli0202@gmail.com> Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: Dylan Chen <ziqingc@nvidia.com> Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-07 15:14:13 +08:00
tburt-nv	7a659885e3	chore: remove usernames from comments (#3291 ) Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-05 13:44:28 +08:00
qixiang-99	0d4d50a745	feat: no-cache attention in PyTorch workflow (#3085 ) * init trtllm attn no cache Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: fix the seq_len issue and attn metadata prepare for qwen reward model test fix: fix minor bugs after rebase Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: remove unnecessary debug logs and clean up commented code refactor: update max_seq_len documentation and remove max_seq_len for decoder model contructor in PyTorchModelEngine Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment remove Debug code. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: Resolve comments for Python code Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * docs: Add is_dummy_attention field to attention metadata for simulation operations Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: add KVCacheParams to attention backend interface and import relevant metadata classes Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: fix rebase format issue Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: extend attention mask type handling in MHARunnerFixedParams Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: enhance attention mask type handling in TllmGenFmhaRunnerParams Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> --------- Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>	2025-04-05 01:54:32 +08:00
DylanChen-NV	1ac0566a93	fix: fix for cp > kvHeadNum (#3002 ) * fix for cp > kvHeadNum Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * fix for None kv_head_num Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> --------- Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-03-26 12:39:02 +08:00
Perkz Zheng	e9df23f815	fix: [MLA] fix the bug with fp8 MLA kernels on Blackwell. (#3008 ) * update cubins * update error message --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-03-25 18:03:29 +08:00
Kaiyu Xie	3aa6b11d13	Update TensorRT-LLM (#2936 ) * Update TensorRT-LLM --------- Co-authored-by: changcui <cuichang147@gmail.com>	2025-03-18 21:25:19 +08:00
Kaiyu Xie	9b931c0f63	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
Kaiyu Xie	77d7fe1eb2	Update TensorRT-LLM (#2849 ) * Update TensorRT-LLM --------- Co-authored-by: aotman <chenhangatm@gmail.com>	2025-03-04 18:44:00 +08:00
Kaiyu Xie	ab5b19e027	Update TensorRT-LLM (#2820 )	2025-02-25 21:21:49 +08:00
Kaiyu Xie	2ea17cdad2	Update TensorRT-LLM (#2792 ) * Update TensorRT-LLM --------- Co-authored-by: jlee <jungmoolee@clika.io>	2025-02-18 21:27:39 +08:00
Kaiyu Xie	e88da961c5	Update TensorRT-LLM (#2783 )	2025-02-13 18:40:22 +08:00
Dan Blanaru	16d2467ea8	Update TensorRT-LLM (#2755 ) * Update TensorRT-LLM --------- Co-authored-by: Denis Kayshev <topenkoff@gmail.com> Co-authored-by: akhoroshev <arthoroshev@gmail.com> Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com> Update	2025-02-11 03:01:00 +00:00

15 Commits