TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
liji-nv	ff4212377c	[fix] Fix illegal mem access and possible accuracy lose (#4943 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-08 11:19:42 +08:00
Robin Kobus	20425deb3b	[https://nvbugs/5238105 ] fix: ModelRunnerCpp num_return_sequences (#3951 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-06 12:31:11 +02:00
Gabriel Wu	df0aeae0cd	Fix DeepGEMM NVCC Path (#4886 ) Signed-off-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com>	2025-06-05 11:55:37 +08:00
Daniel Cámpora	64d5eba9c7	Fix: max_num_sequences calculation with overlap scheduling into release/0.20 (#4889 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-04 22:33:12 +08:00
Yechan Kim	565abb6887	fix: [nvbugs/5298600] fix illegal memory access on mrope_position_deltas (#4830 ) Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>	2025-06-03 14:56:50 +08:00
Faraz	10d5af06e0	[NVBUG-5291971] JIT path for XQA (#4675 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-06-02 16:24:59 +02:00
Pamela Peng	52465216f4	[https://nvbugs/5295389 ][fix]fix moe fp4 on sm120 (#4624 ) Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>	2025-05-29 09:50:47 -07:00
Robin Kobus	7c1565a2b6	[nvbugs/5274894] fix: Sort requests for functional correctness and performance (#4608 ) * Revert "[nvbugs/5274894] fix: Moving finished context requests to generation (#4576)" This reverts commit `d39bcb6b40`. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Sort requests for functional correctness and performance - Moved sorting related logic to a dedicated function for better clarity and maintainability. - Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by Lora task ID. - Updated function documentation to reflect the sorting behavior and its purpose. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-23 15:08:54 +02:00
Barry Kang	9e15c035a7	Update internal cutlass kernels commit id (#4619 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-05-23 20:07:41 +08:00
Barry Kang	26793e3569	[https://nvbugs/5289907 ][fix] Restore per-channel pre-quant (#4545 ) * Restore per-channel pre-quant Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> * Update TRT test script Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> * Fix pre-commit Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> --------- Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-05-23 19:46:53 +08:00
Robin Kobus	d39bcb6b40	[nvbugs/5274894] fix: Moving finished context requests to generation (#4576 ) fix: Moving finished context requests to generation - Unfinished chunked context requests appear at end of context requests vector. - Replaced std::find_if with std::partition to find the correct position to move finished context requests to generation. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-22 17:49:40 +02:00
kanghui0204	6f3922f318	feat: Low Precision Allreduce for PCIe based GPU (#4344 ) This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process. Signed-off-by: Hui Kang <hkang@nvidia.com> Co-authored-by: Hui Kang <hkang@nvidia.com>	2025-05-20 06:53:46 +08:00
Yuxian Qiu	c8e062bfd3	fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-05-19 14:25:36 -07:00
Perkz Zheng	1c5b0d6a13	[Feat] add chunked-attention kernels on Hopper (for llama4) (#4291 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add mtp for fmha_v2 MLA kernels and add chunked-attention support for hopper fmha kernels Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-05-19 09:57:10 -07:00
Faraz	7656af1b57	[TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) (#4335 ) * add mixtral7x8b fp8 test with fixed cutlass fp8 moe gemm Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * update cutlass versions Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * added internal cutlass with fix and docker update Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * added mixtral to pro 6000 Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> --------- Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-05-19 08:56:21 -07:00
liji-nv	58e405624a	[https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 (#3952 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-05-19 22:12:25 +08:00
Dom Brown	c45f414bbf	Test: Improve model re-use in C++ DGX tests for CI stability (#4263 ) * Fix padded vocab size for Llama Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Refactor multi GPU llama executor tests, and reuse the built model engines Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Fix test list typo Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * WIP Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Further WIP Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * WIP Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Update test lists and readme Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Try parametrize for asymmetric Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Parametrize + skip unsupported combinations Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com> * Update test list Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com> * Reduce environment duplicated code Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com> --------- Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>	2025-05-19 14:20:21 +01:00
Shi Xiaowei	df2798e0c3	feat: NIXL interface integration (#3934 ) NIXL interfaces Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-05-19 18:18:22 +08:00
Void	62bb7f9286	fix potential issues in allreduce fusion kernel and ut (#4226 ) fix allreduce fuison kernels and ut Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> --------- Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>	2025-05-19 17:38:29 +08:00
Jinyang Yuan	b618e1f55b	perf: Eliminate the need for attention DP padding when possible (#3439 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: raccoonliukai <raccoonliu@tencent.com>	2025-05-17 13:30:55 +08:00
Robin Kobus	4e370a509a	refactor: Copy sequence lengths once in decoder setup (#4102 ) * refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update DecoderInputBuffers to remove duplicated buffers - Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability. - Adjusted references in generateRequestOptions.cpp to align with the new buffer structure. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move getEmbeddingBias to anonymous namespace Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Filter context requests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: GenerateRequestOptions using more fine-grained functions - Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests. - Updated the `operator()` method to utilize the new method, improving code clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMDecoder - Updated the `generate_request_options` call. - Updated the `make_decoding_batch_input_output` call. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove const where we modify input buffers - Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications. - Updated related function calls to ensure compatibility with the new parameter types. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-16 22:03:55 +08:00
Nikita Korobov	fa3879629e	feat: TRT-LLM Gen integration for BMM and MoE refactoring (#4280 ) - Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator. - Refactors TRT-LLM Gen MoE runner to call to BMM interface - The accuracy is verified for DeepSeek R1 FP4 Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-05-16 13:31:53 +02:00
ixlmar	f7ad49bb9b	chore: improve log-level setting UX (#4352 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-16 09:47:44 +01:00
Yuan Tong	f5ddb7ab4a	fix: support TensorRT 10.11+ in FindTensorRT.cmake (#4353 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-05-16 14:04:56 +08:00
NVJiangShao	6cc3f2093a	Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow (#4348 ) Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>	2025-05-16 10:02:30 +08:00
Erin	c44cf34373	fix: update checks that broke medusa tests when use_py_session=True (#4339 ) fix check Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-15 15:47:28 -07:00
yuxianq	4f8afe4cc6	feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-16 04:16:53 +08:00
yuxianq	0e87fcc228	refactor: use x is None instead of x == None. (#4244 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-15 20:00:04 +08:00
Yuan Tong	593f65ff6a	fix: better method to help torch find nvtx3 (#4110 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-05-15 16:42:30 +08:00
zhhuang-nv	97bc680cd8	feat: support kv cache reuse for MLA (#3571 ) * support kv cache reuse for MLA load compressed_kv and k_pe and do up-projection use 192/128 head size MLA context kernel support Blackwell and Hopper now Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * add CI test Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2 Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * use GPTJ style RoPE for MLA Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix rebase error and some docs Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix kv_lens Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * tiny fix Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: use normal device memory instead of pinned memory for unit test Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * fix L0 tests Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile after rebase Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments again Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com> Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-05-15 15:22:21 +08:00
Zhanrui Sun	5dc3b539ba	infra: Down the gcc toolset version from 13 to 11 (#4114 ) * Down the gcc toolset version from 13 to 11 Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> * Update rocky8 images Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> --------- Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-05-15 11:08:51 +08:00
qsang-nv	0fd59d64ab	infra: open source fmha v2 kernels (#4185 ) * add fmha repo Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix format Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix code style Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix header Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix header kernel_traits.h Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * add .gitignore file Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * add SLIDING_WINDOW_ATTENTION Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix style Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix format Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * update setup.py Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * update build_wheel.py Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> --------- Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> Signed-off-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>	2025-05-15 10:56:34 +08:00
QI JUN	498ce8a056	Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340 ) Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)" This reverts commit `5e634dd1bd`.	2025-05-15 09:52:39 +08:00
hlu1	7fb0af9320	[fix] Remove stale cublas heuristics (#4326 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-14 17:35:51 -07:00
Robin Kobus	d31fefde2c	[TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092 ) * chore: Remove GptSession/V1 from TRT workflow Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove stateful decoders Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession buffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession utils Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession kernels Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove V1 GPT models from tests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove gptSessionBenchmark from scripts and docs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove gptSession IO classes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession from test lists Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession from docs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove useless encoder test Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove mActualBatchSize from DecoderState Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove static batching from ExecutorTest - Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter. - Adjusted related test functions to reflect the changes in parameter lists. - Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-14 23:10:04 +02:00
Robin Kobus	c67da1fbaa	fix: Eagle decoding in TRT flow (#4229 ) * fix: EagleBuffers lifetime issue Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Clean up Eagle kernel parameters Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Eagle draft tokens init Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Add check for updated sequence length in TrtGptModelInflightBatching Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Skip check for beam search Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-14 16:10:49 +02:00
DylanChen-NV	206f82115d	[bug/5247505] fix: CP accuracy on Blackwell (#4188 ) * fix xqa params for cp Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * add test Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * add test Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * try adding B200 multi gpu test Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * add accuracy tests for cp Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> --------- Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-05-14 17:40:50 +08:00
kanghui0204	5e634dd1bd	feat: Low Precision Allreduce for PCIe based GPU (#3851 ) This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process. Signed-off-by: Hui Kang <hkang@nvidia.com> Co-authored-by: Hui Kang <hkang@nvidia.com>	2025-05-14 16:45:43 +08:00
Barry Kang	20b42912ce	[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123 ) Support DeepSeek-R1 W4A8 on Hopper Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-05-14 15:48:07 +08:00
Perkz Zheng	e8d7834c50	fix: [https://nvbugspro.nvidia.com/bug/5238626 ] illegal memory address when running llama 4 with cuda graph enabled (#4101 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-13 14:58:54 +08:00
pcastonguay	9643be5f20	[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156 ) * feat: Add per-request stats support with PyT backend Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Adding unit test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing stats unit test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing test with overlap Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-12 21:35:15 -04:00
Simeng Liu	286a789549	feat: Add heuristic for GroupRMSNorm kernel selection. (#4047 ) * feat: Add heuristic for GroupRMSNorm kernel selection. Implements a logistic regression model to dynamically select between: - GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions (better SM occupancy in most cases) - GroupRMSNormLargeBatch: Allocates warps proportional to max dimension (better block scheduling in large batch scenarios) Selection heuristic considers batch size, allocated warps, and scheduling efficiency on the current GPU architecture. Models for Compute Capability 9.x and 10.x are trained base on nsys kernel runtime data. The default kernel selection is the base kernel. The python operator group_rms_norm will use the heuristic by default. User can pick to use the base or large batch kernels as well. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Address the comments. Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-13 08:52:53 +08:00
wili	eba3623a54	Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979 ) * feat/vbws-part4-v1.8: rebase Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * feat/vbws-part4-v1.9: fix incorrect output when using short output length Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.1: remove useless variables Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.2:fix incorrect output when using short output length Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.3: rebase Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.4: rebase Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.5: remove API change Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> --------- Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-05-12 22:32:29 +02:00
Yixin Dong	c90ebadd84	feat: Support the Structural Tag in guided decoding (#4066 ) * finish Signed-off-by: Ubospica <ubospica@gmail.com> * update Signed-off-by: Ubospica <ubospica@gmail.com> * update Signed-off-by: Ubospica <ubospica@gmail.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * exc overlap scheduler Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * add test Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix api ref Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Ubospica <ubospica@gmail.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-05-12 17:24:50 +08:00
Perkz Zheng	3f29d2f006	Feat: support exporting softmax statistics and update the kernel-selection heuristic (#4155 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * support exporting softmax statistics and update the kernel-selection heuristic Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-12 15:31:46 +08:00
Dom Brown	2d0f93a054	Refactor: Restructure C++ tests for better modularisation of non-shared code (#4027 ) * Refactor: Restructure C++ tests for better modularisation of non-shared code Start cleanup of pytest code for C++ tests Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Clean up names and remove references to test_cpp.py Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> WIP Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Move multi-GPU code Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Update doc and try un-waiving Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Update multi GPU file check Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Address minor multi-GPU setup bug Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> --------- Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-05-09 19:16:51 +01:00
zhhuang-nv	0a36db0aa4	[fix] trtllm-gen mla kernel warnings (#4119 ) fix trtllm-gen mla kernel warnings Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-09 20:21:28 +08:00
NVJiangShao	57b2fe2019	[#4085 ][fix] Fix `apply_per_channel_scale` for extremely large input sequence length. (#4089 ) Fix apply_per_channel_scale for extremely large input seq length. Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: crazy-JiangDongHua <759421566@qq.com>	2025-05-09 11:57:01 +08:00
Yi Zhang	91bf5e6a8e	[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804 ) Add Piecewise CUDA Graph Support Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-05-09 11:04:01 +08:00
Yukun He	5b61486d87	chore: Clean up the legacy DeepseekAllreudceFusionOp. (#4081 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-09 10:20:41 +08:00

285 Commits