Commit Graph

462 Commits

Author · SHA1 · Message · Date
zhhuang-nv
a891013e3c
[feat] Optimize KV Cache Reuse for MLA (#4869)
Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-06-13 11:03:05 +08:00
Matthias Jouanneaux
a0b6c635b1
[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063)
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2025-06-13 06:00:02 +08:00
Xiaodong (Vincent) Huang
cc2a1344be
None: fix OOM because of unnecessary mha workspace (#5056)
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
2025-06-12 21:56:05 +02:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache (#4804)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
HuiGao-NV
43192379af
Use backend to replace macro to control enablement of MNNVL all reduce (#4635)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-06-12 11:22:49 +08:00
Zheng Duan
ee44fa00f8
chore: rename IOFormatter to BaseCacheFormatter (#5068)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-12 10:50:14 +08:00
Bo Li
1b79041f5d
fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. (#4264)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-06-11 09:38:10 +08:00
Tracin
6c91f1c7ac
Mxfp8xmxfp4 quant mode (#4978)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-10 22:01:37 +08:00
Zongfei Jing
6d1f2d0fd7
[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-06-10 19:55:16 +08:00
Aurelien Chartier
dcf72c6ad3
chore: cleanup GDS Cmake interface (#4928)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-10 17:25:43 +08:00
dongxuy04
7137cc8f67
fix cuda driver link issue with driver version less than 12.3 (#5025)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-06-10 15:27:39 +08:00
pcastonguay
87c56ab024
perf: Removing initializing ptuning buffers to zero (#4915)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-09 21:57:21 -04:00
Daniel Cámpora
d68b8180d3
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-10 07:28:34 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Dom Brown
9c012d5bf8
[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-06-09 11:02:48 +01:00
liji-nv
1d4f748773
[fix] Fix illegal mem access and possible accuracy loss. Cherry-pick … (#5017)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-09 17:50:57 +08:00
ChristinaZ
f45aff2b7d
Add customized renormalized moe routing kernel for moe cutlass backend (#4955)
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
2025-06-09 17:38:50 +08:00
Chuang Zhu
9a874760c1
Kv cache transfer support duplicate heads (#4929)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-09 14:11:19 +08:00
Chuang Zhu
947571c311
Fix buffer count (#5007)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-09 14:01:13 +08:00
Daniel Stokes
3a4851b7c3
feat: Add Mixture of Experts FP8xMXFP4 support (#4750)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-06-09 13:25:04 +08:00
Omer Ullman Argov
8731f5f14f
chore: Mass integration of release/0.20 (#4898)
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Hui Gao <huig@nvidia.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>
Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com>
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
Signed-off-by: moraxu <mguzek@nvidia.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: HuiGao-NV <huig@nvidia.com>
Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com>
Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-08 23:26:26 +08:00
dongxuy04
1e369658f1
feat: large-scale EP (part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
Jinyang Yuan
20d0649f19
[feat] Support XQA-based MLA on SM120 (#4858)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com>
Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>
2025-06-06 22:32:49 +08:00
Anthony Chang
eeb555e37b
chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826)
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
2025-06-06 16:13:54 +08:00
qsang-nv
180b91f957
update fmha_v2 (#4895)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-06-05 22:14:28 +08:00
dongjiyingdjy
51652b9b2b
feat: add PositionEmbeddingType=0 to xqa support (#4934)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-06-05 21:50:42 +08:00
Shiyu Li
b0d287c9b7
[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594)
Signed-off-by: Shiyu Li <shili@nvidia.com>
2025-06-04 18:26:13 -07:00
Zheng Duan
dd2191c5b3
fix: correct the order of llm request state (#4781)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 14:45:13 +08:00
Omer Ullman Argov
e71de2a13e
chore: Mass integration of release/0.20. (#4871)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-06-04 14:12:27 +08:00
Zheng Duan
ded694b1aa
feat: cache reuse support (selective cache transfer) in mla cache formatter (#4749)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-04 09:56:31 +08:00
ChristinaZ
d64af85e8c
Replace memset with data initialization within kernels (#4851)
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
2025-06-04 08:56:46 +08:00
Perkz Zheng
a089aa3225
[https://nvbugspro.nvidia.com/bug/5300080] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default (#4693)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-06-03 19:02:57 -04:00
Nikita Korobov
8043d7a03c
feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-06-03 14:07:54 -07:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched (#4700)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling (#4532)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
Tian Zheng
9832787050
[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-06-03 10:00:17 +08:00
Yilin Fan
90aab0596e
[fix] Fix Llama4 guardwords failures (#4844)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-02 13:43:42 -07:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP (part 5: Static EP load balancer with offline statistics) (#4695)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Netanel Haber
2ce05c3ab4
'entered copyBlock' format string expects %s, pass string rather than int (#4820)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-01 08:54:33 -07:00
tomeras91
bf9cd11fd4
[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-06-01 13:56:44 +03:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Enwei Zhu
25dde49c28
fix: EP load balancer with MTP layer and route offset by EP rank (#4767)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-01 00:07:44 +08:00
Dom Brown
338d6e9f95
[nvbug 5305210] fix: Resolve nvbug 5305210 (#4759)
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-31 19:21:06 +08:00
Chuang Zhu
f117d6abe9
Fabric Memory for KV Cache Transfer (#4717)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-30 15:50:21 +08:00
Thor Johnsen
55d56f8155
[JIRA-5226219][fix] Fix Bug in KV cache manager (#4596)
Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
2025-05-29 22:03:20 -07:00
Jinyang Yuan
5339d367ce
[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-05-30 09:03:52 +08:00
Yilin Fan
31bb650298
Cherry pick feat/llama4 to main (#4739)
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-05-30 05:28:40 +08:00
Robin Kobus
79a94a28f9
refactor: unique_ptr instead of shared_ptr (#4697)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-29 22:49:35 +02:00
Jhao-Ting Chen
fcadce9f8d
[fix] Eagle-2 LLMAPI pybind argument fix. (#3967)
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-29 12:23:25 -07:00
Arthur Rasmusson
812b1abf86
feature: KV Cache GPUDirect Storage (#3209)
Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-28 23:27:43 +00:00
Robin Kobus
12763779c4
chore: Clean up cpp runtime (#4449)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-28 16:32:59 +02:00
ixlmar
fbe4db207d
feat: forward exceptions to Python and catch OOMs (#4497)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-28 11:58:10 +02:00
Kaiyu Xie
b800adc65c
Fix: hang on disagg when MNNVL two-shot AllReduce is enabled (#4678)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-28 13:03:53 +08:00
yunruis
29ac4c20e0
fix: fix dsr1 min lat cga ar rate drop (0.2) (#4561)
Signed-off-by: yunruis <yunruis@nvidia.com>
2025-05-27 21:59:57 +08:00
Perkz Zheng
40a7161f4f
fix: fmha_v2 compilation (#4659)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-27 17:39:39 +08:00
qsang-nv
157fe62965
fix fmha v2 tests (#4661)
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-05-27 09:47:01 +08:00
Robin Kobus
93a54457ac
[nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) (#4621)
- Moved sorting related logic to a dedicated function for better clarity and maintainability.
- Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by LoRA task ID.
- Updated function documentation to reflect the sorting behavior and its purpose.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-26 17:10:55 +08:00
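
A rough sketch of the sorting scheme described in the commit above, using hypothetical names rather than the actual TensorRT-LLM batch-manager types: finished context requests are partitioned from ongoing ones, then grouped by LoRA task ID so requests sharing an adapter end up adjacent.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Request:
    request_id: int
    context_finished: bool
    lora_task_id: Optional[int] = None

def sort_requests(requests: List[Request]) -> List[Request]:
    # Stable partition: finished context requests move to the front,
    # ongoing requests keep their relative order behind them.
    finished = [r for r in requests if r.context_finished]
    ongoing = [r for r in requests if not r.context_finished]
    # Group finished requests by LoRA task ID so requests that share an
    # adapter end up adjacent (requests without a LoRA task sort first).
    finished.sort(key=lambda r: (r.lora_task_id is not None, r.lora_task_id or 0))
    return finished + ongoing

reqs = [Request(0, True, 7), Request(1, False), Request(2, True, 3), Request(3, True)]
print([r.request_id for r in sort_requests(reqs)])  # [3, 2, 0, 1]
```
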
Robin Kobus
502758aaa9
fix: Handle additional model outputs based on pipeline parallel rank (#4498)
- Only allocate additional outputs on last pipeline parallel rank in trtGptModelInflightBatching and executorImpl.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-26 09:04:40 +02:00
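
A minimal sketch of the idea in the commit above, assuming a plain pp_rank/pp_size interface: buffers for additional model outputs are allocated only on the last pipeline-parallel rank, the only rank that actually produces them.

```python
import torch

def allocate_additional_outputs(pp_rank: int, pp_size: int,
                                max_tokens: int, hidden_size: int):
    # Intermediate pipeline stages never produce these outputs, so
    # allocating the buffers there would only waste memory.
    if pp_rank != pp_size - 1:
        return None
    return torch.empty(max_tokens, hidden_size)

print(allocate_additional_outputs(0, 4, 8, 16))        # None on rank 0
print(allocate_additional_outputs(3, 4, 8, 16).shape)  # torch.Size([8, 16])
```
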
Zheng Duan
ce7f5fae5a
sort llm request state (#4607)
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-26 13:47:01 +08:00
Perkz Zheng
4d711be8f4
Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564)
* move cubins to LFS

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add sliding-window-attention generation-phase kernels on Blackwell

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-26 09:06:33 +08:00
shaharmor98
2b8f6d2871
Fix snake case format (#4559)
fix snake case format

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-25 17:57:17 +08:00
Chuang Zhu
b60846b47d
fix datatype check (#4606)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-24 08:36:17 +08:00
Yao Yao
ef763b0ddc
fix: rename some terms (#4534)
Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
2025-05-23 23:23:49 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Consolidate request generation in CreateNewDecoderRequests

- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Simplify request handling in CreateNewDecoderRequests

- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests

- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests

- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters

- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests

- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
zhhuang-nv
8452775db8
[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535)
* optimize kv cache reuse workflow for MLA

write kv cache first and only call up-projection GEMM once
relax contiguous requirements of k/v for setting paged kv cache
return two contiguous tensors when loading MLA KV Cache

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* support fp8 kv cache for MLA kv cache reuse

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-23 19:47:50 +08:00
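
A rough torch sketch of the reuse flow outlined in the commit body above (and in the earlier MLA reuse commit #3571 further down this log): the cache holds the low-rank compressed_kv plus k_pe, the up-projection GEMM runs once over the whole reused span, and loading returns two contiguous tensors. Shapes and the weight layout are assumptions for illustration, not the actual kernel interface.

```python
import torch

num_tokens, kv_lora_rank, rope_dim = 8, 512, 64
num_heads, head_dim = 16, 128

compressed_kv = torch.randn(num_tokens, kv_lora_rank)  # written to cache first
k_pe = torch.randn(num_tokens, rope_dim)               # RoPE part, one head
w_up = torch.randn(kv_lora_rank, num_heads * 2 * head_dim)

def load_mla_kv(compressed_kv, k_pe, w_up):
    # One up-projection GEMM over the whole reused span, instead of one
    # call per cache block.
    kv = compressed_kv @ w_up
    k_nope, v = kv.view(num_tokens, num_heads, 2, head_dim).unbind(dim=2)
    # Return two contiguous tensors: K carries the nope + rope parts
    # (192 = 128 + 64 per head, matching the 192/128 context kernel), V is 128.
    k = torch.cat([k_nope, k_pe.unsqueeze(1).expand(-1, num_heads, -1)], dim=-1)
    return k.contiguous(), v.contiguous()

k, v = load_mla_kv(compressed_kv, k_pe, w_up)
print(k.shape, v.shape)  # torch.Size([8, 16, 192]) torch.Size([8, 16, 128])
```
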
Anthony Chang
bbea2647b1
Qwen3 supports TRTLLM FP4 MoE backend (#4530)
* MoE TRTLLM backend for Qwen3

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* add extra moe_backend to test

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* address comments

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* conditionally compile kernels on newer archs

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* missing positional arg

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* Update the routing kernels

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Revise usage of TLLM_LOG_ERROR

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* Add unit test for Qwen3 moe (trtllm_gen backend)

Signed-off-by: Christina Zhang <christinaz@nvidia.com>

* improve weight processing speed of moe_backend=TRTLLM; roughly 2x

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* tidy and minor fix

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

* temporarily disable accuracy test that has known issue

Signed-off-by: Anthony Chang <anchengc@nvidia.com>

---------

Signed-off-by: Anthony Chang <anchengc@nvidia.com>
Signed-off-by: Christina Zhang <christinaz@nvidia.com>
Co-authored-by: Christina Zhang <christinaz@nvidia.com>
2025-05-23 18:31:08 +08:00
Bo Li
9ae705af1b
perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482)
* Add Julien's original kernel.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Get rid of UpdateKVCache functionality.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add kernels.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add torch OP.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Update cmake.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Torch OP must use double as argument dtype.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add unittest.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add unittest.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Fix misaligned access when head_dim=64.
In this case, numElemsPerThread=2 and numVecPerThread=0, but the store code incorrectly performs vectorized stores, so some threads (e.g., lane 1) issue stores to addresses that are not aligned to 64 bits.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Remove unroll (compiler can do that).
Cleanup code.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add switch for interleave.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Refactor vectorized load/store.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Implement is_neox. Result not correct yet.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Fix is_neox=True.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

* Add q_weight and k_weight.

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

---------

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-05-23 15:31:04 +08:00
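
A back-of-the-envelope check of the head_dim=64 misalignment fixed in the commit above, assuming 32 threads per head and 16-bit elements (consistent with numElemsPerThread=2 in the commit message): with 64-bit (4-element) vectors, numVecPerThread rounds to 0 and lane 1's start address is only 32-bit aligned.

```python
WARP_SIZE, VEC_ELEMS, ELEM_BYTES = 32, 4, 2  # fp16/bf16 elements, 64-bit vectors

for head_dim in (64, 128, 256):
    elems_per_thread = head_dim // WARP_SIZE
    vecs_per_thread = elems_per_thread // VEC_ELEMS
    lane1_offset = elems_per_thread * ELEM_BYTES  # byte offset where lane 1 starts
    aligned = "64-bit aligned" if lane1_offset % 8 == 0 else "misaligned"
    print(f"head_dim={head_dim}: elems/thread={elems_per_thread}, "
          f"vecs/thread={vecs_per_thread}, lane 1 offset={lane1_offset}B ({aligned})")
```
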
djns99
87f734b563
[https://nvbugs/5297775] fix: Correct memory guard for large MOE tests to account for TP space (#4553)
fix: Correct memory guard for large MOE tests to account for TP space

Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-23 14:57:49 +12:00
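
The gist of the guard fix above, as a hedged sketch (helper name and numbers are illustrative): a large-MoE test's per-GPU footprint must divide the weight bytes by the tensor-parallel size before comparing against free memory.

```python
GiB = 1 << 30

def fits_on_gpu(total_weight_bytes: int, activation_bytes: int,
                free_bytes: int, tp_size: int) -> bool:
    # TP shards the expert weights, so each rank holds only 1/tp_size of them.
    per_rank_weights = total_weight_bytes // tp_size
    return per_rank_weights + activation_bytes <= free_bytes

# Without the TP correction a 40 GiB model looks too big for a 24 GiB GPU,
# even though with tp_size=2 each rank only holds ~20 GiB of weights.
print(fits_on_gpu(40 * GiB, 2 * GiB, 24 * GiB, tp_size=1))  # False
print(fits_on_gpu(40 * GiB, 2 * GiB, 24 * GiB, tp_size=2))  # True
```
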
CarstyYou
ef280e687e
[feat] support fp8 blockscale gemm on sm89 (#4481)
* [feat] integrate ada blockwise gemm

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [fix] align scale M

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [feat] swizzle mma output

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [test] add ut for sm89

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [delete] remove useless comments

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] codestyle

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [fix] fix review comments

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] fix license

Signed-off-by: CarstyYou <xiy@nvidia.com>

* [chore] fix license

Signed-off-by: CarstyYou <xiy@nvidia.com>

---------

Signed-off-by: CarstyYou <xiy@nvidia.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
2025-05-23 10:39:10 +08:00
nv-guomingz
e3a534d0ee
chore: guardword clean for header file. (#4540)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-23 10:08:14 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243)
* feat: Enabling disaggregated serving with the TRT backend and Python runtime

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing formatting

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
dongxuy04
338744fba6
fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer (#4573)
fix a possible MoE race condition and add a bypass so the worker thread skips iterations with no updates

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-23 09:24:23 +08:00
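
An illustrative reconstruction (not the actual MoeLoadBalancer code) of the two fixes named above: shared statistics are updated under a lock to close the race, and the worker thread bypasses an iteration entirely when no new updates are flagged.

```python
import threading

class LoadBalancerWorker:
    def __init__(self):
        self._lock = threading.Lock()
        self._has_updates = threading.Event()
        self._stats = {}

    def record_stats(self, expert_id: int, tokens: int):
        with self._lock:  # serialize writers against the worker (the race fix)
            self._stats[expert_id] = self._stats.get(expert_id, 0) + tokens
        self._has_updates.set()

    def worker_step(self):
        if not self._has_updates.is_set():
            return None  # bypass: no new stats, skip the whole iteration
        self._has_updates.clear()
        with self._lock:
            snapshot = dict(self._stats)
        return snapshot  # a rebalance would be computed from this snapshot

w = LoadBalancerWorker()
print(w.worker_step())                 # None: nothing to do yet
w.record_stats(expert_id=3, tokens=128)
print(w.worker_step())                 # {3: 128}
```
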
nv-guomingz
3549b68c1c
chore: clean useless flag (#4567)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-23 07:05:15 +08:00
Mike Iovine
9c0de251db
[feat] Integrate Hopper chunked attention kernels (#4330)
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix cache key

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

* Fix lint

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>

---------

Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-22 17:10:57 -04:00
Chuang Zhu
558eaecf16
fix sequence data race (#4565)
stash for debugging broken promise

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 23:13:48 +08:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL (#4125)
* agentConnection

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

recv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

agentState

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

NIXL interfaces

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update cmakelists

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

nixl improve

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

remove cppzmq

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

transferAgent remove register

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

work for cache Test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

reduce sleep time

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

integrate

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

nixl env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix rebase error

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

cpp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

stash for send metaData

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

test_env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

avoid port conflict in test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use std::string

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* typo

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Nikita Korobov
e1b42be3d1
fix: TRT-LLM Gen dtype declaration (#4503)
Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-21 23:56:37 +02:00
Zongfei Jing
dbaddb3a29
Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216)
* Adding two-shot allreduce kernel and mnnvl multicasting buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

Adding comments

Signed-off-by: Shiyu Li <shili@nvidia.com>

Add unittest of the twoshot kernel.

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update dispatch logic

Signed-off-by: Shiyu Li <shili@nvidia.com>

Use cpu barrier instead of GPU at init

Signed-off-by: Shiyu Li <shili@nvidia.com>

Merge dispatch logic fix

Signed-off-by: Shiyu Li <shili@nvidia.com>

Update the kernel to use GPU-managed buffer

Signed-off-by: Shiyu Li <shili@nvidia.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix issue

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Clean up

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Simplify AllReduce interface

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix warning

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Tidy code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Rename

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Skip ut for no_fusion

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Refine

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Shiyu Li <shili@nvidia.com>
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
Co-authored-by: Shiyu Li <shili@nvidia.com>
2025-05-22 03:42:36 +08:00
Robin Kobus
cd0c826417
refactor: DisaggExecutorTest (#4398)
* chore: Improve formatting of DisaggExecutorTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Typed InstanceRole param in DisaggExecutorTest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Skip DisaggExecutorTest based on device count

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-21 18:01:45 +08:00
Perkz Zheng
6a35c599ef
Clean: fmha codes (#4496)
clean codes

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-21 11:45:47 +08:00
Ruoqian Guo
db7446fda7
Feat: add deep_gemm swapab Kernel (#4430)
* feat: add deepgemm_swapab

feat: add fp8_gemm_kernel_swapab

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

feat: set threshold for deepgemm and deepgemmswapab

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* docs: update README.md

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* fix: std::runtime_error needs #include <stdexcept>

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* chores: remove the redundant code

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* feat: support for dense deep_gemm swapab

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

* chores: remove redundant code

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>

---------

Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-05-21 10:48:43 +08:00
Shi Xiaowei
3d62727303
test: NIXL single process test (#4486)
2025-05-21 10:41:46 +08:00
Thor Johnsen
5d438be59a
[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936)
* v1.5

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

v1.5.4 Add back draft_overhead to spec dec stats

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v1.5.5: fix CI error

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.6: fix CI error 8196 > 8192

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* Address reviewer concerns

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* precommit run

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

* v2.0: Address reviewer concerns

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v2.1: add fix from wili

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* Revert changes that require use of TypeAlias because that requires python version >= 3.10

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>

---------

Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-21 10:40:00 +08:00
Perkz Zheng
426f6fd2bc
Feat: add chunked-attention kernels on Blackwell (#4394)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add chunked-attention kernels on blackwell

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

fix

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-21 10:16:46 +08:00
djns99
a030a898d1
perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146)
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
2025-05-21 10:00:36 +08:00
Robin Kobus
8564c5a41f
refactor: Unify request order in TRT and PyTorch workflow (#4096)
* chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! chore: Partition context requests in MicroBatchScheduler

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-20 18:49:27 +02:00
dongxuy04
21aff2e313
feat: large-scale EP (part 2: MoE Load Balancer - core utilities) (#4384)
* first commit of cpp moe loadbalance code

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python bindings for moe load balance

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add python wrapper, ut and bug fixes

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add binding for layerId and update binding test

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add host tensor sharing and ut

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-05-20 17:53:48 +08:00
kanghui0204
6f3922f318
feat: Low Precision Allreduce for PCIe based GPU (#4344)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-20 06:53:46 +08:00
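
A conceptual numpy simulation of the low-precision allreduce described above: each rank quantizes to int8 with a per-tensor scale, the smaller payloads are exchanged and summed, and the sum is dequantized. The real kernel's quantization scheme and PCIe transport differ; this only shows the bandwidth/accuracy trade.

```python
import numpy as np

def quantize(x):
    scale = float(np.abs(x).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def low_precision_allreduce(rank_tensors):
    # Each rank ships ~4x fewer bytes than fp32; accumulation stays in fp32.
    quantized = [quantize(t) for t in rank_tensors]
    return sum(q.astype(np.float32) * s for q, s in quantized)

ranks = [np.random.randn(1024).astype(np.float32) for _ in range(4)]
exact = sum(ranks)
approx = low_precision_allreduce(ranks)
print("max abs error:", float(np.abs(exact - approx).max()))
```
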
Yuxian Qiu
c8e062bfd3
fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-19 14:25:36 -07:00
Perkz Zheng
1c5b0d6a13
[Feat] add chunked-attention kernels on Hopper (for llama4) (#4291)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add mtp for fmha_v2 MLA kernels and add chunked-attention support for hopper fmha kernels

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-05-19 09:57:10 -07:00
Faraz
7656af1b57
[TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) (#4335)
* add mixtral7x8b fp8 test with fixed cutlass fp8 moe gemm

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

* update cutlass versions

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

* added internal cutlass with fix and docker update

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

* added mixtral to pro 6000

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>

---------

Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-05-19 08:56:21 -07:00
liji-nv
58e405624a
[https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 (#3952)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Dom Brown
c45f414bbf
Test: Improve model re-use in C++ DGX tests for CI stability (#4263)
* Fix padded vocab size for Llama

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Refactor multi GPU llama executor tests, and reuse the built model engines

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Fix test list typo

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Further WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Update test lists and readme

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Try parametrize for asymmetric

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Parametrize + skip unsupported combinations

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

* Update test list

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

* Reduce environment duplicated code

Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>

---------

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
2025-05-19 14:20:21 +01:00
Shi Xiaowei
df2798e0c3
feat: NIXL interface integration (#3934)
NIXL interfaces

Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-05-19 18:18:22 +08:00
Void
62bb7f9286
fix potential issues in allreduce fusion kernel and ut (#4226)
fix allreduce fusion kernels and ut

Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>

---------

Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>
2025-05-19 17:38:29 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible (#3439)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
Nikita Korobov
fa3879629e
feat: TRT-LLM Gen integration for BMM and MoE refactoring (#4280)
- Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator. 
- Refactors TRT-LLM Gen MoE runner to call to BMM interface
- The accuracy is verified for DeepSeek R1 FP4 

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-16 13:31:53 +02:00
ixlmar
f7ad49bb9b
chore: improve log-level setting UX (#4352)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 09:47:44 +01:00
Yuan Tong
f5ddb7ab4a
fix: support TensorRT 10.11+ in FindTensorRT.cmake (#4353)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-16 14:04:56 +08:00
NVJiangShao
6cc3f2093a
Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow (#4348)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>
2025-05-16 10:02:30 +08:00
Erin
c44cf34373
fix: update checks that broke medusa tests when use_py_session=True (#4339)
fix check

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-15 15:47:28 -07:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
Yuan Tong
593f65ff6a
fix: better method to help torch find nvtx3 (#4110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-15 16:42:30 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
Zhanrui Sun
5dc3b539ba
infra: Downgrade the gcc toolset version from 13 to 11 (#4114)
* Downgrade the gcc toolset version from 13 to 11

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update rocky8 images

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-15 11:08:51 +08:00
qsang-nv
0fd59d64ab
infra: open source fmha v2 kernels (#4185)
* add fmha repo

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix format

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix code style

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix header

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix header kernel_traits.h

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* add .gitignore file

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* add SLIDING_WINDOW_ATTENTION

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix style

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* fix format

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* update setup.py

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

* update build_wheel.py

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Signed-off-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>
2025-05-15 10:56:34 +08:00
QI JUN
498ce8a056
Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340)
Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)"

This reverts commit 5e634dd1bd.
2025-05-15 09:52:39 +08:00
hlu1
7fb0af9320
[fix] Remove stale cublas heuristics (#4326)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-05-14 17:35:51 -07:00
Robin Kobus
d31fefde2c
[TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092)
* chore: Remove GptSession/V1 from TRT workflow

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove stateful decoders

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession buffers

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession utils

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession kernels

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove V1 GPT models from tests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove gptSessionBenchmark from scripts and docs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove gptSession IO classes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession from test lists

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove GptSession from docs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove useless encoder test

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove mActualBatchSize from DecoderState

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove static batching from ExecutorTest

- Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter.
- Adjusted related test functions to reflect the changes in parameter lists.
- Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-14 23:10:04 +02:00
Robin Kobus
c67da1fbaa
fix: Eagle decoding in TRT flow (#4229)
* fix: EagleBuffers lifetime issue

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Clean up Eagle kernel parameters

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Eagle draft tokens init

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Add check for updated sequence length in TrtGptModelInflightBatching

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: Skip check for beam search

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-14 16:10:49 +02:00
DylanChen-NV
206f82115d
[bug/5247505] fix: CP accuracy on Blackwell (#4188)
* fix xqa params for cp

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* add test

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* add test

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* try adding B200 multi gpu test

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

* add accuracy tests for cp

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>

---------

Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-05-14 17:40:50 +08:00
kanghui0204
5e634dd1bd
feat: Low Precision Allreduce for PCIe based GPU (#3851)
This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process.

Signed-off-by: Hui Kang <hkang@nvidia.com>
Co-authored-by: Hui Kang <hkang@nvidia.com>
2025-05-14 16:45:43 +08:00
Barry Kang
20b42912ce
[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123)
Support DeepSeek-R1 W4A8 on Hopper

Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-14 15:48:07 +08:00
Perkz Zheng
e8d7834c50
fix: [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled (#4101)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-13 14:58:54 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Simeng Liu
286a789549
feat: Add heuristic for GroupRMSNorm kernel selection. (#4047)
* feat: Add heuristic for GroupRMSNorm kernel selection.

Implements a logistic regression model to dynamically select between:
- GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions
  (better SM occupancy in most cases)
- GroupRMSNormLargeBatch: Allocates warps proportional to max dimension
  (better block scheduling in large batch scenarios)

The selection heuristic considers batch size, allocated warps, and scheduling
efficiency on the current GPU architecture. Models for Compute Capability
9.x and 10.x are trained based on nsys kernel runtime data.
The default kernel selection is the base kernel.

The python operator group_rms_norm uses the heuristic by default;
users can also explicitly pick the base or large-batch kernels.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Address the comments.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-13 08:52:53 +08:00
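
A sketch of the selection described above, with made-up logistic-regression weights (the real per-architecture models are fit offline from nsys timings): the launch features go through a logistic score, the large-batch kernel is chosen when the predicted probability crosses 0.5, and the base kernel remains the default.

```python
import math

# Hypothetical per-architecture weights: (bias, w_batch_size, w_warps).
WEIGHTS = {(9, 0): (-4.0, 0.05, 0.01), (10, 0): (-3.5, 0.04, 0.01)}

def pick_kernel(batch_size: int, allocated_warps: int, cc=(9, 0)) -> str:
    if cc not in WEIGHTS:
        return "base"  # default: the base kernel
    bias, w_batch, w_warps = WEIGHTS[cc]
    score = bias + w_batch * batch_size + w_warps * allocated_warps
    p_large_batch = 1.0 / (1.0 + math.exp(-score))  # logistic regression
    return "large_batch" if p_large_batch > 0.5 else "base"

print(pick_kernel(batch_size=8, allocated_warps=64))    # base
print(pick_kernel(batch_size=256, allocated_warps=64))  # large_batch
```
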
wili
eba3623a54
Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979)
* feat/vbws-part4-v1.8: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* feat/vbws-part4-v1.9: fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.1: remove useless variables

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.2: fix incorrect output when using short output length

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.3: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.4: rebase

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

* v1.9.5: remove API change

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-05-12 22:32:29 +02:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding (#4066)
* finish

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* exc overlap scheduler

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix api ref

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Perkz Zheng
3f29d2f006
Feat: support exporting softmax statistics and update the kernel-selection heuristic (#4155)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* support exporting softmax statistics and update the kernel-selection heuristic

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-12 15:31:46 +08:00
Dom Brown
2d0f93a054
Refactor: Restructure C++ tests for better modularisation of non-shared code (#4027)
* Refactor: Restructure C++ tests for better modularisation of non-shared code

Start cleanup of pytest code for C++ tests

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Clean up names and remove references to test_cpp.py

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

WIP

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Move multi-GPU code

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Update doc and try un-waiving

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Update multi GPU file check

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

* Address minor multi-GPU setup bug

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

---------

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-05-09 19:16:51 +01:00
zhhuang-nv
0a36db0aa4
[fix] trtllm-gen mla kernel warnings (#4119)
fix trtllm-gen mla kernel warnings

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-09 20:21:28 +08:00
NVJiangShao
57b2fe2019
[#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. (#4089)
Fix apply_per_channel_scale for extremely large input seq length.

Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: crazy-JiangDongHua <759421566@qq.com>
2025-05-09 11:57:01 +08:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
Add Piecewise CUDA Graph Support

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00
Yukun He
5b61486d87
chore: Clean up the legacy DeepseekAllreduceFusionOp. (#4081)
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-09 10:20:41 +08:00
forrestl
9477661f4c
Support RingAttention in the BertAttention plugin and the DiT model (#3661)
support ring attn for bert_attention plugin and dit model

Signed-off-by: ChunhuanLin <lch_xdu@163.com>
2025-05-09 08:06:54 +08:00
chenfeiz0326
7f5716ef83
Cherry-pick trtllm-gen from feat/llama4 to main (#4086)
* feat: TRT-LLM Gen FP8 MoE Llama4

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* feat: TRT-LLM Gen llama4 MoE Top1 routing

Signed-off-by: Jiqun Tu <jtu@nvidia.com>

* feat: add per tensor FP8 TRT-LLM Gen GEMMs

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Update

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Add guard for routingIndicesClusterKernel

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

* Guard sm90+ for routingkernels

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

---------

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
Signed-off-by: Jiqun Tu <jtu@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
Co-authored-by: Nikita Korobov <nkorobov@nvidia.com>
Co-authored-by: Jiqun Tu <jtu@nvidia.com>
2025-05-08 14:13:01 -07:00
Yukun He
bb7bcc75c2
feat: Fallback to NCCL for various patterns when input size is large. (#4080)
* Fallback to NCCL for various patterns when input size is large.
Move the previous implementation to the C++ side.
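A sketch of the dispatch idea, with a made-up size threshold (the actual per-pattern checks live in the C++ all-reduce op):
```python
# Illustrative only: pick the custom fused all-reduce for small messages and
# fall back to NCCL once the input grows past a (made-up) threshold.
FALLBACK_BYTES = 8 * 1024 * 1024  # assumption, not the real cutoff

def pick_allreduce_backend(num_elements: int, dtype_bytes: int) -> str:
    msg_bytes = num_elements * dtype_bytes
    if msg_bytes >= FALLBACK_BYTES:
        return "nccl"          # large inputs: NCCL wins
    return "custom_fused"      # small inputs: fused kernel avoids extra launches
```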

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Revising.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-08 11:13:13 -07:00
nv-guomingz
4dfa3ccf43
chore: enhance the cmake experience by ignoring the additional semicolon (#3992)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-05-08 18:43:36 +08:00
Simeng Liu
bb766eca0a
feat: Reduce branch overhead in groupRMSNorm kernels (#4067)
* feat: Reduce branch overhead in groupRMSNorm kernels
* Fix a race condition on sm < 90 and avoid having all threads in a warp write to the same shared memory location.

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-08 00:55:27 +08:00
Yan Chunwei
0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
* beam_width and max_new_token

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove beam_width

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove min_length

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove return_num_sequences

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
Chuang Zhu
09a28becae
fix cache buffer (#3942)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-07 09:49:44 +08:00
Daniel Cámpora
c56a2aca46
fix: Properly get decoding mode according to same logic as cpp. (#4026)
* Properly get decoding mode according to same logic as cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Cross-reference getDecodingMode implementations between PyTorch and C++.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Better bindings for DecodingMode.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert to version in main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert configuration.py.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-06 21:53:17 +08:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
* disable overlap in encoder

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: overlap same batch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Improve early exit

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
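
A toy illustration of the overlap pattern described above, with hypothetical `forward_async`/`wait`/`append_tokens` names (the runtime's actual control flow is in C++):
```python
def generate_with_overlap(engine, batch, max_steps: int):
    """Toy loop: enqueue step n+1 before consuming step n's new tokens."""
    pending = engine.forward_async(batch)      # step 0
    for _ in range(max_steps - 1):
        nxt = engine.forward_async(batch)      # launch step n+1 early ...
        new_tokens = pending.wait()            # ... while step n completes
        batch.append_tokens(new_tokens)        # host bookkeeping overlaps GPU work
        pending = nxt
    batch.append_tokens(pending.wait())
```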

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
dominicshanshan
3ac6637005
fix: trtllm-serve hang in stress test and ds v3 stress parameter update (#3836)
* Remove the stdout pipe for genai-perf and make the stress time a public parameter.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* Update llmRequest based on comment.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

* launch process function refactor.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

---------

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-05-06 16:52:30 +08:00
Robin Kobus
e943ad5a2a
[https://nvbugs/5247414] fix: draft/target probs shape (#4055)
The shape was changed incorrectly when DecoderState was introduced.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 09:56:43 +02:00
Yuan Tong
4b6c19737b
feat: support adding internal cutlass kernels as a subproject (#3658)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-06 11:35:07 +08:00
brb-nv
5b1aeb6730
test: Test OOB access issue in penaltyKernel for endId=-1 (#4035)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-05-05 10:24:28 -07:00
Mike Iovine
8caf200322
[fix] Skip debugCheckSemaphores in stream capture mode (#4032)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-05 10:24:10 -07:00
Robin Kobus
ccff86068e
fix: request termination in pipeline parallelism (#3892)
* feat: Implement synchronous request termination in batch manager

- Added `terminateRequestSync` method to `TrtEncoderModel` and `TrtGptModelInflightBatching` for handling request termination in the next `forwardSync` call.
- Updated existing request termination logic to utilize the new synchronous method, ensuring generated tokens are cleared appropriately.
- Enhanced logging for clarity in token management during request processing.
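
A minimal Python sketch of this defer-until-sync pattern, assuming hypothetical method and member names:
```python
import threading

class InflightModel:
    """Illustrative only: requests flagged for termination are cleaned up
    at the start of the next forward_sync call."""
    def __init__(self):
        self._pending_terminations: set[int] = set()
        self._lock = threading.Lock()

    def terminate_request_sync(self, request_id: int) -> None:
        with self._lock:
            self._pending_terminations.add(request_id)  # defer the real work

    def forward_sync(self, active_requests) -> None:
        with self._lock:
            doomed = self._pending_terminations
            self._pending_terminations = set()
        for req in list(active_requests):
            if req.id in doomed:
                req.clear_generated_tokens()  # drop tokens produced after cancel
                active_requests.remove(req)
```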

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! feat: Implement synchronous request termination in batch manager

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: MockedModelCancelRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! feat: Implement synchronous request termination in batch manager

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: terminate with timeout

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! feat: Implement synchronous request termination in batch manager

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* docs: Update doc string for allottedTimeMs

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-05 21:51:41 +08:00
Robin Kobus
9f9edd783c
refactor: Introduce MpiTag enumeration and update MPI function signatures (#3893)
* refactor: Move executor recv functions into classes

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance MPI logging and error handling

- Updated MPI logging to include destination and tag information for better traceability during send and receive operations.
- Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests.
- Improved code structure for clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce MpiTag enumeration and update MPI function signatures

- Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability.
- Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags.
- Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity.
- Removed redundant MPI tag constants from several classes, streamlining the code.
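
The same idea in a short mpi4py sketch; the tag names and values below are illustrative, not the ones defined in `mpiTags.h`:
```python
from enum import IntEnum
from mpi4py import MPI

class MpiTag(IntEnum):
    # Illustrative values; the real enumeration lives in mpiTags.h.
    REQUEST = 100
    RESPONSE = 101
    TERMINATION = 102

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0 and comm.Get_size() > 1:
    # A named tag instead of a bare integer: type-safe and self-documenting.
    comm.send({"req_id": 7}, dest=1, tag=MpiTag.REQUEST)
elif comm.Get_rank() == 1:
    msg = comm.recv(source=0, tag=MpiTag.REQUEST)
```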

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Rename tags for consistency

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-04 13:24:29 +02:00
Robin Kobus
403370af62
refactor: Move ModelSpec to core library (#3980)
* refactor: Move ModelSpec from tests to core library

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move ModelSpec from runtime to separate dir

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use new bindings path and clean up

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Updated licenses

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Remove script_dir from path

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-04 01:39:09 +08:00
Daniel Cámpora
c7cf032b89
fix: Move all casters to customCasters. (#3945)
* Move all casters to customCasters.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use customCasters in all bindings.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added customCasters to userbuffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 19:08:28 +08:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

This MR provides two implementations:
- GroupRMSNormKernel: Optimized for small-to-medium batch sizes
- GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
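
A PyTorch reference for the grouped semantics (illustrative; the custom op's exact signature may differ):
```python
import torch

def group_rms_norm_ref(inputs, weights=None, eps=1e-6):
    """RMS-normalize each tensor over its last dimension."""
    outputs = []
    for i, x in enumerate(inputs):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        y = x * inv_rms
        if weights is not None:
            y = y * weights[i]  # optional per-input learned scale
        outputs.append(y)
    return outputs

a = torch.randn(8, 1024)
b = torch.randn(8, 512)  # same batch dim, different last dim
out_a, out_b = group_rms_norm_ref([a, b])
```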

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend (#3145)
* support lp in pytorch backend

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* fix tp

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
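
A generic shape of such a processor (the callback signature here is illustrative, not the backend's actual interface):
```python
import torch

class BanTokenProcessor:
    """Illustrative logits processor: masks one token id every step."""
    def __init__(self, banned_id: int):
        self.banned_id = banned_id

    def __call__(self, logits: torch.Tensor) -> torch.Tensor:
        logits[..., self.banned_id] = float("-inf")  # can never be sampled
        return logits
```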
2025-05-01 14:15:30 -07:00
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388)
* support return logprob in llmapi

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

update and add test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

stability test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* revert removal of old flag

Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
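
A hedged usage sketch based only on the names in the title (`logprobs`, `prompt_logprobs`); the exact parameter types belong to the LLM API docs:
```python
from tensorrt_llm import LLM, SamplingParams  # assumed import path

llm = LLM(model="/path/to/model")  # placeholder path
params = SamplingParams(
    max_tokens=32,
    logprobs=5,          # top-K logprobs per generated token (assumed int)
    prompt_logprobs=5,   # logprobs for prompt tokens as well (assumed int)
)
outputs = llm.generate(["Hello"], params)
```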

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
2025-05-01 12:47:14 -04:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS (#3865)
* add relaxed acceptance for DS R1 (see the sketch after this list)

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* clean and update docs

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* Modified based on review

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix mtp manager issue

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
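
As referenced above, a sketch of the relaxed-acceptance rule with hypothetical `topk`/`delta` knobs: a draft token is accepted if it falls in the target model's top-k and its log-probability is within delta of the best candidate:
```python
import torch

def relaxed_accept(draft_token: int, target_logprobs: torch.Tensor,
                   topk: int = 3, delta: float = 0.5) -> bool:
    """Accept draft_token if it is close enough to the target's best token."""
    top = torch.topk(target_logprobs, topk)
    if draft_token not in top.indices.tolist():
        return False  # not even in the target's top-k: reject
    gap = top.values[0] - target_logprobs[draft_token]
    return bool(gap <= delta)  # within delta of the best: accept
```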

---------

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
hlu1
1294ecb12f
Add attention workspace memory check (#3970)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-30 23:51:09 -07:00