TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-23 04:03:22 +08:00

Author	SHA1	Message	Date
Yuxian Qiu	bf691b3d28	feat: support packed weights in vanilla moe (#4719 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-29 06:24:24 +08:00
Yan Chunwei	5506f60037	chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs (#4603 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-05-28 18:43:04 +08:00
ixlmar	fbe4db207d	feat: forward exceptions to Python and catch OOMs (#4497 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-28 11:58:10 +02:00
amirkl94	fbec0c3552	Release 0.20 to main (#4577 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com> Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com> Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com> Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> Signed-off-by: Simeng Liu <simengl@nvidia.com> Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: moraxu <mguzek@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com> Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: stnie <82932102+stnie@users.noreply.github.com> Co-authored-by: Simeng Liu <109828133+SimengLiu-nv@users.noreply.github.com> Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-05-28 16:25:33 +08:00
Bo Li	9c4b8f66b4	feat: Integration of Fused QKNorm+RoPE. (#4611 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-05-28 11:20:45 +08:00
Shunkangz	6493401986	Fix handle cancel request for attentionDP (#4648 ) Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-05-28 11:04:02 +08:00
Yuxian Qiu	5700a4ffcd	feat: Add vanilla MOE. (#4682 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-28 10:44:14 +08:00
Yuxian Qiu	e538b0d95e	refactor: extract and reuse filter_weights. (#4681 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-27 19:48:01 +08:00
Lucas Liebenwein	5cdd6bb10f	[AutoDeploy] Increased Model Coverage Mass Migration Week 1 (#4468 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com> Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: sugunav14 <178320438+sugunav14@users.noreply.github.com> Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com> Co-authored-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-05-27 16:43:15 +08:00
QI JUN	1582361400	Chore: only pad one dummy request for attention dp scenario (#4664 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-05-27 14:56:22 +08:00
Enwei Zhu	88190faa34	feat: large-scale EP(part 4: Static EP load balancer integration) (#4615 ) * MoeLoadBalancerConfig Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * MoeLoadBalancer integration Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * config file Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * test Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * test Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-05-26 18:25:11 +08:00
QI JUN	44eb053b95	introduce RequestQueueItem class instead of using tuple (#4649 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-05-26 17:34:53 +08:00
QI JUN	4a81991b65	Chore: refine shutdown signal of PyExecutor (#4614 ) * refine shutdown signal of PyExecutor Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-05-26 11:14:54 +08:00
Yuxian Qiu	8f055f5d14	feat: Skip sampler for intermediate pp stages. (#4514 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-26 10:08:51 +08:00
shaharmor98	2b8f6d2871	Fix snake case format (#4559 ) fix snake case format Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-25 17:57:17 +08:00
Robin Kobus	7b2818a47b	refactor: CreateNewDecoderRequests (#4452 ) * refactor: CreateNewDecoderRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Consolidate request generation in CreateNewDecoderRequests - Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests. - Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options. - Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy. - Cleaned up associated includes and references throughout the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Simplify request handling in CreateNewDecoderRequests - Removed the generateRequestOptions method and integrated its logic directly into the operator() method. - Updated the request generation process to improve clarity and reduce redundancy. - Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests - Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling. - Removed redundant request generation logic from the operator() method, streamlining the process. - Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests - Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management. - Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality. - Cleaned up unnecessary includes and improved code organization for better maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters - Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder. - Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling. - Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests - Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency. - Updated bindings and method signatures for decoder stream handling. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-23 22:54:37 +08:00
zhhuang-nv	8452775db8	[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535 ) * optimize kv cache reuse workflow for MLA write kv cache first and only call up-projection GEMM once relax contiguous requirements of k/v for setting paged kv cache return two contiguous tensors when loading MLA KV Cache Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * support fp8 kv cache for MLA kv cache reuse Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-23 19:47:50 +08:00
Anthony Chang	bbea2647b1	Qwen3 supports TRTLLM FP4 MoE backend (#4530 ) * MoE TRTLLM backend for Qwen3 Signed-off-by: Anthony Chang <anchengc@nvidia.com> * add extra moe_backend to test Signed-off-by: Anthony Chang <anchengc@nvidia.com> * address comments Signed-off-by: Anthony Chang <anchengc@nvidia.com> * conditionally compile kernels on newer archs Signed-off-by: Anthony Chang <anchengc@nvidia.com> * missing positional arg Signed-off-by: Anthony Chang <anchengc@nvidia.com> * Update the routing kernels Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Revise usage of TLLM_LOG_ERROR Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Add unit test for Qwen3 moe (trtllm_gen backend) Signed-off-by: Christina Zhang <christinaz@nvidia.com> * improve weight processing speed of moe_backend=TRTLLM; roughly 2x Signed-off-by: Anthony Chang <anchengc@nvidia.com> * tidy and minor fix Signed-off-by: Anthony Chang <anchengc@nvidia.com> * temporarily disable accuracy test that has known issue Signed-off-by: Anthony Chang <anchengc@nvidia.com> --------- Signed-off-by: Anthony Chang <anchengc@nvidia.com> Signed-off-by: Christina Zhang <christinaz@nvidia.com> Co-authored-by: Christina Zhang <christinaz@nvidia.com>	2025-05-23 18:31:08 +08:00
bhsueh_NV	d69c662215	[Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP (#4575 ) * fix bug of qwen3 fp4 workflow with EP Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of qwen3_moe with ep Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-23 13:34:05 +08:00
pcastonguay	d7d455e7ea	[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243 ) * feat: Enabling dis serving with TRT backend with Python runtime Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing formatting Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing disagg mtp test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-22 22:01:06 -04:00
QI JUN	1e55d616da	Chore: clean up _gather_dp_requests_num method of PyExecutor (#4571 ) clean up _gather_dp_requests_num method of PyExecutor Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-05-23 08:37:39 +08:00
Mike Iovine	9c0de251db	[feat] Integrate Hopper chunked attention kernels (#4330 ) * Integrate chunked attention kernels Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix cache key Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix lint Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> --------- Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-22 17:10:57 -04:00
QI JUN	1e5d526db4	Chore: clean up _merge_dummy_request method of PyExecutor (#4438 ) * clean up _merge_dummy_request method of PyExecutor Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * update comment Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-05-22 18:19:07 +08:00
Yan Chunwei	4798d088d9	chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs (#3823 ) * partition LlmArgs Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * update backend Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-05-22 09:40:56 +08:00
Chuang Zhu	44cfd757b2	Agent interface impl for NIXL (#4125 ) * agentConnection Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> recv Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> agentState Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> NIXL interfaces Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> update cmakelists Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> nixl improve Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> remove cppzmq Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> transferAgent remove register Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> work for cache Test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> reduce sleep time Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> intergarte Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> nixl env Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix rebase error Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> cpp test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> stash for send metaData Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> loadRemoteMD after fetchRemoteMD Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> workaround for mixed gen and context Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> test_env Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> avoid port conflict in test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * use std::string Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * typo Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix transferAgentTest Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 09:09:41 +08:00
Zongfei Jing	dbaddb3a29	Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216 ) * Adding two-shot allreduce kernel and mnnvl multicasting buffergit gffe Signed-off-by: Shiyu Li <shili@nvidia.com> Adding comments Signed-off-by: Shiyu Li <shili@nvidia.com> Add unittest of the twoshot kernel. Signed-off-by: Shiyu Li <shili@nvidia.com> Update dispatch logic Signed-off-by: Shiyu Li <shili@nvidia.com> Use cpu barrier instead of GPU at init Signed-off-by: Shiyu Li <shili@nvidia.com> Merge dispatch logic fix Signed-off-by: Shiyu Li <shili@nvidia.com> Update the kernel to use GPU-managed buffer Signed-off-by: Shiyu Li <shili@nvidia.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Clean code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Clean up Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Simplify AllReduce interface Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Rename Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix warning Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Rename Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Skip ut for no_fusion Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Shiyu Li <shili@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Shiyu Li <shili@nvidia.com>	2025-05-22 03:42:36 +08:00
dongxuy04	4018806742	feat: large-scale EP(part 3 - refactor: FusedMoe for redundant expert) (#4495 ) refactor fused_moe for redundant expert Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-21 17:17:49 +08:00
Thor Johnsen	5d438be59a	[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936 ) * v1.5 Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> v1.5.4 Add back draft_overhead to spec dec stats Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * v1.5.5: fix CI error Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.6: fix CI error 8196 > 8192 Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * Address reviewer concerns Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * Address reviewer concerns Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * precommit run Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * v2.0: Address reviewer concerns Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v2.1: add fix from wili Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * Revert changes that require use of TypeAlias because that requires python version >= 3.10 Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> --------- Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-05-21 10:40:00 +08:00
Yuxian Qiu	62c16b6d37	fix: skip weights defined in create_weights for pp. (#4447 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-21 10:13:20 +08:00
Rohan Varma	3d940e77f0	[TRTLLM-5273]feat/Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug (#4290 ) * add bidirectional support and fix EarlyStopDecoder unsqueeze to be compatible with LogitsStorage Signed-off-by: Rohan Varma <rohanv@nvidia.com> * run pre-commit Signed-off-by: Rohan Varma <rohanv@nvidia.com> * instead of bidirectional flag use ModelConfig.is_generation Signed-off-by: Rohan Varma <rohanv@nvidia.com> * fix unit test to extract logits from correct dim Signed-off-by: Rohan Varma <rohanv@nvidia.com> --------- Signed-off-by: Rohan Varma <rohanv@nvidia.com>	2025-05-20 10:15:36 -07:00
Robin Kobus	8564c5a41f	refactor: Unify request order in TRT and PyTorch workflow (#4096 ) * chore: Partition context requests in MicroBatchScheduler Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! chore: Partition context requests in MicroBatchScheduler Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-20 18:49:27 +02:00
Daniel Cámpora	f038218f83	fix: Fix TRTLLMSampler beam width bug. (#4473 ) * Fix TRTLLMSampler. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added type hint. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-20 18:00:14 +02:00
dongxuy04	21aff2e313	feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384 ) * first commit of cpp moe loadbalance code Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add python bindings for moe load balance Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add python wrapper, ut and bug fixes Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add binding for layerId and update binding test Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add host tensor sharing and ut Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-20 17:53:48 +08:00
Lucas Liebenwein	de409e8468	[AutoDeploy] HF factory improvements (#4371 ) * [AutoDeploy] HF factory improvements Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> * improve monkey-patches and add unit tests Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> --------- Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-19 20:13:43 -07:00
kanghui0204	6f3922f318	feat: Low Precision Allreduce for PCIe based GPU (#4344 ) This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process. Signed-off-by: Hui Kang <hkang@nvidia.com> Co-authored-by: Hui Kang <hkang@nvidia.com>	2025-05-20 06:53:46 +08:00
liji-nv	58e405624a	[https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 (#3952 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-05-19 22:12:25 +08:00
Yukun He	98018f3bb9	Downgrade the logger level for fallback tactic warning. (#4440 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-19 18:26:54 +08:00
Yuxian Qiu	cf6cd940e5	feat: Add pp support for hybrid attn/mamba model (#4358 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-19 14:47:45 +08:00
shaharmor98	27afcb9928	add changes for fp8, nemotron-nas, API (#4180 ) Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-18 23:27:25 +08:00
Kaiyu Xie	3e08cd231c	fix: Remove real size allocation (#4396 ) Remove real size allocation Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-05-18 19:13:22 +08:00
Jinyang Yuan	b618e1f55b	perf: Eliminate the need for attention DP padding when possible (#3439 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: raccoonliukai <raccoonliu@tencent.com>	2025-05-17 13:30:55 +08:00
Lucas Liebenwein	7c85890ec7	[AutoDeploy] eager pattern matcher new pattern (#4370 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-16 12:35:44 -04:00
Lucas Liebenwein	0e872ef0b0	[AutoDeploy] fix: proper process group clean up (#4373 ) [AutoDeploy] proper process group clean up Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-16 12:35:25 -04:00
Netanel Haber	9cd8148f28	API Breaking Change + Readability: "decoder"->"sampler" (#4121 ) * decoder->sampler; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors * Breaking Change, as it changes public interfaces, main changes: * PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler. * Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py. --------- Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-05-16 23:52:25 +08:00
ixlmar	13b61405e8	fix: improve PyExecutor resource allocations (#4299 ) chore: restore symmetry of worker start/shutdown chore: fix return type of cal_max_tokens chore: type some more return values fix: free resources before re-claiming Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-16 16:28:10 +01:00
Lucas Liebenwein	8e4320ede5	[AutoDeploy] configurable cache resize (#4372 ) Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>	2025-05-16 10:07:09 -04:00
Robin Kobus	4e370a509a	refactor: Copy sequence lengths once in decoder setup (#4102 ) * refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update DecoderInputBuffers to remove duplicated buffers - Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability. - Adjusted references in generateRequestOptions.cpp to align with the new buffer structure. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move getEmbeddingBias to anonymous namespace Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Filter context requests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: GenerateRequestOptions using more fine-grained functions - Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests. - Updated the `operator()` method to utilize the new method, improving code clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMDecoder - Updated the `generate_request_options` call. - Updated the `make_decoding_batch_input_output` call. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove const where we modify input buffers - Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications. - Updated related function calls to ensure compatibility with the new parameter types. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-16 22:03:55 +08:00
Fridah-nv	bce281d592	feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) (#3638 ) * add docstring to summarize current rope support Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor: replace call_method, adjust inserting order of cos_sin_cache calculation node Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * add unit test for triton rope and ds rope Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * update rope matcher to match DS RoPE, add custom op for reference, add unit test case Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * cache cos[pos_idx].unsqueeze and sin[pos_idxs].unsqueeze Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor doc update Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * separate pattern matching and optimization for explicit and complex rope + minor updates Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * clean rope impl in repo Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * replace fused_flattened_mla_with_cache's rope impl with torch_apply_rope_with_qk_interleaving, update unit test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * separate layout infer and transpose to a new transformation Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * update rope_with_explicit_freqs and rope_with_input_interleaved to expose unsqueeze_dim and support match_rope_layout, add unit tests Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * solve merge conflict in transform.py, need to fix optimize_rope with cuda graph capture Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor clean up after rebase Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com> * fix pre-commit Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * support map to bnsd layout and infer unsqueeze_dim from op Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * fix cos/sin not the same across prompts in the same batch issue when mapping to flashinfer op Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * fix for unit test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * fix custom op input/output node ordering issue for DeepSeek V3 rope Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * clean code Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * minor Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * move flattening of cos_sin_cache to the graph, update flashinfer op docstring and test Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> * debug transform unit test failure Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> --------- Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>	2025-05-16 09:55:32 -04:00
NVJiangShao	a6f2a1e918	Fix test_fused_moe_w4afp8 (#4393 ) Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>	2025-05-16 17:21:33 +08:00
Daniel Cámpora	df19430629	chore: Mass Integration 0.19 (#4255 ) * fix: Fix/fused moe 0.19 (#3799) * fix bug of stream init Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix: Add pre-download of checkpoint before benchmark. (#3772) * Add pre-download of checkpoint before benchmark. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Add missing remote code flag. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Move from_pretrained to throughput benchmark. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Move download and use snapshot_download. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Removed trusted flag. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * Fix benchmark command in iteration log test. Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> --------- Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> * [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding with overlap scheduler (#3839) * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fuse Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * TRTLLM-4875 feat: Add version switcher to doc (#3871) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * waive a test (#3897) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> * fix: remote mpi session abort (#3884) * fix remote mpi session Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * fix Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * skip fp8 gemm for pre-hopper (#3931) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> * [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with overlap scheduler (#3975) * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * update multigpu list Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix namings Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * Doc: Fix H200 DeepSeek R1 perf doc (#4006) * fix doc Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> * update perf number Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> --------- Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> * Fix the perf regression caused by insufficient cache warmup. (#4042) Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * doc: Update 0.19.0 release notes (#3976) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * Optimize the AutoTuner cache access code to reduce host code overhead. (#4060) The NVFP4 Linear op is very sensitive to the host overhead. This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Update switcher (#4098) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * doc: update release notes (#4108) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> * docs:update 0.19 doc. (#4120) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> * docs:add torch flow supported model list. (#4129) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> * doc: Release V0.19 Perf Overview Update (#4166) Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com> * Fix readme of autodeploy. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update tensorrt_llm/_torch/pyexecutor/llm_request.py Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Revert mgmn worker node. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Change to disable_overlap_scheduler. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>	2025-05-16 10:53:25 +02:00

1 2 3 4 5 ...

282 Commits