TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Enwei Zhu	5b4852b7b5	feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-02 01:25:02 +08:00
Netanel Haber	2ce05c3ab4	'entered copyBlock' format string expects %s, pass string rather than int (#4820 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-01 08:54:33 -07:00
tomeras91	bf9cd11fd4	[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-01 13:56:44 +03:00
Daniel Cámpora	69c7fe8905	[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-01 03:32:43 +08:00
Enwei Zhu	25dde49c28	fix: EP load balancer with MTP layer and route offset by EP rank (#4767 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-01 00:07:44 +08:00
Chuang Zhu	f117d6abe9	Fabric Memory for KV Cache Transfer (#4717 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-30 15:50:21 +08:00
Thor Johnsen	55d56f8155	[JIRA-5226219][fix] Fix Bug in KV cache manager (#4596 ) Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com>	2025-05-29 22:03:20 -07:00
Jinyang Yuan	5339d367ce	[perf] Reduce the workspace size of FP4 activation scales for MoE (#4303 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-05-30 09:03:52 +08:00
Yilin Fan	31bb650298	Cherry pick feat/llama4 to main (#4739 ) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com> Co-authored-by: Chenfei Zhang <chenfeiz@nvidia.com>	2025-05-30 05:28:40 +08:00
Robin Kobus	79a94a28f9	refactor: unique_ptr instead of shared_ptr (#4697 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-29 22:49:35 +02:00
Jhao-Ting Chen	fcadce9f8d	[fix] Eagle-2 LLMAPI pybind argument fix. (#3967 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-29 12:23:25 -07:00
Arthur Rasmusson	812b1abf86	feature: KV Cache GPUDirect Storage (#3209 ) Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com.> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-05-28 23:27:43 +00:00
Robin Kobus	12763779c4	chore: Clean up cpp runtime (#4449 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-28 16:32:59 +02:00
ixlmar	fbe4db207d	feat: forward exceptions to Python and catch OOMs (#4497 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-28 11:58:10 +02:00
Kaiyu Xie	b800adc65c	Fix: hang on disagg when MNNVL two-shot AllReduce is enabled (#4678 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-05-28 13:03:53 +08:00
yunruis	29ac4c20e0	fix: fix dsr1 min lat cga ar rate drop(0.2) (#4561 ) Signed-off-by: yunruis <yunruis@nvidia.com>	2025-05-27 21:59:57 +08:00
Robin Kobus	93a54457ac	[nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608 ) (#4621 ) - Moved sorting related logic to a dedicated function for better clarity and maintainability. - Enhanced sorting logic to separate finished context requests from ongoing ones before sorting by Lora task ID. - Updated function documentation to reflect the sorting behavior and its purpose. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-26 17:10:55 +08:00
Robin Kobus	502758aaa9	fix: Handle additional model outputs based on pipeline parallel rank (#4498 ) - Only allocate additional outputs on last pipeline parallel rank in trtGptModelInflightBatching and executorImpl. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-26 09:04:40 +02:00
Perkz Zheng	4d711be8f4	Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564 ) * move cubins to LFS Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add sliding-window-attention generation-phase kernels on Blackwell Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-26 09:06:33 +08:00
shaharmor98	2b8f6d2871	Fix snake case format (#4559 ) fix snake case format Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-25 17:57:17 +08:00
Chuang Zhu	b60846b47d	fix datatype check (#4606 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-24 08:36:17 +08:00
Yao Yao	ef763b0ddc	fix: rename some terms (#4534 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-05-23 23:23:49 +08:00
Robin Kobus	7b2818a47b	refactor: CreateNewDecoderRequests (#4452 ) * refactor: CreateNewDecoderRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Consolidate request generation in CreateNewDecoderRequests - Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests. - Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options. - Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy. - Cleaned up associated includes and references throughout the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Simplify request handling in CreateNewDecoderRequests - Removed the generateRequestOptions method and integrated its logic directly into the operator() method. - Updated the request generation process to improve clarity and reduce redundancy. - Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests - Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling. - Removed redundant request generation logic from the operator() method, streamlining the process. - Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests - Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management. - Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality. - Cleaned up unnecessary includes and improved code organization for better maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters - Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder. - Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling. - Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests - Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency. - Updated bindings and method signatures for decoder stream handling. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-23 22:54:37 +08:00
zhhuang-nv	8452775db8	[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535 ) * optimize kv cache reuse workflow for MLA write kv cache first and only call up-projection GEMM once relax contiguous requirements of k/v for setting paged kv cache return two contiguous tensors when loading MLA KV Cache Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * support fp8 kv cache for MLA kv cache reuse Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-23 19:47:50 +08:00
Anthony Chang	bbea2647b1	Qwen3 supports TRTLLM FP4 MoE backend (#4530 ) * MoE TRTLLM backend for Qwen3 Signed-off-by: Anthony Chang <anchengc@nvidia.com> * add extra moe_backend to test Signed-off-by: Anthony Chang <anchengc@nvidia.com> * address comments Signed-off-by: Anthony Chang <anchengc@nvidia.com> * conditionally compile kernels on newer archs Signed-off-by: Anthony Chang <anchengc@nvidia.com> * missing positional arg Signed-off-by: Anthony Chang <anchengc@nvidia.com> * Update the routing kernels Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Revise usage of TLLM_LOG_ERROR Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Add unit test for Qwen3 moe (trtllm_gen backend) Signed-off-by: Christina Zhang <christinaz@nvidia.com> * improve weight processing speed of moe_backend=TRTLLM; roughly 2x Signed-off-by: Anthony Chang <anchengc@nvidia.com> * tidy and minor fix Signed-off-by: Anthony Chang <anchengc@nvidia.com> * temporarily disable accuracy test that has known issue Signed-off-by: Anthony Chang <anchengc@nvidia.com> --------- Signed-off-by: Anthony Chang <anchengc@nvidia.com> Signed-off-by: Christina Zhang <christinaz@nvidia.com> Co-authored-by: Christina Zhang <christinaz@nvidia.com>	2025-05-23 18:31:08 +08:00
Bo Li	9ae705af1b	perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482 ) * Add Julien's origina kernel. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Get rid of UpdateKVCache functionality. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add kernels. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add torch OP. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Update cmake. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Torch OP must use double as argument dtype. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add unittest. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add unittest. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Fix misaligned access when head_dim=64. In this case, numElemsPerThread=2, numVecPerThread=0. But the store code incorrectly perform vectorized store, some threads (e.g., lane1) issue store to address that is not aligned to 64 bit. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Remove unroll (compiler can do that). Cleanup code. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add switch for interleave. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Refactor vectorized load/store. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Implement is_neox. Result not correct yet. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Fix is_neox=True. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add q_weight and k_weight. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> --------- Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-05-23 15:31:04 +08:00
CarstyYou	ef280e687e	[feat] support fp8 blockscale gemm on sm89 (#4481 ) * [feat] integrate ada blockwise gemm Signed-off-by: CarstyYou <xiy@nvidia.com> * [fix] align scale M Signed-off-by: CarstyYou <xiy@nvidia.com> * [feat] swizzle mma output Signed-off-by: CarstyYou <xiy@nvidia.com> * [test] add ut for sm89 Signed-off-by: CarstyYou <xiy@nvidia.com> * [delete] remove useless comments Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] codestyle Signed-off-by: CarstyYou <xiy@nvidia.com> * [fix] fix review comments Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] fix license Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] fix license Signed-off-by: CarstyYou <xiy@nvidia.com> --------- Signed-off-by: CarstyYou <xiy@nvidia.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>	2025-05-23 10:39:10 +08:00
nv-guomingz	e3a534d0ee	chore: guardword clean for header file. (#4540 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-05-23 10:08:14 +08:00
pcastonguay	d7d455e7ea	[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243 ) * feat: Enabling dis serving with TRT backend with Python runtime Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing formatting Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing disagg mtp test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-22 22:01:06 -04:00
dongxuy04	338744fba6	fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer (#4573 ) fix moe possible race cond and add bypass worker thread for no updates Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-23 09:24:23 +08:00
nv-guomingz	3549b68c1c	chroe:clean useless flag (#4567 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-05-23 07:05:15 +08:00
Mike Iovine	9c0de251db	[feat] Integrate Hopper chunked attention kernels (#4330 ) * Integrate chunked attention kernels Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix cache key Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix lint Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> --------- Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-22 17:10:57 -04:00
Chuang Zhu	558eaecf16	fix sequence data race (#4565 ) stash for debug broken promise Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 23:13:48 +08:00
Chuang Zhu	44cfd757b2	Agent interface impl for NIXL (#4125 ) * agentConnection Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> recv Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> agentState Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> NIXL interfaces Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> update cmakelists Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> nixl improve Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> remove cppzmq Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> transferAgent remove register Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> work for cache Test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> reduce sleep time Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> intergarte Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> nixl env Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> fix rebase error Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> cpp test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> stash for send metaData Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> loadRemoteMD after fetchRemoteMD Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> workaround for mixed gen and context Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> test_env Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> avoid port conflict in test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * use std::string Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * typo Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix transferAgentTest Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 09:09:41 +08:00
Nikita Korobov	e1b42be3d1	fix: TRT-LLM Gen dtype declaration (#4503 ) Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-05-21 23:56:37 +02:00
Zongfei Jing	dbaddb3a29	Adding two-shot allreduce kernel and mnnvl multicasting buffer (#4216 ) * Adding two-shot allreduce kernel and mnnvl multicasting buffergit gffe Signed-off-by: Shiyu Li <shili@nvidia.com> Adding comments Signed-off-by: Shiyu Li <shili@nvidia.com> Add unittest of the twoshot kernel. Signed-off-by: Shiyu Li <shili@nvidia.com> Update dispatch logic Signed-off-by: Shiyu Li <shili@nvidia.com> Use cpu barrier instead of GPU at init Signed-off-by: Shiyu Li <shili@nvidia.com> Merge dispatch logic fix Signed-off-by: Shiyu Li <shili@nvidia.com> Update the kernel to use GPU-managed buffer Signed-off-by: Shiyu Li <shili@nvidia.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Clean code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Clean up Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Simplify AllReduce interface Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Rename Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix warning Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Rename Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Skip ut for no_fusion Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Refine Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Shiyu Li <shili@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Shiyu Li <shili@nvidia.com>	2025-05-22 03:42:36 +08:00
Ruoqian Guo	db7446fda7	Feat: add deep_gemm swapab Kernel (#4430 ) * feat: add deepgemm_swapab feat: add fp8_gemm_kernel_swapab Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> feat: set threshold for deepgemm and deepgemmswapab Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * docs: update README.md Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * fix: std::runtime_error needs #include <stdexcept> Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * chores: remove the redundant code Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * feat: support for dense deep_gemm swapab Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> * chores: remove redundant code Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> --------- Signed-off-by: Ruoqian Guo <ruoqiang@nvidia.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-05-21 10:48:43 +08:00
Thor Johnsen	5d438be59a	[TRTLLM-5000][feat] Pytorch implementation of ngram drafter (#3936 ) * v1.5 Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> v1.5.4 Add back draft_overhead to spec dec stats Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * v1.5.5: fix CI error Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.6: fix CI error 8196 > 8192 Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * Address reviewer concerns Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * Address reviewer concerns Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * precommit run Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> * v2.0: Address reviewer concerns Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v2.1: add fix from wili Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * Revert changes that require use of TypeAlias because that requires python version >= 3.10 Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> --------- Signed-off-by: Thor Johnsen <41591019+thorjohnsen@users.noreply.github.com> Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-05-21 10:40:00 +08:00
Perkz Zheng	426f6fd2bc	Feat: add chunked-attention kernels on Blackwell (#4394 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add chunked-attention kernels on blackwell Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> fix Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-21 10:16:46 +08:00
djns99	a030a898d1	perf: Fuse gemm setup function for SM90/SM100 MOE plugin path (#4146 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-05-21 10:00:36 +08:00
Robin Kobus	8564c5a41f	refactor: Unify request order in TRT and PyTorch workflow (#4096 ) * chore: Partition context requests in MicroBatchScheduler Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! chore: Partition context requests in MicroBatchScheduler Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-20 18:49:27 +02:00
dongxuy04	21aff2e313	feat: large-scale EP(part 2: MoE Load Balancer - core utilities) (#4384 ) * first commit of cpp moe loadbalance code Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add python bindings for moe load balance Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add python wrapper, ut and bug fixes Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add binding for layerId and update binding test Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add host tensor sharing and ut Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-20 17:53:48 +08:00
kanghui0204	6f3922f318	feat: Low Precision Allreduce for PCIe based GPU (#4344 ) This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process. Signed-off-by: Hui Kang <hkang@nvidia.com> Co-authored-by: Hui Kang <hkang@nvidia.com>	2025-05-20 06:53:46 +08:00
Yuxian Qiu	c8e062bfd3	fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. (#4399 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-05-19 14:25:36 -07:00
Perkz Zheng	1c5b0d6a13	[Feat] add chunked-attention kernels on Hopper (for llama4) (#4291 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add mtp for fmha_v2 MLA kernels and add chunked-attention support for hopper fmha kernels Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-05-19 09:57:10 -07:00
Faraz	7656af1b57	[TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) (#4335 ) * add mixtral7x8b fp8 test with fixed cutlass fp8 moe gemm Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * update cutlass versions Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * added internal cutlass with fix and docker update Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> * added mixtral to pro 6000 Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> --------- Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-05-19 08:56:21 -07:00
liji-nv	58e405624a	[https://nvbugs/5123103 ][fix] Fix torch compile for DeepSeekV3 (#3952 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-05-19 22:12:25 +08:00
Shi Xiaowei	df2798e0c3	feat: NIXL interface integration (#3934 ) NIXL interfaces Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-05-19 18:18:22 +08:00
Void	62bb7f9286	fix potential issues in allreduce fusion kernel and ut (#4226 ) fix allreduce fuison kernels and ut Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> --------- Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>	2025-05-19 17:38:29 +08:00
Jinyang Yuan	b618e1f55b	perf: Eliminate the need for attention DP padding when possible (#3439 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: raccoonliukai <raccoonliu@tencent.com>	2025-05-17 13:30:55 +08:00
Robin Kobus	4e370a509a	refactor: Copy sequence lengths once in decoder setup (#4102 ) * refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update DecoderInputBuffers to remove duplicated buffers - Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability. - Adjusted references in generateRequestOptions.cpp to align with the new buffer structure. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move getEmbeddingBias to anonymous namespace Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Filter context requests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: GenerateRequestOptions using more fine-grained functions - Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests. - Updated the `operator()` method to utilize the new method, improving code clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMDecoder - Updated the `generate_request_options` call. - Updated the `make_decoding_batch_input_output` call. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove const where we modify input buffers - Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications. - Updated related function calls to ensure compatibility with the new parameter types. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Copy sequence lengths once in decoder setup Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-16 22:03:55 +08:00
Nikita Korobov	fa3879629e	feat: TRT-LLM Gen integration for BMM and MoE refactoring (#4280 ) - Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator. - Refactors TRT-LLM Gen MoE runner to call to BMM interface - The accuracy is verified for DeepSeek R1 FP4 Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-05-16 13:31:53 +02:00
ixlmar	f7ad49bb9b	chore: improve log-level setting UX (#4352 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-05-16 09:47:44 +01:00
NVJiangShao	6cc3f2093a	Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow (#4348 ) Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>	2025-05-16 10:02:30 +08:00
Erin	c44cf34373	fix: update checks that broke medusa tests when use_py_session=True (#4339 ) fix check Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-15 15:47:28 -07:00
yuxianq	4f8afe4cc6	feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-16 04:16:53 +08:00
yuxianq	0e87fcc228	refactor: use x is None instead of x == None. (#4244 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-15 20:00:04 +08:00
zhhuang-nv	97bc680cd8	feat: support kv cache reuse for MLA (#3571 ) * support kv cache reuse for MLA load compressed_kv and k_pe and do up-projection use 192/128 head size MLA context kernel support Blackwell and Hopper now Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * add CI test Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2 Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * use GPTJ style RoPE for MLA Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix rebase error and some docs Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix kv_lens Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * tiny fix Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix: use normal device memory instead of pinned memory for unit test Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> * fix L0 tests Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * fix torch compile after rebase Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments again Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com> Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com> Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-05-15 15:22:21 +08:00
Zhanrui Sun	5dc3b539ba	infra: Down the gcc toolset version from 13 to 11 (#4114 ) * Down the gcc toolset version from 13 to 11 Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> * Update rocky8 images Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> --------- Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-05-15 11:08:51 +08:00
qsang-nv	0fd59d64ab	infra: open source fmha v2 kernels (#4185 ) * add fmha repo Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix format Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix code style Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix header Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix header kernel_traits.h Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * add .gitignore file Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * add SLIDING_WINDOW_ATTENTION Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix style Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * fix format Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * update setup.py Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> * update build_wheel.py Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> --------- Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> Signed-off-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>	2025-05-15 10:56:34 +08:00
QI JUN	498ce8a056	Revert "feat: Low Precision Allreduce for PCIe based GPU" (#4340 ) Revert "feat: Low Precision Allreduce for PCIe based GPU (#3851)" This reverts commit `5e634dd1bd`.	2025-05-15 09:52:39 +08:00
hlu1	7fb0af9320	[fix] Remove stale cublas heuristics (#4326 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-14 17:35:51 -07:00
Robin Kobus	d31fefde2c	[TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow (#4092 ) * chore: Remove GptSession/V1 from TRT workflow Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove stateful decoders Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession buffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession utils Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession kernels Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove V1 GPT models from tests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove gptSessionBenchmark from scripts and docs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove gptSession IO classes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession from test lists Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove GptSession from docs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove useless encoder test Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove mActualBatchSize from DecoderState Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove static batching from ExecutorTest - Updated `validateContextLogits` and `validateGenerationLogits` functions to remove the `batchingType` parameter. - Adjusted related test functions to reflect the changes in parameter lists. - Cleaned up the instantiation of test cases to eliminate unnecessary batchingType references. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-14 23:10:04 +02:00
Robin Kobus	c67da1fbaa	fix: Eagle decoding in TRT flow (#4229 ) * fix: EagleBuffers lifetime issue Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Clean up Eagle kernel parameters Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Eagle draft tokens init Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Add check for updated sequence length in TrtGptModelInflightBatching Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Skip check for beam search Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-14 16:10:49 +02:00
DylanChen-NV	206f82115d	[bug/5247505] fix: CP accuracy on Blackwell (#4188 ) * fix xqa params for cp Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * add test Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * add test Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * try adding B200 multi gpu test Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * add accuracy tests for cp Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> --------- Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-05-14 17:40:50 +08:00
kanghui0204	5e634dd1bd	feat: Low Precision Allreduce for PCIe based GPU (#3851 ) This PR adds a customized allreduce to TensorRT-LLM. The new allreduce is used for communication on PCIe-based GPUs via low-precision quantization, which can accelerate the PCIe allreduce process. Signed-off-by: Hui Kang <hkang@nvidia.com> Co-authored-by: Hui Kang <hkang@nvidia.com>	2025-05-14 16:45:43 +08:00
Barry Kang	20b42912ce	[TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper (#4123 ) Support DeepSeek-R1 W4A8 on Hopper Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Co-authored-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-05-14 15:48:07 +08:00
Perkz Zheng	e8d7834c50	fix: [https://nvbugspro.nvidia.com/bug/5238626 ] illegal memory address when running llama 4 with cuda graph enabled (#4101 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-13 14:58:54 +08:00
pcastonguay	9643be5f20	[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156 ) * feat: Add per-request stats support with PyT backend Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Adding unit test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing stats unit test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing test with overlap Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-12 21:35:15 -04:00
Simeng Liu	286a789549	feat: Add heuristic for GroupRMSNorm kernel selection. (#4047 ) * feat: Add heuristic for GroupRMSNorm kernel selection. Implements a logistic regression model to dynamically select between: - GroupRMSNormBaseKernel: Allocates warps proportional to sum of dimensions (better SM occupancy in most cases) - GroupRMSNormLargeBatch: Allocates warps proportional to max dimension (better block scheduling in large batch scenarios) Selection heuristic considers batch size, allocated warps, and scheduling efficiency on the current GPU architecture. Models for Compute Capability 9.x and 10.x are trained base on nsys kernel runtime data. The default kernel selection is the base kernel. The python operator group_rms_norm will use the heuristic by default. User can pick to use the base or large batch kernels as well. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Address the comments. Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-13 08:52:53 +08:00
wili	eba3623a54	Feat: Variable-Beam-Width-Search (VBWS) part4 (#3979 ) * feat/vbws-part4-v1.8: rebase Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * feat/vbws-part4-v1.9: fix incorrect output when using short output length Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.1: remove useless variables Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.2:fix incorrect output when using short output length Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.3: rebase Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.4: rebase Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> * v1.9.5: remove API change Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> --------- Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-05-12 22:32:29 +02:00
Yixin Dong	c90ebadd84	feat: Support the Structural Tag in guided decoding (#4066 ) * finish Signed-off-by: Ubospica <ubospica@gmail.com> * update Signed-off-by: Ubospica <ubospica@gmail.com> * update Signed-off-by: Ubospica <ubospica@gmail.com> * fix Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * exc overlap scheduler Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * add test Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> * fix api ref Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> --------- Signed-off-by: Ubospica <ubospica@gmail.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-05-12 17:24:50 +08:00
Perkz Zheng	3f29d2f006	Feat: support exporting softmax statistics and update the kernel-selection heuristic (#4155 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * support exporting softmax statistics and update the kernel-selection heuristic Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-12 15:31:46 +08:00
zhhuang-nv	0a36db0aa4	[fix] trtllm-gen mla kernel warnings (#4119 ) fix trtllm-gen mla kernel warnings Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-09 20:21:28 +08:00
NVJiangShao	57b2fe2019	[#4085 ][fix] Fix `apply_per_channel_scale` for extremely large input sequence length. (#4089 ) Fix apply_per_channel_scale for extremely large input seq length. Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: crazy-JiangDongHua <759421566@qq.com>	2025-05-09 11:57:01 +08:00
Yi Zhang	91bf5e6a8e	[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804 ) Add Piecewise CUDA Graph Support Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>	2025-05-09 11:04:01 +08:00
Yukun He	5b61486d87	chore: Clean up the legacy DeepseekAllreudceFusionOp. (#4081 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-09 10:20:41 +08:00
forrestl	9477661f4c	Support RingAttention in the BertAttention plugin and the DiT model (#3661 ) support ring attn for bert_attention plugin and dit model Signed-off-by: ChunhuanLin <lch_xdu@163.com>	2025-05-09 08:06:54 +08:00
chenfeiz0326	7f5716ef83	Cherry-pick trtllm-gen from feat/llama4 to main (#4086 ) * feat: TRT-LLM Gen FP8 MoE Llama4 Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> * feat: TRT-LLM Gen llama4 MoE Top1 routing Signed-off-by: Jiqun Tu <jtu@nvidia.com> * feat: add per tensor FP8 TRT-LLM Gen GEMMs Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> * Update Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Update Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Add guard for routingIndicesClusterKernel Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Guard sm90+ for routingkernels Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> * Guard sm90+ for routingkernels Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> --------- Signed-off-by: Nikita Korobov <nkorobov@nvidia.com> Signed-off-by: Jiqun Tu <jtu@nvidia.com> Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Co-authored-by: Nikita Korobov <nkorobov@nvidia.com> Co-authored-by: Jiqun Tu <jtu@nvidia.com>	2025-05-08 14:13:01 -07:00
Yukun He	bb7bcc75c2	feat: Fallback to NCCL for various patterns when input size is large. (#4080 ) * Fallback to NCCL for various patterns when input size is large. Move the previous implementation to cpp side. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Revising. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-08 11:13:13 -07:00
Simeng Liu	bb766eca0a	feat: Reduce branch overhead in groupRMSNorm kernels (#4067 ) * feat: Reduce branch overhead in groupRMSNorm kernels * Fix race condition with sm < 90 and avoid all threads in one warp writing to the same shared memory. Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-08 00:55:27 +08:00
Yan Chunwei	0c26059703	chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732 ) * beam_width and max_new_token Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove beam_width Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove min_length Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> * remove return_num_sequences Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> --------- Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-05-07 13:20:25 +08:00
Chuang Zhu	09a28becae	fix cache buffer (#3942 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-07 09:49:44 +08:00
Daniel Cámpora	c56a2aca46	fix: Properly get decoding mode according to same logic as cpp. (#4026 ) * Properly get decoding mode according to same logic as cpp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Cross reference getDecodingMode implementations in pytorch - cpp. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Better bindings for DecodingMode. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Revert to version in main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Revert configuration.py. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-06 21:53:17 +08:00
Robin Kobus	72057a0a64	[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625 ) * disable overlap in encoder Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: invokeGatherBatch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: overlap same batch Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: add enableTrtOverlap to ExecutorConfig Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * disable overlap for beam search and spec decode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * skip overlap tests with beam search or speculative decoding Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * enable overlap in GptChunkedLongContextTests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Enable overlap in gptManagerBenchmark Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Improve early exit Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use OptionalRef for newOutputTokens tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * feat: Add overlap scheduling support to TRTLLMDecoder - Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter. - Modified the decoder's internal logic to utilize the overlap scheduling feature. - Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach. - Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: allNewTokens in PP Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-06 15:06:46 +02:00
Robin Kobus	e943ad5a2a	[https://nvbugs/5247414 ] fix: draft/target probs shape (#4055 ) Shape was wrongly changed in DecoderState introduction. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-06 09:56:43 +02:00
Yuan Tong	4b6c19737b	feat: support add internal cutlass kernels as subproject (#3658 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-05-06 11:35:07 +08:00
Mike Iovine	8caf200322	[fix] Skip debugCheckSemaphores in stream capture mode (#4032 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-05 10:24:10 -07:00
Robin Kobus	ccff86068e	fix: request termination in pipeline parallelism (#3892 ) * feat: Implement synchronous request termination in batch manager - Added `terminateRequestSync` method to `TrtEncoderModel` and `TrtGptModelInflightBatching` for handling request termination in the next `forwardSync` call. - Updated existing request termination logic to utilize the new synchronous method, ensuring generated tokens are cleared appropriately. - Enhanced logging for clarity in token management during request processing. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! feat: Implement synchronous request termination in batch manager Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: MockedModelCancelRequest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! feat: Implement synchronous request termination in batch manager Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: terminate with timeout Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! feat: Implement synchronous request termination in batch manager Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * docs: Update doc string for allottedTimeMs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-05 21:51:41 +08:00
Robin Kobus	9f9edd783c	refactor: Introduce MpiTag enumeration and update MPI function signatures (#3893 ) * refactor: Move executor recv functions into classes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance MPI logging and error handling - Updated MPI logging to include destination and tag information for better traceability during send and receive operations. - Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests. - Improved code structure for clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Introduce MpiTag enumeration and update MPI function signatures - Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability. - Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags. - Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity. - Removed redundant MPI tag constants from several classes, streamlining the code. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Rename tags for consistency Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-04 13:24:29 +02:00
Robin Kobus	403370af62	refactor: Move ModelSpec to core library (#3980 ) * refactor: Move ModelSpec from tests to core library Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move ModelSpec from runtime to separatedir Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use new bindings path and clean up Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Updated licenses Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove script_dir from path Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-04 01:39:09 +08:00
Daniel Cámpora	c7cf032b89	fix: Move all casters to customCasters. (#3945 ) * Move all casters to customCasters. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use customCasters in all bindings. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added customCasters to userbuffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 19:08:28 +08:00
Simeng Liu	873c7532fd	feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 ) * feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together: ```python input_a, input_b, ... = group_rms_norm([input_a, input_b, ...]) ``` All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead. This MR provides two implementations: GroupRMSNormKernel: Optimized for small-to-medium batch sizes GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-02 13:25:30 +08:00
Erin	8fe7bdeacf	feat: LogitsProcessor in PyTorch backend (#3145 ) * support lp in pytorch backend Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * fix tp Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-01 14:15:30 -07:00
Erin	83f37614ef	feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388 ) * support return logprob in llmapi Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> update and add test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> stability test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * revert removal of old flag Signed-off-by: Erin Ho <erinh@nvidia.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Erin Ho <erinh@nvidia.com>	2025-05-01 12:47:14 -04:00
YueWeng	b1621e8d4e	feat: add relaxed acceptance for DS (#3865 ) * add relaxed acceptance for DS R1 Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * clean and update docs Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * Modified based on review Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix mtp manager issue Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> --------- Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-05-01 21:50:36 +08:00
hlu1	1294ecb12f	Add attention workspace memory check (#3970 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-04-30 23:51:09 -07:00
Kate Cheng	7dbe618683	feat: Add multimodal embedding field in LlmRequest (#3855 ) * Add a new param to LlmRequest and Request to natively support mm Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * update comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update tests to match the new LlmRequest constructor parameters Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify unitTest and modify mm_embeding's dict name in llama4 Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix LlmRequest initialization in kvCacheManagerTest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up code for promt_tuning_config Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up prompt_tuning_config in GenerationRequest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-01 12:23:30 +08:00
Yukun He	9cc5922a0b	Clean up allreduce op in Deepseek V3 model. (#3829 ) * Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op. * Minor revision of moe_allreduce op argument names. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 07:56:36 +08:00
nv-guomingz	dd959de0fd	chore: update internal_cutlass_kernels. (#3973 ) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>	2025-04-30 22:13:17 +08:00
Ming Wei	ed887940d4	infra: open source XQA kernels (#3762 ) Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which consists of two parts: 1. NVRTC glue code 2. XQA kernel code During TensorRT-LLM build, XQA kernel code is embedded as C++ arries via gen_cpp_header.py and passed to NVRTC for JIT compilation. Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>	2025-04-30 18:05:15 +08:00
Bo Li	a80d2373a3	fix: [https://nvbugspro.nvidia.com/bug/5243482 ] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. (#3862 ) * Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp. For Generation MLA, if FlashMLA is used, do not check the existence of FMHA based MLA kernel. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Run pre-commit. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compile error. Signed-off-by: Bo Li <bobboli0202@gmail.com> --------- Signed-off-by: Bo Li <bobboli0202@gmail.com>	2025-04-30 14:27:38 +08:00
djns99	cc989ea49f	perf: Optimise MOE prologue to use fused setup function (#3790 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-04-30 11:44:48 +08:00
Pamela Peng	f98a80f9d9	sync internal cutlass kernel changes (#3968 ) Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>	2025-04-30 08:57:28 +08:00
xiweny	68a19a33d4	TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770 ) * upgrade cutlass to 3.9 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * update internal cutlass kernels Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * fix file Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * remove unnecessary change Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * update hash Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> --------- Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>	2025-04-29 11:19:11 -04:00
Dom Brown	8709fe8b53	chore: bump version to 0.19.0 (#3598 ) (#3841 ) test: add test cases for 0.19 release (#3608) * fix test name * add quickstart test for nemotron-ultra * add rcca multi-node test case for deepseek-v3 * add rcca info --------- squash (#3642) fix: nvbugs/5187237: fix deterministic mode crash (#3448) * nvbugs/5187237 nvbugs/5112075: fix deterministic mode error * remove waive * Revert "remove waive" This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac. * revert ar fusion --------- update fp8 doc (#3647) tests: change qa perf test to trtllm-bench (#3619) fix: FP8 quantized lm_head (NvBug 5214229) (#3567) infra: Add PR approval protection for the release branch (#3634) fix: nvbugs/5231298: pytorch allreduce issue (#3673) Fix: nvbugs/5222698 variable not defined (#3630) * Fix: nvbugs/5222698 variable not defined * Tidy code --------- test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685) test:restore fp8 kv cache testing for L0 (#3671) doc: Update DeepSeek perf docs (#3693) * Update DeepSeek perf docs * update * Apply suggestions from code review --------- tests: waive test_llm_multi_node (#3664) fix: update test_user_buffers_mm_add_prologue atol (#3711) Fix: cherry-pick hmac encryption from main branch (#3635) * security fix cherry-pick changes from main * fix hmac in remote mpi session (#3649) --------- Un-waive DS-V3-Lite tests. (#3621) fix: FP8 kv accuracy (#3675) * fix FP8 kv accuracy * update doc --------- Fix script options for engines. (#3622) unwaive multi-node test (#3721) chore : Split more tests out of gpt tests (#3524) (#3674) doc:add torch examples link into torch backend documentation (#3749) test: Get Eagle tests working (#3593) (#3722) Waive L0 test (#3756) waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656) Update ds v3 parameters in stress test. (#3676) waive gemma on L20 (#3766) https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758) Include Qwen2VLDecoderLayer in the smooth_qwen2_model function. fix: PP4 fixes and cleanup (#3688) remove benchmark test list (#3643) skip disagg deepseek test if sm!=90 (#3720) test: skip failed cases on B200 (#3710) * add skip condition to tests * fix error --------- test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718) * skip_pre_ada for fp8 cases * update * update after rebase --------- add know issue to deepseek doc. (#3800) Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761) Waive L0 tests (#3826) fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793) * Reduce memory usage in fused moe op associated with AutoTuning. * Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens. * Add free_memory logic of workspace in min_latency_mode fused moe path. * Fix fused_moe fallback issue. (#3652) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. --------- [doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797) Fix pre-commit Fix again Address some review comments for the MI Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-29 16:57:22 +08:00
zhhuang-nv	94e6167879	optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-04-29 14:17:07 +08:00
Perkz Zheng	35c5e4f1c5	feat: add CGA reduction fmha kernels on Blackwell. (#3763 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add trtllm-gen kernels for eagle3 and also kernels with cga-reduction Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address the comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-04-29 10:43:54 +08:00
Jinyang Yuan	dafc28fb85	fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863 )	2025-04-29 09:09:43 +08:00
Yukun He	5502a522d2	Fixing minor typo in allreduce kernel selection (#3912 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-04-28 23:06:49 +08:00
Chuang Zhu	e2318756ed	cacheTransceiver buffer manager (#3798 ) * cacheTransceiver buffer manager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix args Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * cpp kvCacheManager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-27 11:48:15 +08:00
qixiang-99	ecd621fb0a	feat: Add head size 72 support for QKV Preprocessing kernel (#3743 ) * refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow - Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels. - Added support for head size 72 in unfused attention kernels(QKVPreprocessing). - Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72). Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * update: Waive head_dim=72 test cases and enhance test representation - Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues. - Introduced a custom __repr__ method in the Scenario class for pytest substring match. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-04-25 11:07:40 -07:00
dongxuy04	16535991b2	feat: Add MNNVL MoE A2A support (#3504 ) * add MNNVL memory mapping support Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add more MPI environment for trtllm-llmapi-launch Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MoE communication and prepare kernels Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MNNVL AlltoAll support for DeepSeekV3 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add output dump for throughput benchmark Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * support dynamic kernel launch grid Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments #2 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-04-25 17:29:08 +08:00
Shi Xiaowei	1d5178814b	Fix: Revert commit `25f9669` (#3832 ) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-04-24 14:03:20 +08:00
Julien Debache	0c6c8eaffd	fix: 5197419 and removed unused runtime kernels (#3631 ) - Removed kernel under test call, as it was not needed - Removed kernel itself - Removed kernel tests - Removed other unused kernels and their tests - Some static analysis clean up	2025-04-23 18:04:50 +02:00
Daniel Cámpora	1299f27c74	fix: Fix C++ decoder synchronization in PyTorch (#3106 ) * Use updateDecoderBuffers in python decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix synchronize in trtllm decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Enable by default. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use guided_decoder to setup seqslots and free them. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use always decode_async and update_requests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update decoder buffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix speculative decoding tests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Send new_tensors_host instead of assuming dict. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make default False in enable_trtllm_decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Partially fix mtp, partially fix py_executor. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update request states before sending disagg ctx cache. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix disagg test for torch decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make isend_tensor_list and recv_tensor_list for sending the tensors_host. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add disagg serving case to guided decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Get overlap scheduling to work. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update cutlass to main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update after rebasing. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update to use decode async and update requests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Properly pass information to update_requests Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make disaggregated serving a step closer to working. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase and format. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Copy new device tokens more pythonic. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Restore MTP add dummy reqs. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add ordereddict import to py_executor. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added seq slot manager. Add test. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use transmission for single tensor except when list of tensors is received. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add TRTLLMDecoder allocation to estimate max kv cache tokens. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add stream synchronization Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Format Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add decoder creation to estimate max kv. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update submodule UCXX inline with main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-04-23 23:55:27 +08:00
Shi Xiaowei	25f96697ad	fix: Intercept the error of multi-rank binding to a single card (#3525 )	2025-04-23 15:50:18 +08:00
Zongfei Jing	1e5af736ea	Add smart router for moe (#3641 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-23 12:21:59 +08:00
Perkz Zheng	0324a7389d	add QMMA-based MLA kernels (#3752 )	2025-04-23 10:18:19 +08:00
William Tambellini	44bff85e08	Fix double link to fp8_blockscale_gemm_src (#3707 ) Fix https://github.com/NVIDIA/TensorRT-LLM/issues/3690 Signed-off-by: William Tambellini <wtambellini@sdl.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-23 10:16:07 +08:00
Zongfei Jing	7eee9a9d28	doc: Update doc for Deepseek min latency (#3717 ) * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Update doc for min latency deepseek Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Throw exception for RouterKernel when not running on sm90+ Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-22 23:07:59 +08:00
Yukun He	0ae7017342	Unify two versions of AllReduce custom op (#3032 ) * Rewrite unit test for unified allreduce op. Removing the legacy unit test. * Revise formats, fusion_op bindings. Put all tensors as optional inputs. * Move the MoeAllreduceOp to a separate custom op. * Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions. * Add more TODOs, fixing minor bugs, and remove legacy code. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-22 21:58:42 +08:00
Robin Kobus	8340657ae4	refactor: Introduce DecoderOutputBuffers per batch (#3506 ) * refactor: Restructure DecoderBuffers and DecoderStepAsyncSend - Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`. - Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity. - Refactored related methods to align with the new class structure, improving maintainability and readability of the code. These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend - Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`. - Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity. - Refactored related methods to align with the new class structure, improving maintainability and readability of the code. These changes enhance the structure and readability of the batch manager's decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Log DecodingMode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Introduce DecoderOutputBuffers and update related classes - Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose. - Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic. - Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability. - Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces. These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update MPI communicator handling - Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity. - Updated related methods and constructors to reflect the new parameter type, enhancing maintainability. - Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`. These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Replace shared_ptr with unique_ptr for buffer management - Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`. - This change enhances memory management and ownership semantics, reducing overhead and improving performance. These modifications contribute to a cleaner and more efficient architecture in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-22 12:25:53 +08:00
Zheng Duan	ae48abefc1	bind block key and hasher (#3712 )	2025-04-21 18:50:57 +08:00
Iman Tabrizian	af04b6f6aa	bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095 ) * Fix hang bug when KV cache is low Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Review comments Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Fix attentiondp typo Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Add CI test for this case Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * fix: Fix the insertion order for responder futures Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * fix: Fix disagg CPP Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> --------- Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-04-21 15:16:55 +08:00
Jinyang Yuan	bc2b01d1dd	chore: update FMHA cubin files (#3680 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-04-21 15:04:04 +08:00
katec846	eeb605abd6	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380 ) * Feat: Offload ptable to cpu if enable_chunk_context Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Feat: offload ptable to cpu for chunk context mode Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix and add comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update Readme for multimodal and add a new param mm_embedding_offloading Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * fix: Correct prompt table offloading condition in PromptTuningBuffers Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up the code Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Add commits to explain copy from cpu <-> gpu using pinned memory Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix namings based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix format based on precommit Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify --mm_embedding_offloading flag Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-21 14:31:01 +08:00
hlu1	31624b079a	feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387 ) * Add TRT-LLM Gen MOE to Deepseek fix fused moe rebase bug. Fix atol in test_fp4_gemm_quantize.py fix fused moe rebase bug. Fix FusedMoe. Disable 2nd routing kernel preexit Bump routing reduction to fp32 Disable PDL for fc1 [DEBUG] Lift token limit to 16k [Bugfix] Token limit to 16k + fp32 routing + tanh Make fp8 tileN 8 Fix FP8 MoE + Remove redundent temp output for FP4 [FP8-only] Avoid wasting CTAs for activation kernel fix: unblock FP8 weightloading with trtllm-gen Remove max_token limit for trtllm-gen path perf: avoid type-conversion and fill_ from aten Minor fix Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix rebase issues Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix compile issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * CI clean Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Hao Lu <haolu@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-21 10:01:33 +08:00
Dom Brown	dbd9a83b0d	feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582 ) * feat: Integrate GPUDirect Storage (GDS) into Executor API Squash of several dev commits Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-18 15:59:21 +08:00
Yuan Tong	0b0e6d8a0a	refactor: Clean up CMakeLists.txt (#3479 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-18 14:39:29 +08:00
Jackch-NV	1b2b112d44	fix sage attention headsize check error in bertAttentionPlugin.cpp (#3660 ) Signed-off-by: Jackch-NV <69230184+Jackch-NV@users.noreply.github.com>	2025-04-18 09:28:04 +08:00
Netanel Haber	3c52ac098f	feat: allocate minimal blocks per window size (#3028 ) * implement variable window attention by breaking the block manager into window block managers per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * revert isCyclic to be true if the min attention window is reached, not per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add explanatory comment to mCyclicThreshold Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * load correct gemma config Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * if TYPE_CHECKING Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * pass dtype as well Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * test_gemma variable sliding window attention Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove \|\| mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention window code Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * turn off request delaying for MaxUtil Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * make comments better Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * windowSizesTotalSum using std::accumulate Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comments Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove assert that kills disagg tests, since it isn't necessary Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. Main is correct Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add Gemma3 to SUPPORTED_HF_ARCHITECTURES Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * support Gemma3 Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix kvfactor field for deepseek Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix gemma-3 entries in testlist to include vswa Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * only quantize gemma2 VSWA Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> remove misleading comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix: disable KV cache reuse if using attention sink (#3021) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-17 16:04:57 +08:00
danielafrimi	0f084d9566	added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455 ) Loraop integration into torch modules Signed-off-by: Ubuntu <dafrimi@nvidia.com>	2025-04-17 12:48:27 +08:00
Void	950cadf2bd	add support for smaller hidden_dim (#3609 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-17 12:00:32 +08:00
Olya Kozlova	b3e6723dbc	feat: Adding FP8 BMM from Codegen (#3541 ) * Adding FP8 BMM from Codegen Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com> * Fixed licenses Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com> --------- Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com> Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com> Co-authored-by: Olya Kozlova <okozlova@6u1g-0018.nvidia.com> Co-authored-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com>	2025-04-16 10:37:15 +02:00
Robin Kobus	fffb403125	fix: disable KV cache reuse if using attention sink (#3021 ) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-16 03:07:32 +08:00
Kaiyu Xie	258ae9c58c	Revert "infra: move nvrtc_wrapper to conan (#3282 )" (#3573 ) This reverts commit `c0dd6cbce0`. Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-15 22:45:13 +08:00
jiahanc	1d3b98b920	perf: Optimize quantization kernels used in DeepSeek on Hopper (#3466 ) Signed-off-by: jiahanc <jiahanc@nvidia.com>	2025-04-15 17:49:57 +08:00
Robin Kobus	b7a38feb14	chore: Clean up cpp runtime (#3537 ) * add space in test output Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * perf: reduce executor lock scope Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move TokenRangeRetentionConfig implementation to cpp file Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Improve finished steps handling for external draft tokens - Fixed a bug where the whole finished steps tensor was being zeroes instead of the slices. - Replaced the creation of a temporary tensor for finished steps with a direct slice from the input tensor, improving efficiency and readability. - Updated the tensor management logic to streamline the process of setting zero values for finished steps during batch processing. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Clean up includes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-15 16:06:14 +08:00
shaharmor98	ede7058544	Feat/ Integrate peftCacheManager in PyExecutor creation (#3372 ) * integrate peftCacheManager in PyExecutor creation Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-15 15:14:43 +08:00
Pamela Peng	6cdfc54883	feat: Add FP8 support for SM 120 (#3248 ) * Allow FP8 on SM120 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix sm121 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix pre-commit Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * review update Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> --------- Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-04-14 16:05:41 -07:00
tburt-nv	c0dd6cbce0	infra: move nvrtc_wrapper to conan (#3282 ) * add pip scripts dir to path * move nvrtc_wrapper to conan * support building nvrtc wrapper from source --------- Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-15 05:31:01 +08:00
Robin Kobus	f58d4698c8	chore: Clean up cpp runtime (#3505 ) * chore: Remove unused tensors from DecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Remove unused argument from readme Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: remove unused tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove unnecessary newOutputTokens Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove unnecessary event in getDecoderSlotHostOutputs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-14 18:00:03 +08:00
yuxianq	9d64b6b890	Cache sin cos in model instead of global LRU cache. (#3378 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-14 11:19:09 +08:00
pcastonguay	fe6f14b2b1	fix: Fixing issue with first gen token being returned twice in streaming (#3427 ) * fix: Fixing issue with first gen token being returned twice with streaming Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing not_expectring_strings in test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-13 22:45:09 -04:00
William Tambellini	af67bf00a8	feat: register ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343 ) No change of default value (still ON). These were hidden cmake vars before that patch. Fix issue #3289 Signed-off-by: William Tambellini <wtambellini@sdl.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-14 10:30:23 +08:00
Chuang Zhu	75e13f4f88	chore: disable some env for disagg defaultly (#3415 ) * disable some env for disagg defaultly Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * doc Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * remove Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-14 10:08:10 +08:00
Chuang Zhu	6ee021a90d	chore: exchange connection id with tagSend/tagRecv (#3320 ) * exchange connection id with tagSend/tagRecv Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * unwaive Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * tag recv/send Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-14 09:30:34 +08:00
Aurelien Chartier	7b38018fa0	feat: Add numNodes to ParallelConfig (#3346 ) * Add numNodes to ParallelConfig If not provided, attempt to find the number of nodes by adding the number of local ranks 0 Update device IDs check accordingly Signed-off-by: Aurelien Chartier <achartier@nvidia.com> * Add ParallelConfig pickle test Signed-off-by: Aurelien Chartier <achartier@nvidia.com> --------- Signed-off-by: Aurelien Chartier <achartier@nvidia.com>	2025-04-13 13:55:04 +02:00
Robin Kobus	ceec4924d9	refactor: batch slot management in decoder classes (#3300 ) * refactor: batch slot management in decoder classes - Changed `forwardBatchSlots` from a single `TensorPtr` to a `std::vector<TensorPtr>` in `decoderBuffers.h` and updated its initialization in `decoderBuffers.cpp`. - Updated `batchSlots` in `iGptDecoderBatched.h` to a `std::vector<TensorPtr>` for better handling of batch sizes. - Modified `mBatchSlotsDecoder` in `statefulGptDecoderBatched.h` to use a `std::vector<TensorPtr>` and adjusted its initialization in `statefulGptDecoderBatched.cpp`. - Ensured proper reshaping of tensors in the setup methods to accommodate the new vector structure. These changes enhance flexibility in managing tensor buffers across different batch sizes. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Setup batch slots outside of the decoder - Refactored batch slot management to utilize `makeBatchSlots`, enhancing clarity and functionality in batch processing. - Introduced `DecoderState` to `MakeDecodingBatchInputOutput` for improved state handling during decoding. - Updated the `operator()` method to include `decoderState` as a parameter, facilitating better integration with the decoding process. - Modified related tests to accommodate changes in batch slot handling and ensure proper functionality. These updates improve the overall structure and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance decoder input structure with maxDecodingEngineTokens - Updated the `Input` class in `iGptDecoderBatched.h` to include a new parameter `maxDecodingEngineTokens` for better control over decoding limits. - Modified the `MakeDecodingBatchInputOutput` algorithm to compute the maximum number of decoding tokens based on active slots. - Adjusted the `GptDecoderBatched` class to utilize the new `maxDecodingEngineTokens` parameter, improving clarity in token management during decoding. - Updated Python bindings to reflect changes in the `Input` class constructor. - Enhanced tests to ensure proper handling of the new parameter. These changes improve the flexibility and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Streamline decoder input creation and batch slot management - Introduced a new function `createDecoderInputs` to encapsulate the logic for creating decoder inputs, improving code organization. - Updated the `operator()` method to utilize the new `createDecoderInputs` function, simplifying the decoding input setup process. - Removed the `maxOfActiveSlots` template function to streamline the logic for determining the maximum number of active decoding engine tokens. - Introduced a direct calculation of `maxActiveDecodingEngineTokens` within the `createDecoderInputs` function, enhancing clarity and reducing complexity. These changes enhance the maintainability and readability of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update logits handling in decoder batch - Modified the `decoder_batch::Input` to accept a vector of vectors for logits, enhancing flexibility in tensor management. - Adjusted the `createDecoderInputs` function to accommodate the new logits structure, ensuring proper batch processing. - Updated Python bindings to reflect changes in the `Input` class constructor, maintaining compatibility with existing interfaces. - Refactored the `GptDecoderBatched` and `StatefulGptDecoderBatched` classes to utilize the updated logits structure, improving clarity in tensor slicing and batch size management. - Enhanced tests to validate the new input structure and ensure correct functionality across various decoding scenarios. These changes streamline the decoding process and improve the overall maintainability of the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Rename maxDecodingEngineTokens to maxDecoderSteps - Updated the `Input` class in `iGptDecoderBatched.h` to rename `maxDecodingEngineTokens` to `maxDecoderSteps` for improved clarity. - Adjusted the `createDecoderInputs` function to reflect the new naming, ensuring consistency in the decoding process. - Modified the `GptDecoderBatched` class to utilize `maxDecoderSteps` in its logic, enhancing readability and maintainability. - Updated Python bindings to expose the renamed parameter, maintaining compatibility with existing interfaces. These changes enhance the clarity of the decoding parameters and improve the overall structure of the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: remove usage of `active` vector from prepareForward Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Removed the `active` vector from `decoder_batch::Input` - Removed the `active` vector from the `Input` class constructor in `iGptDecoderBatched.h`, streamlining the input handling for decoding. - Updated the `createDecoderInputs` function and related tests to reflect the changes in the `Input` class, ensuring compatibility and maintaining functionality. - Adjusted Python bindings to accommodate the new constructor signature, enhancing clarity in the interface. These changes improve the maintainability and readability of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: remove usage of `active` vector from gptDecoderBatchedTest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Unify the creation of decoder batch inputs in algorithm and tests - Added a new static method `createDecoderBatchInputs` to streamline the creation of decoder batch inputs, enhancing clarity and maintainability. - Updated the implementation to utilize active slots directly, simplifying the logic for managing batch slots and logits. - Refactored the `operator()` method to leverage the new input creation function, ensuring compatibility with existing decoding processes. - Enhanced tests to validate the new input handling approach, ensuring correct functionality across various scenarios. These changes improve the overall structure and readability of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: remove usage of active vector from createDecoderBatchInputs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update maxDecoderSteps calculation - Replaced integer division with `common::ceilDiv` for calculating `maxDecoderSteps` and `numDecoderSteps`, ensuring correct handling of token counts. These changes enhance the robustness of the decoding batch input creation process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-13 05:05:13 +08:00

1 2 3 4 5 ...

391 Commits