TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Zongfei Jing	53163bf1df	[TRTLLM-6876][feat] Add low precision all2all for mnnvl (#7155 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-08-28 18:26:16 +08:00
dongxuy04	abdb2735be	[None][fix] Fix possible hang issue in WideEP and move some tests to pre-merge (#7262 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-08-27 01:39:24 -04:00
Jin Li	028235404b	[TRTLLM-6633][feat] Padding for piecewise cudagraph (#6750 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-08-26 18:31:33 -04:00
Void	040f4c70d3	[None][perf] Accelerate global scale calculations for deepEP fp4 combine (#7126 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-08-27 00:13:13 +08:00
Zhou Yuxin	f01101f687	[None][feat] Hopper Fp8 context mla (#7116 ) Signed-off-by: Yuxin <yuxinz@nvidia.com>	2025-08-26 17:10:20 +08:00
qixiang-99	b165f8bc97	fix/improve kvcache allocation in PyTorch runtime (#5933 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-08-26 12:40:22 +08:00
Bo Li	bf1b958f1a	[TRTLLM-7319][perf] Fuse slicing into MoE. (#6728 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com> Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>	2025-08-25 16:52:30 -04:00
Robin Kobus	31979aefac	[None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests (#6754 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-24 20:53:17 +02:00
dongxuy04	19a0ea363b	[TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP (#6973 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com> Signed-off-by: Dongxu Yang <dongxuy@nvidia.com> Co-authored-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-08-24 08:15:29 -04:00
Robin Kobus	37543a9ad7	[None][refactor] Simplify decoder state initialization for speculative decoding (#6869 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-22 18:44:17 +02:00
Linda	898f37faa0	[None][feat] Enable nanobind as the default binding library (#6608 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-08-22 09:48:41 +02:00
dominicshanshan	6f245ec78b	[None][chore] Mass integration of release/1.0 (#6864 ) Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com> Signed-off-by: raayandhar <rdhar@nvidia.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Guoming Zhang <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Yechan Kim 
<161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com> Co-authored-by: Raayan Dhar <58057652+raayandhar@users.noreply.github.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-08-22 09:25:15 +08:00
Daniel Stokes	f7c597ec40	[None][perf] Make finalize fusion part of the tactic selection logic (#6915 ) Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>	2025-08-21 14:08:03 -07:00
brb-nv	9a2b44d0f2	[None][chore] No-op changes to support context parallelism in disaggregated serving later (#7063 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-08-21 08:21:27 -07:00
Yuan Tong	90bfc8cc29	[https://nvbugs/5453827 ][fix] Fix RPATH of th_common shared library to find pip-installed NCCL (#6984 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-08-21 17:58:30 +08:00
ChristinaZ	c7269ea93a	[https://nvbugs/5392414 ] [fix] Add customized default routing method (#6818 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-08-21 16:58:41 +08:00
Yao Yao	cbcea33279	[fix]: use safeInitRowMax instead of fp32_lowest to avoid NaN (#7087 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-08-20 22:12:21 -07:00
Fan - Yunfan	41ff4901ee	[None][fix] Fix const modifier inconsistency in log function declaration/implementation (#6679 ) Signed-off-by: fanyunfan <2569548856@qq.com> Co-authored-by: fanyunfan <2569658856@qq.com> Co-authored-by: Yunfan Fan <46273019+fyf2016@users.noreply.github.com>	2025-08-21 11:08:11 +08:00
BatshevaBlack	9f51f8d20c	[None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 (#7024 ) Signed-off-by: Batsheva Black <132911331+BatshevaBlack@users.noreply.github.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Bo Deng <deemod@nvidia.com>	2025-08-20 22:49:55 -04:00
Dom Brown	92daec1115	[TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper (#7035 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-08-20 10:11:25 -04:00
Yuhao Yao	8ac7dec623	[None][fix] Fix W4A8 MoE kernel issue (#7072 ) Signed-off-by: yuhyao <827623970@qq.com>	2025-08-20 06:52:47 -04:00
Yueh-Ting (eop) Chen	020fed97b6	[TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse (#6767 ) This MR is a preliminary MR for implementing the SWA reuse mechanism for the kv cache manager. Please be aware that no functional change is intended in this merge request. The purpose of the clean-up is to decouple and remove existing functions for the up-coming SWA KV cache reuse change to be more natural and easier to review. Right now, (1) streamLLM, and (2) beam search with SWA, are broken. We do not want to complicate the code base by stacking more features upon something that does not work. This MR prunes out the logic and add assertions so we can come back and re-support the broken feature and remove the assertion. Since streamLLM (sink attention) is broken now, assertion is added under `KVCacheManager` ctor to guard for the value of `mSinkBlockTokenLength` and `mSinkBubbleLength`. Compute logics relate to it are pruned. The beam search with SWA will still be broke when introducing the SWA KV cache reuse. We will revisit this problem in the future. On top of this, we should make an effort to update the [supporting matrix](https://github.com/NVIDIA/TensorRT-LLM/blob/feat/1.0_doc_dev/docs/source/1.0/features/feature-combination-matrix.md) of the kv cache manager after merging the support of SWA KV cache reuse. Changes are listed as following: - Separate `KVCacheManager::updateToken` into `KVCacheManager::addToken` and `KVCacheManager::removeToken`. The functionality should be decoupled. - Push utility `cacheSequenceBlockOffsets` and `cacheNewBlockOffset` from `KVCacheManager` down to `WindowBlockManager`. `KVCacheManager`-exposed functions should be real utilities that users of the structure can leverage. Implementation-detailed function calls should not exist at this level. - Simplify "is shared last context block" logic under `KVCacheManager::addSequence`. 
Since no functional change is intended in this merge request, no test case is added. Several comments are added for future test coverage reminder. For `LlmRequestTest.ParamTest`, `streaming=True` is commented out because we guard sink attention with assertion now. In `capacitySchedulerTest`, `addToken` action to `crossKVCacheManager` is removed because in encoder-decoder model, generation tokens are added only to the decoder and not to the encoder. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-08-20 13:57:57 +08:00
zhhuang-nv	7e135d2ea7	[None][feat] Use Separate QKV Input Layout for Context MLA (#6538 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-08-19 22:04:48 +08:00
amitz-nv	a54c53652b	[TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call (#6968 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-08-19 15:39:56 +03:00
Zero Zeng	953f4fd69e	[None][fix] acceptance rate calculation fix in benchmark_serving (#6746 ) Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>	2025-08-19 17:29:36 +08:00
Martin Marciniszyn Mehringer	425dad01fd	[None][fix] Clean up linking to CUDA stub libraries in build_wheel.py (#6823 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com> Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-08-18 11:20:51 -04:00
ChristinaZ	55f4f2d80c	[None] [fix] Fix the macro name (#6983 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-08-18 03:08:32 -04:00
ChristinaZ	1e72721e8c	[None][feat] Add single block version renormalized routing kernel (#6756 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-08-17 13:47:13 +08:00
bhsueh_NV	85cbd0263b	[None][feat] Support Yarn on Qwen3 (#6785 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-08-17 07:21:29 +08:00
Fan - Yunfan	22d59a6f61	[None][fix] Using RAII to automatically manage the allocation and release of va_list for potential resource leak (#6758 ) Signed-off-by: fanyunfan <2569548856@qq.com> Co-authored-by: fanyunfan <2569658856@qq.com> Co-authored-by: Yunfan Fan <46273019+fyf2016@users.noreply.github.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-08-16 15:19:19 +08:00
Yuening Li	1f8ae2b2db	[TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629 ) Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com>	2025-08-15 17:15:49 -04:00
yifeizhang-c	4127d77678	[https://nvbugs/5394392 ][fix] Enlarge scheduler capacity under disagg bs == 1 (#6537 ) Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>	2025-08-15 09:52:06 -07:00
Perkz Zheng	6037fe3716	[https://nvbugs/5394685 ][fix] proper fix for the accuracy issue in 2CTA MLA kernels (#6941 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-15 23:29:36 +08:00
peaceh-nv	1c1d5d2495	[https://nvbugs/5451373 ][fix] : Fix the accuracy issue when using FP8 context MLA (#6881 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-08-15 16:53:56 +08:00
Yanchao Lu	3a987891d8	[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures (#6836 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-08-15 11:16:07 +08:00
Wanli Jiang	9a133e9b41	[https://nvbugs/5415862 ][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 (#6501 ) Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>	2025-08-15 11:10:59 +08:00
Yunfan Fan	11d08c33af	[None][fix] Fix responsibility boundary between the assert and tllmException files (#6723 ) Signed-off-by: fanyunfan <2569548856@qq.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-08-15 10:34:49 +08:00
Perkz Zheng	11d89a3732	[https://nvbugs/5394685 ][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue (#6896 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-15 08:51:04 +08:00
jmydurant	4200fa46d1	[None][feat] Add support for Hopper MLA chunked prefill (#6655 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-08-14 10:39:26 +08:00
Linda	eb4ed18a63	[None][fix] max_num_sequences argument in nanobind (#6862 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-08-13 19:16:17 -04:00
Perkz Zheng	58f7783ea4	[https://nvbugs/5394685 ][fix] the bug with spec-decoding + SWA && an accuracy issue related to 2CTA MLA (#6834 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-13 13:55:56 -07:00
Tin-Yin Lai	6c52bb07ff	[https://nvbugs/5302040 ][feat] Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527 ) Signed-off-by: tinyinl <tinyinl@nvidia.com>	2025-08-13 11:19:13 -07:00
Perkz Zheng	0fad6029f7	[TRTLLM-7093][fix] the perf regression to cvt_fp4 kernels (#6851 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-08-13 19:13:40 +08:00
Void	1d80df0955	[None][feat] DeepEP LL combine FP4 (#6822 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-08-13 04:20:21 -04:00
Zhou Yuxin	50e5e725e9	[https://nvbugs/5412456 ][fix] Fix an illegal instruction was encountered (#6776 ) Signed-off-by: Zhou Yuxin <yuxinz@nvidia.com>	2025-08-13 15:45:59 +08:00
Robin Kobus	45c7518032	[None][refactor] Simplify decoder state initialization (#6559 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-12 21:44:41 +02:00
QI JUN	8845e0f065	[None][fix] fix ci (#6814 )	2025-08-12 02:21:50 -07:00
Liao Lanyu	f7c13a4aa7	[TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp (#6745 ) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>	2025-08-12 16:45:16 +08:00
Sergey Klevtsov	27fc35175e	[None][feat] CUTLASS MoE FC2+Finalize fusion (#3294 ) Signed-off-by: Sergey Klevtsov <sklevtsov@nvidia.com>	2025-08-12 15:56:48 +08:00
bhsueh_NV	83dbc6c75d	[TRTLLM-5532][feat] store the block of context request into kv cache (#6683 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-08-11 16:14:52 +08:00
Martin Marciniszyn Mehringer	9a8195ef88	fix: Ensure that Python stub generation works against libnvidia-ml stubs (#6188 ) Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>	2025-08-11 09:18:17 +02:00
Chuang Zhu	c566a8d2a2	[None][fix] fix same pp disagg (#6730 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-10 22:45:15 -04:00
Yueh-Ting (eop) Chen	199f306984	[None][chore][kv cache manager] Dead code elimination, we no longer record/fetch through WindowBlockManager:: mContextBlocksByHash (#6249 ) No functional change is intended in this MR. `WindowBlockManager::mCachedBlocksRoot` is now who is responsible for the bookkeeping of the `KVCacheBlock`, and the `mNextBlocks` is now the actual hash map that fetches the block. The `mEnableHashKey` knob and related hashing is removed. Signed-off-by: eopXD <yuehtingc@nvidia.com>	2025-08-10 09:10:10 -04:00
Ziyi Xiong	de472828b9	[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628 ) Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-08-09 23:15:04 +08:00
Chuang Zhu	e251f7c00b	[None][fix]revert kvcache transfer (#6709 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-08 07:18:53 -04:00
Zheng Duan	ebdc43e69d	[None][feat] move kv cache measure into transfer session (#6633 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-08-08 17:49:22 +08:00
NVJiangShao	2f2f5cc72c	[TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE (#6231 ) Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>	2025-08-08 11:13:42 +08:00
Daniel Cámpora	efca359b66	[TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default (#6216 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-08-07 22:19:37 -04:00
Iman Tabrizian	82276167e6	[None][feat] Add NCCL Symmetric Integration for All Reduce (#4500 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-08-07 17:28:14 -07:00
Yuan Tong	db8dc97b7b	[None][fix] Migrate to new cuda binding package name (#6700 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-08-07 16:29:55 -04:00
pcastonguay	453a06e6ab	[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-08-07 14:17:07 +02:00
Enwei Zhu	1b9781e8e7	[TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-08-07 05:53:48 -04:00
peaceh-nv	8ec3b1de10	[None][feat] : Add FP8 context MLA support for SM120 (#6059 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-08-07 16:16:34 +08:00
hlu1	8207d5fd39	[None] [feat] Add model gpt-oss (#6645 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-08-07 03:04:18 -04:00
amitz-nv	85af62184b	[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-08-07 09:05:36 +03:00
Chuang Zhu	ee471df07c	[None][chore] optimize kv cache transfer for context TEP and gen DEP (#6657 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-07 11:36:05 +08:00
Zongfei Jing	0ff8df95b7	[https://nvbugs/5433581 ][fix] DeepGEMM installation on SBSA (#6588 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-08-06 16:44:21 +08:00
Ransiki	19b7524ff6	[None][feat] Add vLLM KV Pool support for XQA kernel (#6013 ) Signed-off-by: Ransiki Zhang <ransikiz@nvidia.com>	2025-08-06 09:29:37 +08:00
ixlmar	1ebceb790d	[TRTLLM-5508][feat] check input tokens + improve error handling (#5170 ) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>	2025-08-05 18:27:43 +01:00
Haohang Huang	c9eebcb454	[TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379 ) Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com> Signed-off-by: symphonylyh <31998628+symphonylyh@users.noreply.github.com>	2025-08-05 07:47:41 +00:00
Chuang Zhu	4d040b50b7	[None][chore] ucx establish connection with zmq (#6090 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-05 02:50:45 -04:00
Olya Kozlova	13cc1c4878	[TRTLLM-5271][feat] best_of/n for pytorch workflow (#5997 ) Signed-off-by: Olya Kozlova <okozlova@nvidia.com>	2025-08-04 14:08:06 +02:00
Bruce-Lee-LY	8c82ee2803	[fix] xqa precision for fp16/bf16 kv cache (#6573 ) Signed-off-by: Bruce-Lee-LY <yong-li14@tsinghua.org.cn> Co-authored-by: Bruce-Lee-LY <yong-li14@tsinghua.org.cn>	2025-08-04 14:34:20 +08:00
Yuan Tong	a2f271c8e0	[TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory (#5034 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-08-04 13:51:01 +08:00
Perkz Zheng	03430ed379	[https://nvbugspro.nvidia.com/bug/5415268 ] fix illegal smem access with chunked attention (#6401 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-08-04 11:19:58 +08:00
Jhao-Ting Chen	6edaa23c1c	[None][feat] Multi-block mode for Hopper spec dec XQA kernel (#4416 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-08-03 14:31:33 -07:00
Chuang Zhu	542f552d0b	use cudaSetDevice to create context ,fix nvbug 5394497 (#6403 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-08-03 13:32:55 -04:00
Robin Kobus	918fedf952	[None][refactor] Simplify finish reasons handling in DecoderState (#6524 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-02 07:17:43 +02:00
yunruis	a20ab5cbdb	[https://nvbugs/5381276 ][fix] fix warning for fused_a_gemm (#6402 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-08-01 09:37:21 -04:00
Yang Li	ac23f4a80d	[TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops (#6515 ) Signed-off-by: Yang Li <56944310+yali-arch@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-08-01 15:59:09 +08:00
Robin Kobus	d3c14682f0	refactor: Remove unused buffers and bindings from sampler (#6484 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-08-01 00:43:03 -04:00
Yao Yao	942e080415	[fix] Fix missing fields in xqa kernel cache key (#6282 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-08-01 10:41:26 +08:00
Jaedeok Kim	fbee279909	fix: remove duplicate layer multiplication in KV cache size calculation (#6481 ) Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>	2025-07-31 22:34:34 -04:00
Michal Guzek	b8719fe96d	[nvbug/5374773] chore: Update nanobind with fail_fast_on_attention_window_too_large changes (#6491 ) Signed-off-by: Michal Guzek <mguzek@nvidia.com>	2025-07-31 20:25:29 +01:00
Enwei Zhu	4b299cb77e	feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-31 09:53:52 +08:00
pcastonguay	e7ae5e2824	feat: Add support for disaggregation with pp with pytorch backend (#6369 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Signed-off-by: raayandhar <rdhar@nvidia.com> Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Signed-off-by: pcastonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: raayandhar <rdhar@nvidia.com> Co-authored-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-07-30 09:42:13 -04:00
Zheng Duan	c9ed1ab436	[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure (#6135 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-07-30 10:39:40 +08:00
Yuan Tong	413a83ff80	fix: compatibility with CUDA < 12.9 on `__CUDA_ARCH_SPECIFIC__` macro (#5917 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-07-28 16:02:26 +08:00
Void	f172face98	DeepEP LL dispatch FP4 (#6296 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-07-28 11:25:42 +08:00
Yukun He	93a0fd0a23	[TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. (#6205 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-07-28 09:36:26 +08:00
Jhao-Ting Chen	54f68287fc	fix precompiled multi_query_token kernel not having is_fp8_out hash key (#6279 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-07-25 20:45:53 -04:00
Michal Guzek	08d57123f9	[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974 ) Signed-off-by: moraxu <mguzek@nvidia.com>	2025-07-25 18:10:40 -04:00
liji-nv	e07fff4f78	[https://nvbugs/5340941 ] - fix: Correct custom ops used by Qwen3 Moe … (#6285 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-25 14:49:45 +08:00
Linda	9a99e6d6d7	fix: integration tests with nanobind (#6326 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-25 09:23:20 +08:00
Shiyu Li	375f74ecb2	[fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. (#6237 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-07-25 08:01:40 +08:00
Bo Deng	ff72ca90de	Improve TransferAgentTest.SyncMessage (#6250 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-07-24 23:41:36 +08:00
Perkz Zheng	706f421cb0	[Fix] the bug in the trtllm-gen heurisitcf for MLA kernels. (#6284 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-07-24 23:40:27 +08:00
Zhenhua Wang	62298bc473	perf: customize cublastLt algo for Llamba 3.3 70B TP4 (#6315 ) Signed-off-by: Zhenhua Wang <zhenhuaw@nvidia.com>	2025-07-24 23:01:15 +08:00
Zhou Yuxin	0ffcf9a863	Update fmhaRunner.cpp to fix guardwords scan error (#6327 ) Signed-off-by: Zhou Yuxin <yuxinz@nvidia.com>	2025-07-24 18:32:36 +08:00
Zhou Yuxin	fca13b8c95	hopper-style context MLA (#5713 ) Signed-off-by: Yuxin <yuxinz@nvidia.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com> Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com> Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Rashid K <rkaleem@nvidia.com> Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com> Signed-off-by: Po-Wei Wang (Vincent) <poweiw@nvidia.com> Signed-off-by: Netanel Haber <nhaber@nvidia.com> Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com> Signed-off-by: Clay <ccs96307@gmail.com> Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com> Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com> Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Signed-off-by: Tailing Yuan <yuantailing@gmail.com> Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: peaceh 
<103117813+peaceh-nv@users.noreply.github.com> Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> Signed-off-by: jthomson04 <jwillthomson19@gmail.com> Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com> Signed-off-by: Julien Debache <julien.debache@hotmail.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com> Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com> Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: JieXin Liang <Alcanderian@users.noreply.github.com> Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com> Signed-off-by: Yegor Yershov <yegor6741@gmail.com> Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: raayandhar <rdhar@nvidia.com> Signed-off-by: Dom Brown 
<3886319+DomBrown@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: xsimmons <xsimmons@nvidia.com> Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com> Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com> Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com> Signed-off-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal> Signed-off-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com> Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com> Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com> Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Signed-off-by: narutolhy <582909902@qq.com> Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> Signed-off-by: William Tambellini <wtambellini@sdl.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: WeiHaocheng <20514172+WeiHaocheng@users.noreply.github.com> Co-authored-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com> Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> 
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com> Co-authored-by: Zhenhuan Chen <chenzhh3671@gmail.com> Co-authored-by: Po-Wei (Vincent) <poweiw@nvidia.com> Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com> Co-authored-by: Neta Zmora <nzmora@nvidia.com> Co-authored-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: Clay <ccs96307@gmail.com> Co-authored-by: Venky <23023424+venkywonka@users.noreply.github.com> Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com> Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: Linda <57756729+Linda-Stadter@users.noreply.github.com> Co-authored-by: Shunkangz <182541032+Shunkangz@users.noreply.github.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Tailing Yuan <yuantailing@gmail.com> Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com> Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com> Co-authored-by: ixlmar <206748156+ixlmar@users.noreply.github.com> Co-authored-by: HuiGao-NV <huig@nvidia.com> Co-authored-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Co-authored-by: Stefan Niebler <82932102+stnie@users.noreply.github.com> Co-authored-by: jthomson04 
<jwillthomson19@gmail.com> Co-authored-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Julien Debache <jdebache@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com> Co-authored-by: Daniel Stokes <40156487+djns99@users.noreply.github.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com> Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: DylanChen-NV <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Co-authored-by: davidclark-nv <215764518+davidclark-nv@users.noreply.github.com> Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: liji-nv <59594262+liji-nv@users.noreply.github.com> Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: xiweny <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Yegor <75512761+Wokzy@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Raayan Dhar <58057652+raayandhar@users.noreply.github.com> Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: xavier-nvidia <xsimmons@nvidia.com> Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com> Co-authored-by: 
Jhao-Ting Chen <jhaotingc@nvidia.com> Co-authored-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com> Co-authored-by: chenfeiz0326 <chenfeiz@nvidia.com> Co-authored-by: dongxuy04 <78518666+dongxuy04@users.noreply.github.com> Co-authored-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com> Co-authored-by: Hanjun Cho <46752251+gkswns0531@users.noreply.github.com> Co-authored-by: Ubuntu <ubuntu@ip-10-0-20-146.us-west-2.compute.internal> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com> Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com> Co-authored-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com> Co-authored-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: wili <98001977+wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: Void <18275976+yilin-void@users.noreply.github.com> Co-authored-by: William Tambellini <wtambellini@sdl.com>	2025-07-23 14:37:20 +08:00
Perkz Zheng	2193ad3aac	[https://nvbugs/5387771 ] fix deadlocks due to insufficient numSemaphores (#6262 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-07-23 11:20:55 +08:00
Linda	60073731ca	fix: bindings unit tests for nanobind (#6221 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-22 14:51:43 +01:00
WeiHaocheng	fddb7f1141	feat: moe prepare support topk % 4 != 0 (#5742 ) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-07-22 10:42:46 +08:00
Chang Liu	7381f1dba7	[TRTLLM-5059][feat] Add KV cache reuse support for multimodal models (#5444 ) Only supports qwen in this PR	2025-07-21 16:11:58 -07:00
Linda	3efad2e58c	feat: nanobind bindings (#6185 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-21 08:56:57 +01:00
Yuening Li	e8c068b4b1	[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow (#5850 ) Signed-off-by: Yuening Li <62227368+yueningl@users.noreply.github.com> Co-authored-by: Yuening Li <62227368+yueningl@users.noreply.github.com>	2025-07-21 15:17:35 +08:00
danielafrimi	5300a99bd8	W4A8 GEMM (#6005 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-07-20 17:34:57 +03:00
amitz-nv	98428f330e	[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616 ) Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>	2025-07-20 08:00:14 +03:00
Martin Marciniszyn Mehringer	943fd418dd	fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings (#6189 ) Signed-off-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>	2025-07-20 10:38:51 +08:00
bhsueh_NV	2e14c8f443	[Fix][Chore][Qwen3] fix bug of using fp4 on sm120 (#6065 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-07-20 10:25:25 +08:00
Void	118307c224	DeepEP LL support variable hidden size and tokens num (#6141 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-07-20 09:32:41 +08:00
Ziyi Xiong	66030ef815	[TRTLLM-6452][feat]: Two-model engine KV cache reuse support (#6133 ) Signed-off-by: ziyixiong-nv <fxiong@nvidia.com> Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>	2025-07-19 13:17:15 +08:00
Bo Deng	0388ff9083	[https://nvbugs/5393961 ][fix] record kv-cache size in MLACacheFormatter (#6181 ) Signed-off-by: Bo Deng <deemod@nvidia.com>	2025-07-19 05:06:45 +08:00
Stefan Niebler	d475c97c82	[nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound (#5926 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-19 01:54:51 +08:00
Stefan Niebler	6d7874a467	[nvbugs/5369799] fix: Update disaggregation handling in sampler (#5762 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-19 01:40:46 +08:00
Robin Kobus	ec2b953e7e	refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-18 12:12:08 +02:00
QI JUN	a95f31e72a	chore: add more log in FmhaDispatcher (#6170 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-18 16:53:02 +08:00
xavier-nvidia	200ea9ee81	fix TMA error with GEMM+AR on TP=2 (#6075 ) Signed-off-by: Xavier Simmons <xsimmons@nvidia.com>	2025-07-18 10:26:08 +08:00
yifeizhang-c	0155e7a3a1	[TRTLLM-6368] Update deepep dispatch API (#6037 ) Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>	2025-07-18 10:13:31 +08:00
Iman Tabrizian	b75e53ab69	Revert "feat: nanobind bindings (#5961 )" (#6160 ) Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-07-18 10:12:54 +08:00
Daniel Stokes	ae28b3a664	feat: Add support for benchmarking individual gemms in MOE benchmark (#6080 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-07-18 09:00:12 +12:00
Linda	5bff317abf	feat: nanobind bindings (#5961 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-17 22:42:52 +08:00
Enwei Zhu	21efb50068	[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler (#6000 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-17 17:46:10 +08:00
Chuang Zhu	44c70c88f9	chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-07-17 17:42:07 +08:00
ChristinaZ	7e033c392e	Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-07-17 12:38:29 +08:00
Shiyu Li	6e1aee6fd6	[fix] Performance Optimization for MNNVL TwoShot Kernel (#5934 ) Signed-off-by: Shiyu Li <shili@nvidia.com> Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-07-17 10:49:51 +08:00
qixiang-99	e09e409dfb	Fix: Enhance ModelConfig for kv cache size calculations (#5868 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-16 14:41:31 -07:00
qsang-nv	8ef8e73002	update spec_dec (#6079 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-07-16 17:50:43 +08:00
Tomer Shmilovich	0552a02943	BlockManager copy constructor fix (#5982 ) Signed-off-by: Tomer Shmilovich <tshmilovich@nvidia.com>	2025-07-16 17:33:17 +08:00
Bo Deng	ec3ebae43e	[TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 (#5991 ) Signed-off-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com> Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Bo Deng <deemod@nvidia.com> Co-authored-by: Rabia Loulou <174243936+rabial-nv@users.noreply.github.com> Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-07-16 13:54:42 +08:00
Zheng Duan	38db4bc7fb	feat: use session abstraction in data transceiver and cache formatter (#5611 ) Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>	2025-07-16 13:52:44 +08:00
Jinyang Yuan	e761231c0b	[fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-07-16 00:25:32 +09:00
Daniel Stokes	dd2491f47d	fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-07-15 13:40:42 +12:00
Daniel Stokes	f277afdd93	perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-07-14 14:04:15 -07:00
Robin Kobus	6d4b045d1f	refactor: Remove enforced sorted order of batch slots (#3502 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-14 17:23:02 +02:00
Perkz Zheng	4a0b7a0cf1	[https://nvbugspro.nvidia.com/bug/5355054 ] fallback to cubins for fp8 fmha kernels on Ada. (#5779 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com> Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> Co-authored-by: qsang-nv <200703406+qsang-nv@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
Yi Zhang	9cc4e5d50e	[nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation (#5463 ) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Signed-off-by: yizhan <187001205+yizhang-nv@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
Dom Brown	afaa388bee	[TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access (#5676 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-07-14 17:17:30 +08:00
dongxuy04	c04570a506	Use huge page mapping for host accessible memory on GB200 (#5963 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-07-14 16:11:04 +08:00
Enwei Zhu	ed77ef2ff4	fix: Fix MoE benchmark (#5966 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-14 15:17:26 +09:00
Yuan Tong	a36ac45c4d	fix: fast redux detection in trtllm gen routing kernel (#5941 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-07-13 16:35:07 +08:00
Enwei Zhu	bc1d4fb5da	[NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-07-12 15:50:31 +09:00
ChristinaZ	c5fb692a7d	Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-07-11 16:37:56 +08:00
Zhihan Jiang	682acd40da	[nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. (#5698 ) (#5925 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-07-11 07:51:43 +08:00
Linda	4d071eb2d1	feat: binding type build argument (pybind, nanobind) (#5802 ) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-07-11 00:48:50 +09:00
narutolhy	41ef1ade19	feat:enable kvcache to be reused during request generation (#4028 ) Signed-off-by: narutolhy <582909902@qq.com>	2025-07-10 22:18:01 +09:00
Jinyang Yuan	8b9a030a5c	[fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-07-10 20:07:32 +09:00
CarstyYou	dc32f9ae73	[fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531 ) Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>	2025-07-10 15:16:18 +08:00
Anthony Chang	7d21b55b5a	[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-07-10 14:06:50 +08:00
QI JUN	e289a98d5a	avoid nesting NCCL group in allgather and reduce scatter OPs (#5866 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-07-10 12:32:59 +09:00
peaceh-nv	76c3a12bcb	[fix] WAR to fix the illegal memory access issue in moe gemm on SM120 (#5636 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-07-10 09:20:30 +08:00
DylanChen-NV	74dca0aa7b	[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029 ) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-09 23:16:42 +08:00
peaceh-nv	52684d79f7	Fix : fix moe regression for sm120 (#5823 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-07-09 21:25:11 +08:00
Dom Brown	3e3b1769ad	[TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-07-09 08:21:58 +01:00
Jhao-Ting Chen	e4c777df7d	Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-07-09 09:26:27 +08:00
xavier-nvidia	b6013da198	Fix GEMM+AR fusion on blackwell (#5563 ) Signed-off-by: xsimmons <xsimmons@nvidia.com>	2025-07-09 08:48:47 +08:00
Pamela Peng	da8c7372d4	[TRTLLM-5366][feat]Add support for sm121 (#5524 ) Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Initial CI run failed a single step A30-CPP-3 due to timeout. Rerunning that step succeeded.	2025-07-08 14:27:00 -07:00
Tailing Yuan	ba0aea1da6	Fix a quote error introduced in #5534 (#5816 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-07-08 18:48:32 +08:00
xiweny	eaf8bec88b	fix: Disaggregate serving with attention DP (#4993 ) Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-07-08 16:15:03 +08:00
JieXin Liang	664bf95892	[fix] improve fp4_block_scale_moe_runner type check (#5681 ) Signed-off-by: JieXin Liang <Alcanderian@users.noreply.github.com> Co-authored-by: ChristinaZ <83400082+ChristinaZ@users.noreply.github.com>	2025-07-08 14:32:14 +09:00
davidclark-nv	a1235ee978	[feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces (#5743 ) Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com> Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-07-07 13:34:55 -07:00
Tailing Yuan	85b4a6808d	Refactor: move DeepEP from Docker images to wheel building (#5534 ) Signed-off-by: Tailing Yuan <yuantailing@gmail.com>	2025-07-07 22:57:03 +09:00
Daniel Cámpora	1260e2f33f	feat: Optimize TRTLLM Sampler perf single beam single step (#5550 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-07-07 15:44:47 +02:00
DylanChen-NV	5ca2b9bb15	[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615 ) Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-07-07 18:04:57 +08:00
ChristinaZ	12d8c7d129	Refactor the topk parallelization part for the routing kernels (#5567 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-07-07 15:53:25 +08:00
Daniel Stokes	ec6c7dff1a	feat: Add support for MXFP8xMXFP4 in pytorch (#5535 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-07-06 15:32:06 -07:00
Robin Kobus	ae27261094	refactor: decoding inputs (#5679 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-06 08:21:02 +02:00
Julien Debache	6bddaf6df6	chore: Improve documentation of Kv_block_array (#5765 ) Signed-off-by: Julien Debache <julien.debache@hotmail.com>	2025-07-05 22:25:27 +02:00
jthomson04	1b588f8390	feat: KV events for sliding window attention (#5580 ) Signed-off-by: jthomson04 <jwillthomson19@gmail.com>	2025-07-05 06:05:20 +08:00
Stefan Niebler	d1112aac37	[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333 ) Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>	2025-07-05 01:35:13 +09:00
Chuang Zhu	ffc0b8f5da	Cache transceiver support VSWA (#5505 ) Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-07-05 01:18:42 +09:00
Faraz	81c0764012	Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 (#5724 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com> Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>	2025-07-04 16:53:20 +09:00
Robin Kobus	07f9cf1519	fix: Improve chunking test and skip empty kernel calls (#5710 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-04 09:08:15 +02:00
Yuan Tong	32b244af38	feat: reduce unnecessary kernel generation (#5476 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-07-04 14:37:49 +08:00
Netanel Haber	134b2383ff	[fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window (#5720 ) Signed-off-by: Netanel Haber <nhaber@nvidia.com>	2025-07-04 08:16:25 +02:00
Robin Kobus	1a3bd140ed	chore: Remove unused isFullContextRequest method (#5666 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-03 15:08:09 +02:00
WeiHaocheng	dccbfc8b1e	fix: Set init value for moe expert id (#5660 ) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-07-03 07:05:31 -04:00
Jhao-Ting Chen	77082cde38	[https://nvbugspro.nvidia.com/bug/5329655 ] [feat] Pytorch path add spec dec param to attention op (#5146 ) Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>	2025-07-02 04:54:43 -04:00
Robin Kobus	4cd8543d8c	[TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest (#5489 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-02 10:13:31 +02:00
qixiang-99	ca7b6ec8d8	Feat/pytorch vswa kvcachemanager (#5151 ) Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-07-02 15:58:00 +08:00
Xiaowei Wang	32dfdfba30	feat: fuse w4a8 moe pre-quant scale on Hopper (#5613 ) Signed-off-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>	2025-07-01 23:02:41 -04:00
Void	7992869798	perf: better heuristic for allreduce (#5432 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-07-01 22:56:06 -04:00
liji-nv	c345f5876c	[feat] Support torch compile for attention dp (#5086 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-07-01 13:48:52 -04:00
Robin Kobus	d68fa728d8	refactor: Clean up DecodingInput and DecodingOutput (#5617 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-01 14:31:42 +02:00
Yan Chunwei	a5eff139f1	[TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) (#5431 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-07-01 19:06:41 +08:00
杨凯旋	61c5a53642	[#5403 ][perf] Conditionally enable SWAP AB for speculative decoding (#5404 ) Signed-off-by: zoheth <z0heth@outlook.com> Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-07-01 18:32:37 +08:00
Robin Kobus	5f77d212ef	test: Reduce number of C++ test cases (#5437 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-07-01 09:40:49 +02:00
danielafrimi	7a617ad1fe	feat: W4A16 GEMM (#4232 ) Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>	2025-07-01 10:36:05 +03:00
Li Min	16fc99391f	refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code (#5557 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-30 08:48:04 -07:00
Robin Kobus	9bdc5951f8	refactor: decoder state setup (#5093 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-30 11:09:43 +02:00
WeiHaocheng	42a9385d02	[TRTLLM-5331] perf: Replace allgather with AllToAllPrepare (#5570 ) Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>	2025-06-30 13:06:09 +08:00
Cheng Hang	64db7d27f6	[feat] Optimizations on weight-only batched gemv kernel (#5420 ) Signed-off-by: Cheng Hang <chang@nvidia.com>	2025-06-30 10:20:16 +08:00
Enwei Zhu	b4dab23e7b	[TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP (#5435 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-30 01:02:07 +08:00
Li Min	6021a439ab	Make moe permute and final as custom op (#5412 ) Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>	2025-06-27 15:48:33 -07:00
Daniel Stokes	5773cfdcf2	feat: Add support for per expert activation scaling factors (#5013 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-28 09:10:35 +12:00
Darragh Hanley	5437075def	ReDrafter support for Qwen (#4875 ) Signed-off-by: darraghdog <darragh.hanley@gmail.com> Signed-off-by: Darragh Hanley <darragh.hanley@gmail.com> Co-authored-by: rakib-hasan <rhasan@nvidia.com>	2025-06-28 02:33:10 +08:00
Robin Kobus	a8141a4513	refactor: Speculative decoding buffers part 2 (#5316 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-27 17:41:48 +02:00
Aurelien Chartier	833c0dea4a	[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-27 17:03:05 +02:00
wili	56cdfe5c6c	[TRTLLM-5000][feat] NGrams V2 (#4569 ) Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>	2025-06-27 23:00:17 +08:00
peaceh-nv	cb58073ab7	Fix : fix build for sm120 (#5265 ) Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>	2025-06-27 20:42:47 +08:00
ChristinaZ	a608b00d38	Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-27 20:17:40 +08:00
Daniel Stokes	83a1f60556	feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch (#5410 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-27 12:29:34 +08:00
Tailing Yuan	ef43b95aa1	Fix execute_process: check results using EQUAL (#5481 )	2025-06-27 11:57:04 +08:00
Anthony Chang	de7cd0de05	fix: MoE autotune fallback failed to query default heuristic (#5520 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-26 17:28:48 +01:00
jmydurant	8836990bde	[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) (#5475 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 22:18:08 +08:00
Robin Kobus	8dfa31c71d	refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 19:45:52 +08:00
Yao Yao	0788c5d0d6	[perf] improve XQA-MLA perf (#5468 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-06-26 18:09:13 +08:00
Bo Li	1bab9000a6	perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-26 14:03:56 +08:00
Alessio Netti	7e681fbe52	[chore] Allow configuring linking of NVRTC wrapper (#5189 ) Signed-off-by: Alessio Netti <netti.alessio@gmail.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-26 07:26:10 +02:00
dongxuy04	490d2e5819	feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) (#5226 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-25 22:25:13 -07:00
Daniel Stokes	942841417e	opensource: Opensource MOE MXFP8-MXFP4 implementation (#5222 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-26 12:18:19 +08:00
qsang-nv	e9cd810071	keep sm90 headsize 128 cubins (#5320 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-26 12:14:01 +08:00
ChristinaZ	d135f5993d	Add unit test for routing kernels (#5405 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-26 09:49:11 +08:00
jmydurant	578dbc8d9a	feat: chunked prefill for MLA (Blackwell) (#4651 ) Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>	2025-06-26 09:01:00 +08:00
Perkz Zheng	1f292ff2a0	[https://jirasw.nvidia.com/browse/TRTLLM-4645 ] support mutliCtasKvMode for high-throughput MLA kernels (#5426 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-25 16:31:10 +08:00
Enwei Zhu	fc7a81ceb0	test: Add LLGuidance test and refine guided decoding (#5348 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-25 14:12:56 +08:00
Robin Kobus	e2a8cbc80b	refactor: manage cache indirection in decoder state (#5315 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-24 09:15:59 +02:00
Robin Kobus	b3045c44b9	refactor: remove TrtGptModelOptionalParams (#5165 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-20 10:31:40 +02:00
dongxuy04	4f0f17ac8a	feat: Misc Opt for large scale EP (#5374 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-20 13:11:31 +08:00
Fanrong Li	5d4ab47d5b	fix: refactor and fix mtp vanilla (#4762 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-20 05:23:39 +08:00
Kaiyu Xie	113f6fbadd	Fix: missing clientId when serialize and deserialize response (#5231 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-06-19 23:05:11 +08:00
Fanrong Li	c7af650d5a	Fix: fix the deterministic issue in the MTP Eagle path (#5285 ) Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-06-19 18:08:40 +08:00
yunruis	b3e886074e	Fix CI build time increase (#5337 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-19 13:49:42 +08:00
jellysnack	0623ffe3bc	feat: Add LLGuidance Support for PyTorch Backend (#5214 ) Signed-off-by: jellysnack <oleg.jellysnack@gmail.com> Signed-off-by: jellysnack <158609015+jellysnack@users.noreply.github.com> Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com> Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-18 19:33:34 +08:00
Bo Li	d76bda7f2c	chore: Refine printed info of CHECK_TYPE. (#5295 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-18 15:35:41 +08:00
Yukun He	6711ad9cf3	[TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. (#5139 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-06-18 14:33:46 +08:00
Yao Yao	908463a5f5	[feat]: improve performance of XQA-MLA for sm120 (#5087 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-06-18 14:19:22 +08:00
Robin Kobus	627062c265	refactor: Update decoder buffer and logits management (#4450 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-18 08:10:32 +08:00
Emma Qiao	ff32caf4d7	[Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 (#4885 ) Signed-off-by: qqiao <qqiao@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Co-authored-by: Erin Ho <14718778+hchings@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-06-17 23:48:34 +08:00
qsang-nv	5236bb9084	delete cubins (#5274 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 22:10:49 +08:00
QI JUN	f899c4d294	Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224 ) Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-17 21:28:09 +08:00
Dom Brown	44fb3c1673	[TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner (#5207 ) - Adds a new Python custom op (fp8_block_scale_moe_runner) and a FP8BlockScaleMoERunner class for autotuning. - Updates C++ MoE and batched GEMM kernels to accept a configIndex for workspace sizing and execution. - Extends the unit test to run both autotuned and non-autotuned code paths. Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-17 21:01:56 +08:00
amirkl94	8451a87742	chore: Mass integration of release/0.20 (#5082 ) Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com> Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>	2025-06-17 14:32:02 +03:00
liji-nv	13eef642e6	[feat] Piecewise cuda graph support for MLA (#4467 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-17 18:58:38 +08:00
Robin Kobus	dc3861b4aa	refactor: Unify decoder test with e2e workflow (#5239 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-17 12:04:58 +02:00
qsang-nv	faca19c2f0	update setup.py for special cases (#5227 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 16:41:07 +08:00
qsang-nv	134cb66a53	fix mla test (#5240 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-17 15:26:25 +08:00
Enwei Zhu	4b82b8b4c7	[TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP (#5215 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-17 15:23:24 +08:00
Tracin	a2e8ae1120	Update internal cutlass commit. (#5228 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-17 10:47:45 +08:00
Robin Kobus	b6ca677741	refactor: remove decoder request from decoder interface (#5129 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 09:12:30 +02:00
Anthony Chang	4f9fa9f21d	feat: MoE trtllm backend kernel update (#5183 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-16 14:46:13 +08:00
Chuang Zhu	1d2b0d3d80	use file lock to avoid port conflict (#5123 )	2025-06-16 14:15:37 +08:00
Robin Kobus	dda64166cd	refactor: Scheduling based on KV cache state (#4865 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-16 08:14:58 +02:00
Tracin	ef3fdc8051	feat: Add w4a8_mxfp4_fp8 quantization recipe. (#4867 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>	2025-06-16 11:30:57 +08:00
qsang-nv	5a01ba5260	use cu for fmha_v2 (#4694 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-15 18:40:44 +08:00
Aurelien Chartier	1389f5a4d3	feat: Add support for fp8 rowwise quantization (#4876 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com> Co-authored-by: aikitoria <151776613+aikitoria@users.noreply.github.com>	2025-06-14 06:37:48 -07:00
Robin Kobus	443b2eb51f	refactor: Speculative decoding buffers (#5091 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-14 11:39:32 +02:00
yunruis	b99c5ce8c1	Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL (#4560 ) Signed-off-by: yunruis <yunruis@nvidia.com> Signed-off-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com> Signed-off-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com> Co-authored-by: kduan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-06-14 17:36:22 +08:00
dongxuy04	97657bfda2	optimize memset before alltoall communication (#5188 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-14 10:49:47 +08:00
Perkz Zheng	3d87770e15	[https://nvbugspro.nvidia.com/bug/5295470 ] support headDim 256 for blackwell fmha kernels (#5164 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-13 23:01:01 +08:00
Chuang Zhu	8e9937081d	ucxx only use ucp_feature_tag to aviod some issuse on some platform (#4994 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-13 19:14:25 +08:00
yunruis	e5be3a95b3	fix: fix license bug (#5200 ) Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>	2025-06-13 18:58:15 +08:00
yunruis	30c5b4183a	refactoring: port customized kernels with public cutlass version (#5027 ) Signed-off-by: yunruis Merge this to unblock others since the full CI has been run through	2025-06-13 16:19:31 +08:00
Yao Yao	12e075eb70	[nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime (#5133 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-06-13 15:53:29 +08:00
Matthias Jouanneaux	514baf1287	[fix] Fix comment to pass guardwords check (#5191 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>	2025-06-13 15:49:59 +08:00
zhhuang-nv	a891013e3c	[feat] Optimize KV Cache Reuse for MLA (#4869 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-06-13 11:03:05 +08:00
Matthias Jouanneaux	a0b6c635b1	[feat] trtllmGen MoE routing: added support for top groups and top K bounds (#4063 ) Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com> Co-authored-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>	2025-06-13 06:00:02 +08:00
Xiaodong (Vincent) Huang	cc2a1344be	None: fix OOM because of unnecessary mha workspace (#5056 ) Signed-off-by: Vincent Huang <vincenth@nvidia.com>	2025-06-12 21:56:05 +02:00
liji-nv	10ab9791ec	[fix] Do not reuse dummy request KVCache (#4804 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-12 15:24:50 +08:00
Netanel Haber	e692779ead	Solve underallocation in VSWA+/VGQA (#4667 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-12 12:12:46 +08:00
HuiGao-NV	43192379af	Use backend to replace macro to control enablement of MNNVL all reduce (#4635 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-06-12 11:22:49 +08:00
Zheng Duan	ee44fa00f8	chore: rename IOFormatter to BaseCacheFormatter (#5068 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-12 10:50:14 +08:00
Bo Li	1b79041f5d	fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. (#4264 ) Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-06-11 09:38:10 +08:00
Tracin	6c91f1c7ac	Mxfp8xmxfp4 quant mode(#4978 ) Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-06-10 22:01:37 +08:00
Zongfei Jing	6d1f2d0fd7	[TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion (#4756 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-06-10 19:55:16 +08:00
Aurelien Chartier	dcf72c6ad3	chore: cleanup GDS Cmake interface (#4928 ) Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-06-10 17:25:43 +08:00
dongxuy04	7137cc8f67	fix cuda driver link issue with driver version less than 12.3 (#5025 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-06-10 15:27:39 +08:00
pcastonguay	87c56ab024	perf: Removing initializing ptuning buffers to zero (#4915 ) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-06-09 21:57:21 -04:00
Daniel Cámpora	d68b8180d3	feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-10 07:28:34 +08:00
Chang Liu	f70815c945	[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145 ) Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com> Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>	2025-06-10 01:59:56 +08:00
Dom Brown	9c012d5bf8	[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner (#4872 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-06-09 11:02:48 +01:00
liji-nv	1d4f748773	[fix] Fix illegal mem access and possible accuracy lose. Cherry-pick … (#5017 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-06-09 17:50:57 +08:00
ChristinaZ	f45aff2b7d	Add customized renormalized moe routing kernel for moe cutlass backend (#4955 ) Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>	2025-06-09 17:38:50 +08:00
Chuang Zhu	9a874760c1	Kv cache transfer support duplicate heads (#4929 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-09 14:11:19 +08:00
Chuang Zhu	947571c311	Fix buffer count (#5007 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-06-09 14:01:13 +08:00
Daniel Stokes	3a4851b7c3	feat: Add Mixture of Experts FP8xMXFP4 support (#4750 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-06-09 13:25:04 +08:00
Omer Ullman Argov	8731f5f14f	chore: Mass integration of release/0.20 (#4898 ) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com> Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com> Signed-off-by: ruodil <200874449+ruodil@users.noreply.github.com> Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com> Signed-off-by: moraxu <mguzek@nvidia.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com> Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com> Co-authored-by: HuiGao-NV <huig@nvidia.com> Co-authored-by: brb-nv <169953907+brb-nv@users.noreply.github.com> Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com> Co-authored-by: Bo Li <22713281+bobboli@users.noreply.github.com> Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com> Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com> Co-authored-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Faraz <58580514+farazkh80@users.noreply.github.com> Co-authored-by: Michal Guzek <moraxu@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>	2025-06-08 23:26:26 +08:00
dongxuy04	1e369658f1	feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818 ) Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-06-08 10:25:18 +08:00
Jinyang Yuan	20d0649f19	[feat] Support XQA-based MLA on SM120 (#4858 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com> Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com> Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com> Co-authored-by: Yao Yao <lowsfer@users.noreply.github.com> Co-authored-by: peaceh-nv <103117813+peaceh-nv@users.noreply.github.com>	2025-06-06 22:32:49 +08:00
Anthony Chang	eeb555e37b	chore: memoize weight shuffle index to speed up weight preproc in moe_backend=TRTLLM (#4826 ) Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>	2025-06-06 16:13:54 +08:00
qsang-nv	180b91f957	update fmha_v2 (#4895 ) Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>	2025-06-05 22:14:28 +08:00
dongjiyingdjy	51652b9b2b	feat : add PositionEmbeddingType=0 to xqa support (#4934 ) Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>	2025-06-05 21:50:42 +08:00
Shiyu Li	b0d287c9b7	[TRTLLM-4647][fix] Fix the no fusion allreduce hanging (#4594 ) Signed-off-by: Shiyu Li <shili@nvidia.com>	2025-06-04 18:26:13 -07:00
Zheng Duan	dd2191c5b3	fix: correct the order of llm request state (#4781 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-04 14:45:13 +08:00
Omer Ullman Argov	e71de2a13e	chore: Mass integration of release/0.20. (#4871 ) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Omer Ullman Argov <118735753+omera-nv@users.noreply.github.com> Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>	2025-06-04 14:12:27 +08:00
Zheng Duan	ded694b1aa	feat: cache reuse support (selective cache transfer) in mla cache formatter (#4749 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-06-04 09:56:31 +08:00
ChristinaZ	d64af85e8c	Replace memset with data initialization within kernels (#4851 ) Signed-off-by: Christina Zhang <christinaz@nvidia.com>	2025-06-04 08:56:46 +08:00
Perkz Zheng	a089aa3225	[https://nvbugspro.nvidia.com/bug/5300080 ] Fix the bug of setting attention_chunk_size and enable chunked-attention in the generation-phase by default (#4693 ) Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-06-03 19:02:57 -04:00
Nikita Korobov	8043d7a03c	feat: update DeepSeek FP8 TRT-LLM Gen cubins (#4643 ) Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>	2025-06-03 14:07:54 -07:00
Robin Kobus	3de02582dd	refactor: Separate DecoderState from GptDecoderBatched (#4700 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-06-03 09:42:01 +02:00
Robin Kobus	b9263a8e10	fix: max_num_sequences calculation with overlap scheduling (#4532 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-03 09:31:22 +02:00
Tian Zheng	9832787050	[feat] Enable NVFP4 output for TRTLLM attention kernels (#4737 ) Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>	2025-06-03 10:00:17 +08:00
Yilin Fan	90aab0596e	[fix] Fix Llama4 guradwords failures (#4844 ) Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>	2025-06-02 13:43:42 -07:00
Enwei Zhu	5b4852b7b5	feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-02 01:25:02 +08:00
Netanel Haber	2ce05c3ab4	'entered copyBlock' format string expects %s, pass string rather than int (#4820 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-06-01 08:54:33 -07:00
tomeras91	bf9cd11fd4	[TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H (#4494 ) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>	2025-06-01 13:56:44 +03:00
Daniel Cámpora	69c7fe8905	[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-06-01 03:32:43 +08:00
Enwei Zhu	25dde49c28	fix: EP load balancer with MTP layer and route offset by EP rank (#4767 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-06-01 00:07:44 +08:00
Dom Brown	338d6e9f95	[nvbug 5305210] fix: Resolve nvbug 5305210 (#4759 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-05-31 19:21:06 +08:00
Chuang Zhu	f117d6abe9	Fabric Memory for KV Cache Transfer (#4717 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-30 15:50:21 +08:00

867 Commits