Void
118307c224
DeepEP LL support variable hidden size and tokens num ( #6141 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-07-20 09:32:41 +08:00
Pengyun Lin
69e9f6d489
[fix]: Skip prompt length checking for generation only requests ( #6146 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-19 21:26:37 +08:00
Ziyi Xiong
66030ef815
[TRTLLM-6452][feat]: Two-model engine KV cache reuse support ( #6133 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-07-19 13:17:15 +08:00
Rashid Kaleem
152e2df43b
[Disaggregated] Add retry knobs and handling ( #5808 )
...
Signed-off-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-19 07:27:59 +08:00
John Calderon
fc8b29c4ff
[Issue 5927][fix] Avoid memory calls during broadcast for single GPU ( #6010 )
...
Signed-off-by: John Calderon <johncalesp@gmail.com>
2025-07-18 14:21:03 -07:00
Netanel Haber
d9a3530048
[nvbug/5393888][nvbug/5393042] Always use py_seq_slot ( #6147 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-07-18 22:45:16 +03:00
Stefan Niebler
6d7874a467
[nvbugs/5369799] fix: Update disaggregation handling in sampler ( #5762 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-19 01:40:46 +08:00
xiaoqi
28858c8711
feat(eagle3):support qwen3 dense model ( #5879 )
...
Signed-off-by: xq25478 <xq25478@qq.com>
2025-07-19 01:24:32 +08:00
Bo Li
07e8813984
feat: Remove padding in attention DP. ( #6064 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-18 23:30:34 +08:00
Stefan Niebler
fd6ce7f20e
[ci] Speedup beam search unit tests with fixtures for LLM ( #5843 )
...
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-18 22:54:49 +08:00
Erin
9522cde464
fix: NVBug 5385576 py_batch_idx issue ( #6153 )
...
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-18 22:36:43 +08:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager ( #6055 )
...
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Aurelien Chartier
812243bdd6
feat: add support for Modelopt fp8_pb_wo quantization scheme ( #6106 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-07-18 10:35:12 +08:00
Zhenhuan Chen
992b273045
[ https://nvbugs/5387375 ] fix(scaffolding): fix scaffolding aime test in test_e2e ( #6140 )
...
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-18 10:34:37 +08:00
yifeizhang-c
0155e7a3a1
[TRTLLM-6368] Update deepep dispatch API ( #6037 )
...
Signed-off-by: Yifei Zhang <219273404+yifeizhang-c@users.noreply.github.com>
2025-07-18 10:13:31 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings ( #5961 )" ( #6160 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
qixiang-99
2c90203c36
Refactor KVCacheManager: Simplify token availability calculation and … ( #6134 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-17 13:33:33 -07:00
Frank
161490f039
[fix] Fixes KV Cache overrides in trtllm-bench ( #6103 )
...
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-07-18 03:44:44 +08:00
2ez4bz
8480c120b1
[fix] Fix Mistral3VLM weight-loading & enable in pre-merge ( #6105 )
...
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
2025-07-17 11:04:17 -07:00
Iman Tabrizian
10dbf4f0f4
[fix] Remove duplicated KVCache transmission check ( #6022 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-17 12:02:19 -04:00
Linda
5bff317abf
feat: nanobind bindings ( #5961 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Ziyi Xiong
58d22a72f1
[TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter ( #6007 )
...
Signed-off-by: ziyixiong-nv <fxiong@nvidia.com>
2025-07-17 21:15:01 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler ( #6000 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service ( #5234 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
Iman Tabrizian
d4d21a106e
[fix] Release slots with spec decode + disagg ( #5975 ) ( #6032 )
...
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-17 12:58:18 +08:00
Shiyu Li
6e1aee6fd6
[fix] Performance Optimization for MNNVL TwoShot Kernel ( #5934 )
...
Signed-off-by: Shiyu Li <shili@nvidia.com>
Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-07-17 10:49:51 +08:00
chenfeiz0326
fe070a0168
test: Update Llama4 Scout FP4 & FP8 accuracy tests ( #5901 )
...
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
2025-07-17 09:41:18 +08:00
Wanli Jiang
2d2b8bae32
feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support ( #5644 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-17 06:30:58 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations ( #5868 )
...
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
Mike Iovine
fa34cb7234
[refactor] Clean up drafter/resource manager creation logic ( #5805 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-16 12:45:46 -07:00
shaharmor98
e0836f9ca9
[TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats ( #5372 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-17 00:50:30 +08:00
Wanli Jiang
9354114f68
fix: Update trtllm args issues with extra nested config ( #5996 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-16 12:41:45 -04:00
Bo Li
fc2347eaf5
chore: Cleanup disable_fp4_allgather. ( #6006 )
...
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
2025-07-16 17:54:36 +08:00
Yan Chunwei
a02606a9e2
[TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend ( #5752 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:42:59 +08:00
Yan Chunwei
7568deb2f1
[nvbug/5387226] chore: add propogation for trust_remote_code to AutoConfig ( #6001 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:05:38 +08:00
Yiqing Yan
e51c541617
chore: Bump version to 1.0.0rc4 ( #6086 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-16 13:02:23 +08:00
Wanli Jiang
8679a058a3
fix: Unable to load phi4-model with tp_size>1 ( #5962 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-16 11:39:41 +08:00
danielafrimi
edab7532dd
feat/add latency support for trtllm bench ( #3730 )
...
Signed-off-by: Ubuntu <dafrimi@nvidia.com>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Daniel Afrimi <dafrimi@nvidia.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-07-15 13:13:49 -07:00
Fanrong Li
7a1af1c738
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/5947 ( #5989 )
...
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-07-16 01:33:12 +09:00
Xiaodong (Vincent) Huang
0523f77b36
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-con… ( #5684 )
...
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
2025-07-15 18:34:21 +03:00
Tailing Yuan
4a26bd6500
Fix: pad DeepEP fp4 recv tensors if empty ( #6048 )
...
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
2025-07-15 23:14:01 +09:00
MinaHuai
9ebc3ab9c4
[nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision ( #5998 )
...
Signed-off-by: Mina Huai <121143971+MinaHuai@users.noreply.github.com>
2025-07-15 10:01:35 -04:00
Jaedeok Kim
ab1c54709d
fix: adjust window sizes of VSWA at torch backend ( #5880 )
...
Signed-off-by: Jaedeok Kim <jaedeokk@nvidia.com>
2025-07-15 17:41:54 +08:00
nv-guomingz
4e4d18826f
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… ( #6003 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-15 15:50:03 +09:00
Lucas Liebenwein
e499f6c44a
[Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils ( #6026 )
...
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-07-15 14:31:35 +09:00
Rashid Kaleem
2ea4077993
[Model load] Fix llama min-latency model load ( #5883 )
...
Signed-off-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
2025-07-15 09:29:19 +08:00
ixlmar
f225f5cd2e
[nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs ( #5964 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-07-15 06:49:42 +08:00
brb-nv
f5f5be9e94
enh: Bidirectional mask with multiple images for Gemma3 ( #5976 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 22:39:18 +08:00
brb-nv
1a2d96919c
feat: Update Gemma3 Vision Encoder ( #5973 )
...
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-07-14 22:38:10 +08:00
Yechan Kim
63139fdcff
feat: EXAONE4.0 support ( #5696 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-14 22:28:10 +09:00