Guoming Zhang
202bed4574
[None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. ( #7851 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
QI JUN
961418908c
[ https://nvbugs/5531963 ][fix] cherry pick #7725 ( #7907 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5999fab146
[ https://nvbugs/5427043 ][fix] cherrypick: request length exceeds max_num_tokens ( #7718 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
cb466a846d
[None][fix] api stability bug in status label ( #7861 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
9d48898def
[None][doc] add stable label to all the un-labelled arguments in LLM class ( #7863 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Zac Patel
c38d4cf6a6
[None][doc] Update Perf-Overview.md for release/1.0 ( #7848 )
...
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
57c098956e
[None][doc] add a guide for modifying APIs ( #7866 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … ( #7850 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
5ecc8d0ee2
[None][doc] Replace the main in the examples' link with commit id. ( #7837 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd
[ https://nvbugs/5516710 ][fix] fix Llama 3.3 TP PP case ( #7717 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Tao Li @ NVIDIA
44d7c3b245
[ https://nvbugs/1234567 ][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files ( #7813 )
...
Signed-off-by: Tao Li
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
4a09be40f0
[None][doc] Update docker cmd in quick start guide and trtllm-serve … ( #7787 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
xinhe-nv
e30d9aced9
[ https://nvbugs/4955671 ][fix] update test list ( #7980 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-25 02:58:09 -07:00
Chuang Zhu
791e73edf6
[ https://nvbugs/5536141 ][fix] fix_disagg_single_gpu_test ( #7990 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-25 02:07:22 -07:00
Jinyang Yuan
b622cde5d5
[None][perf] Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices ( #7419 )
...
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-09-25 10:27:57 +02:00
Emma Qiao
cb53261aaf
[None][infra] Unwaive some tests since dev already have a PR to collect more info ( #7984 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-25 01:03:13 -07:00
Wanli Jiang
22b45ff9c7
[TRTLLM-7758][feat] Phi4-mm image modality inference optimization ( #7918 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-09-25 15:58:29 +08:00
WeiHaocheng
259cc66c34
[None][doc] scaffolding tech blog part one ( #7835 )
...
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: zheyuf <zheyuf@NVIDIA.com>
Co-authored-by: zheyuf <zheyuf@NVIDIA.com>
2025-09-25 14:41:59 +08:00
fredricz-20070104
0945403174
[TRTLLM-6541][test] Add NIM perf test cases ( #7924 )
...
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-09-25 13:15:26 +08:00
Guoming Zhang
bb6067176f
[None][chroe] Update the cuda and tensorrt version in homepage icons. ( #7963 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-09-24 19:20:04 -07:00
Aurelien Chartier
98726a3bed
[None][chore] Update trtllm-bench documentation on setting FP8 KV cache ( #7885 )
...
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-09-25 09:28:53 +08:00
Void
336c2ef540
[None][feat] DeepEP LL fp8 dispatch/combine ( #7927 )
...
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-09-25 09:20:24 +08:00
Iman Tabrizian
be7e51727e
[ https://nvbugs/5456485 ][bug] unwaive triton test ( #7966 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 17:02:55 -07:00
Leslie Fang
342014069e
[None][chore] Validate features combination ( #7630 )
...
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
2025-09-25 08:01:13 +08:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend ( #7756 )" ( #7969 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels ( #7821 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Iman Tabrizian
6d45cd163e
[None][bug] Fix transformers version for Triton backend ( #7964 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 12:55:52 -04:00
Mike Iovine
42c2ec3239
[ https://nvbugs/5473781 ][fix] Fix llama 4 FP8 for PP>1 ( #7220 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Pamela Peng
b1dc84b4a3
[TRTLLM-7399][test] Add DS-R1/Qwen3 test cases for RTX 6000 ( #7662 )
...
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-24 11:40:26 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. ( #7874 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Macrocell
6e5e8b8a3b
[None][fix] fix get_iteration_stats IndexError ( #7216 )
...
Signed-off-by: yuhongwei <yumiao.yhw@antgroup.com>
Co-authored-by: yuhongwei <yumiao.yhw@antgroup.com>
2025-09-24 22:43:03 +08:00
Eran Geva
603517f72a
[ #7675 ][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) ( #7888 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
Yuan Tong
51bef1beb0
[None][chore] cleanup build script ( #7865 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 21:15:01 +08:00
Perkz Zheng
60101eb8a5
[None][fix] trtllm-gen cubins compiled with wrong arch. ( #7953 )
...
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-09-24 04:13:36 -07:00
HuiGao-NV
c8bda4b3a9
[None][ci] Waive some intermittent failures ( #7955 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 19:00:38 +08:00
Necofish
cfbcf9b9e8
[None][feat] Support Seed-OSS model in pytorch backend ( #7496 )
...
Signed-off-by: Nekofish-L <liuxiangyang@mail.ustc.edu.cn>
2025-09-24 03:57:12 -07:00
Enwei Zhu
a1a57e83b8
[TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve ( #7925 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-24 18:30:23 +08:00
xinhe-nv
b8bfa63197
[None][chore] add test_w4_1gpu[True-True-cutlass-fp8] & TestKimiK2::test_fp8_blocks… ( #7944 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 03:25:17 -07:00
QI JUN
18ff1e31b8
[None][ci] remove duplicate test cases ( #7956 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-24 17:47:22 +08:00
yufeiwu-nv
f323b74d42
[None][test] Update llm_models_root to improve path handling on BareMetal environment ( #7876 )
...
Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-09-24 17:35:57 +08:00
HuiGao-NV
29e63d3bc2
[ https://nvbugs/5532248 ][fix] Fix fused_moe OOM ( #7931 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 02:22:38 -07:00
JunyiXu-nv
6654b78c94
[ https://nvbugs/5521799 ][fix] Trim incorrectly generated harmony messages ( #7849 )
...
Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
2025-09-24 16:38:43 +08:00
Li Min
0252cee4c3
[None][chore] Recover cutlass-dsl pkg install and dsl op testing. ( #7945 )
...
Signed-off-by: Mindy Li <11663212+limin2021@users.noreply.github.com>
2025-09-24 15:45:18 +08:00
QI JUN
946ffcd2eb
[None][ci] optimize test cases of dgx b200 ( #7948 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-24 00:39:45 -07:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend ( #7756 )
...
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
xinhe-nv
62563760fb
[None][chore] update chunked prefill cases ( #7921 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 15:14:49 +08:00
qsang-nv
929ef4c474
[None][chore] remove cubins for ci cases ( #7902 )
...
Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
2025-09-24 14:56:31 +08:00
Pengbo Wang
b890d7fea4
[None][infra] Skip failed test for nvbugs 5537738 ( #7946 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 23:48:50 -07:00
xiweny
276d83c898
[ https://nvbugs/5532225 ] [fix] MoE use stream-dependent workspace ( #7940 )
...
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-09-24 14:44:27 +08:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse ( #6768 )
...
This merge request adds support for more SWA KV cache functionality
inside the KV cache manager. Before this merge request, the KV cache for
sliding window attention (SWA) held only "window size" worth of blocks
and reused them cyclically. That design cannot make use of additional
GPU memory, which limits the max batch size and therefore throughput.
It also cannot support KV cache reuse.
In this MR, we change this behavior so that the manager writes blocks
linearly. With linear block writing, out-of-window (OOW) blocks are
detached as the attention window moves on. For now, to land a correct
feature first, we directly offload each OOW block from the primary
block pool (GPU memory) to the secondary block pool (host memory). We
will improve this in the future by delegating block movement to the
eviction policy.
KV cache reuse for SWA is not implemented in this merge request and
will be added in a follow-up merge request.
With blocks written linearly, the maximum number of blocks allocated
for a sequence (`GenerationRequest`) is determined by the specified
"max sequence length". The `GenerationRequest`, which stores the cache
block bookkeeping structure, now keeps enough blocks for "max sequence
length" tokens.
Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; this concept
  originally guarded block reuse under the KV cache manager.
- Add a detach mechanism under `KVCacheManager::addToken`. Note that
  detach is still guarded off for SWA when reuse is enabled. A
  follow-up merge request will improve this.
- Make "max sequence length" a non-optional parameter of
  `KVCacheManager`/`BlockManager`.
- Give every window-size resource pool an identical proportion of
  memory.
- Fix the free memory calculation in `resource_manager.py`.
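The difference between cyclic and linear block writing described above
can be sketched in a few lines of Python. This is only an illustration
under stated assumptions, not the code in this MR: the class
`LinearSwaSequence`, the function `cyclic_block_index`, and all field
names are hypothetical, the two pools are plain lists, and a block is
treated as out of window once every token in it is older than the last
`window_size` tokens.

```python
from dataclasses import dataclass, field


def cyclic_block_index(token_idx: int, tokens_per_block: int, window_size: int) -> int:
    """Old scheme (sketch): only ceil(window / block) blocks exist and are reused."""
    blocks_in_window = -(-window_size // tokens_per_block)  # ceiling division
    return (token_idx // tokens_per_block) % blocks_in_window


@dataclass
class LinearSwaSequence:
    """Hypothetical per-sequence bookkeeping; not the real GenerationRequest."""
    tokens_per_block: int
    window_size: int  # attention window, in tokens
    num_tokens: int = 0
    primary: list = field(default_factory=list)    # block ids in the "GPU" pool
    secondary: list = field(default_factory=list)  # block ids offloaded to "host"

    def add_token(self) -> None:
        # Linear writing: a new block id is appended whenever a block fills up.
        if self.num_tokens % self.tokens_per_block == 0:
            self.primary.append(self.num_tokens // self.tokens_per_block)
        self.num_tokens += 1

        # Detach: any block that lies entirely before the window is moved
        # from the primary pool to the secondary pool.
        first_token_in_window = max(0, self.num_tokens - self.window_size)
        first_needed_block = first_token_in_window // self.tokens_per_block
        while self.primary and self.primary[0] < first_needed_block:
            self.secondary.append(self.primary.pop(0))


seq = LinearSwaSequence(tokens_per_block=4, window_size=8)
for _ in range(20):
    seq.add_token()
print(seq.primary, seq.secondary)
```

In the cyclic scheme, block indices wrap around, so old contents are
overwritten and cannot be reused later. In the linear sketch every block
keeps a unique index, which is what makes offloading (and, eventually,
reuse) possible.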
Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00