sunnyqgg
2e5850c28a
[TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference ( #7363 )
...
Signed-off-by: qgai <qgai@nvidia.com>
2025-09-26 11:28:05 +08:00
QI JUN
4c0f8482f1
[None][ci] Waive test_mm_encoder_standalone.py::test_multi_request_batch_chat[llava-v1.6-mistral-7b-hf] ( #8010 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-26 11:07:54 +08:00
Enwei Zhu
d650320de4
[None][infra] Improve the failure message for accuracy test suite ( #7994 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-26 10:04:47 +08:00
Yiqing Yan
108248ece1
[TRTLLM-7999][infra] Add B300/GB300 single gpu test ( #7951 )
...
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-09-26 09:59:11 +08:00
QI JUN
1529a6f22d
[None][chore] extract weights loading related logic to model loader ( #7579 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-25 10:19:22 -07:00
Emma Qiao
2dc93c6371
[None][infra] Waive failed tests on main ( #8001 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-25 08:13:39 -07:00
xxi
57ff5f4c0d
[None][fix] fix a bug in wideEp when using DeepEP with num_chunks > 1 ( #7954 )
...
Signed-off-by: xxi <xxi@nvidia.com>
2025-09-25 07:53:42 -07:00
Matthias Jouanneaux
eda1467061
[TRTLLM-5966][feat] Helix: add alltoall op ( #6815 )
...
Signed-off-by: Matthias Jouanneaux <mjoux@nvidia.com>
2025-09-25 07:18:29 -07:00
Guoming Zhang
202bed4574
[None][chore] Rename TensorRT-LLM to TensorRT LLM for source code. ( #7851 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5999fab146
[ https://nvbugs/5427043 ][fix] cherrypick: request length exceeds max_num_tokens ( #7718 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
cb466a846d
[None][fix] api stability bug in status label ( #7861 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Guoming Zhang
9f0f52249e
[None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … ( #7850 )
...
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
Yan Chunwei
5342c607cd
[ https://nvbugs/5516710 ][fix] fix Llama 3.3 TP PP case ( #7717 )
...
Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
2025-09-25 21:02:35 +08:00
xinhe-nv
e30d9aced9
[ https://nvbugs/4955671 ][fix] update test list ( #7980 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-25 02:58:09 -07:00
Chuang Zhu
791e73edf6
[ https://nvbugs/5536141 ][fix] fix_disagg_single_gpu_test ( #7990 )
...
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-25 02:07:22 -07:00
Emma Qiao
cb53261aaf
[None][infra] Unwaive some tests since dev already have a PR to collect more info ( #7984 )
...
Signed-off-by: qqiao <qqiao@nvidia.com>
2025-09-25 01:03:13 -07:00
fredricz-20070104
0945403174
[TRTLLM-6541][test] Add NIM perf test cases ( #7924 )
...
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
2025-09-25 13:15:26 +08:00
Iman Tabrizian
be7e51727e
[ https://nvbugs/5456485 ][bug] unwaive triton test ( #7966 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 17:02:55 -07:00
Iman Tabrizian
da30d496b0
[None][fix] Revert "[None][feat] Return topk logprobs in torch backend ( #7756 )" ( #7969 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-24 15:36:38 -07:00
sychen52
5a65af24cd
[OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels ( #7821 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
2025-09-24 12:14:35 -07:00
Mike Iovine
42c2ec3239
[ https://nvbugs/5473781 ][fix] Fix llama 4 FP8 for PP>1 ( #7220 )
...
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-09-24 12:16:27 -04:00
Pamela Peng
b1dc84b4a3
[TRTLLM-7399][test] Add DS-R1/Qwen3 test cases for RTX 6000 ( #7662 )
...
Signed-off-by: Pamela <179191831+pamelap-nvidia@users.noreply.github.com>
Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-09-24 11:40:26 -04:00
Yuxian Qiu
48fda86c56
[None][fix] Fix dummy load format for DeepSeek. ( #7874 )
...
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-09-24 23:03:16 +08:00
Eran Geva
603517f72a
[ #7675 ][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) ( #7888 )
...
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
2025-09-24 10:11:44 -04:00
HuiGao-NV
c8bda4b3a9
[None][ci] Waive some intermittent failures ( #7955 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 19:00:38 +08:00
Enwei Zhu
a1a57e83b8
[TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve ( #7925 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-24 18:30:23 +08:00
xinhe-nv
b8bfa63197
[None][chore] add test_w4_1gpu[True-True-cutlass-fp8] & TestKimiK2::test_fp8_blocks… ( #7944 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 03:25:17 -07:00
QI JUN
18ff1e31b8
[None][ci] remove duplicate test cases ( #7956 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-24 17:47:22 +08:00
yufeiwu-nv
f323b74d42
[None][test] Update llm_models_root to improve path handling on BareMetal environment ( #7876 )
...
Signed-off-by: yufeiwu <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: ruodil <200874449+ruodil@users.noreply.github.com>
2025-09-24 17:35:57 +08:00
HuiGao-NV
29e63d3bc2
[ https://nvbugs/5532248 ][fix] Fix fused_moe OOM ( #7931 )
...
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-09-24 02:22:38 -07:00
QI JUN
946ffcd2eb
[None][ci] optimize test cases of dgx b200 ( #7948 )
...
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-09-24 00:39:45 -07:00
Cao Dong
2f8dc6feb0
[None][feat] Return topk logprobs in torch backend ( #7756 )
...
Signed-off-by: Dong Cao <docao@nvidia.com>
2025-09-24 15:30:39 +08:00
xinhe-nv
62563760fb
[None][chore] update chunked prefill cases ( #7921 )
...
Signed-off-by: Xin He (SW-GPU) <200704525+xinhe-nv@users.noreply.github.com>
2025-09-24 15:14:49 +08:00
Pengbo Wang
b890d7fea4
[None][infra] Skip failed test for nvbugs 5537738 ( #7946 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 23:48:50 -07:00
Yueh-Ting (eop) Chen
cf100933cc
[TRTLLM-6341][feature] Support SWA KV cache reuse ( #6768 )
...
This merge request adds broader SWA KV cache support inside the KV cache
manager. Previously, the KV cache for sliding window attention (SWA) held
only "window size" blocks and reused them cyclically. That design cannot
use additional GPU memory, which limits the maximum batch size and
throughput, and it cannot support KV cache reuse.

This MR changes that behavior so the manager writes blocks linearly. With
linear block writing, blocks that fall out of the attention window
(out-of-window, OOW) are detached as the window advances. For now, to get
a correct feature first, OOW blocks are offloaded directly from the
primary block pool (GPU memory) to the secondary block pool (host memory);
a future change will delegate this block movement to the eviction policy.
KV cache reuse for SWA is not implemented in this merge request and will
be added in a follow-up.

With linear block writing, the maximum number of blocks allocated for a
sequence (`GenerationRequest`) is determined by the specified "max
sequence length": the `GenerationRequest` that stores the cache-block
bookkeeping structure now keeps enough blocks for "max sequence length"
tokens.

Given the above, the main changes are (more context in the MR):
- Remove the "cyclic" concept from the KV cache manager; it originally
  guarded block reuse in the manager.
- Add a detach mechanism, invoked from `KVCacheManager::addToken`. Note
  that detach is still disabled for SWA when reuse is enabled; a follow-up
  merge request will improve this.
- Make "max sequence length" a required parameter of
  `KVCacheManager`/`BlockManager`.
- Give all window-size resource pools an identical proportion of memory.
- Fix the free-memory calculation in `resource_manager.py`.
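Below is a minimal, self-contained sketch of the linear block writing and
OOW detachment described above. It is illustrative only, not the
TensorRT LLM implementation: `SWABlockBookkeeping`, its methods, and the
two in-memory pools are hypothetical stand-ins for the real
`KVCacheManager`/`BlockManager` machinery.

```python
from dataclasses import dataclass, field


@dataclass
class SWABlockBookkeeping:
    """Hypothetical per-sequence bookkeeping for SWA KV cache blocks."""
    tokens_per_block: int
    window_size_tokens: int
    primary_blocks: list = field(default_factory=list)    # in-window blocks ("GPU" pool)
    secondary_blocks: list = field(default_factory=list)  # detached blocks offloaded to "host" pool
    num_tokens: int = 0

    def add_token(self) -> None:
        # Write blocks linearly: allocate a new block whenever the previous one is
        # full, instead of cyclically overwriting the oldest in-window block.
        if self.num_tokens % self.tokens_per_block == 0:
            block_id = len(self.primary_blocks) + len(self.secondary_blocks)
            self.primary_blocks.append(f"block_{block_id}")
        self.num_tokens += 1
        self._detach_out_of_window()

    def _detach_out_of_window(self) -> None:
        # Detach blocks whose tokens have fallen out of the attention window and
        # offload them from the primary (GPU) pool to the secondary (host) pool.
        max_window_blocks = -(-self.window_size_tokens // self.tokens_per_block)  # ceil division
        while len(self.primary_blocks) > max_window_blocks:
            self.secondary_blocks.append(self.primary_blocks.pop(0))


seq = SWABlockBookkeeping(tokens_per_block=4, window_size_tokens=8)
for _ in range(20):
    seq.add_token()
# After 20 tokens, 5 blocks exist in total: 2 remain in the window
# (primary pool) and 3 have been detached to the secondary pool.
print(len(seq.primary_blocks), len(seq.secondary_blocks))  # -> 2 3
```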
Signed-off-by: eopXD <yuehtingc@nvidia.com>
Co-authored-by: Tomer Asida <tasida@nvidia.com>
2025-09-24 14:28:24 +08:00
Yuan Tong
70c3b100eb
[ #7692 ][fix] recognize RequestError as per-request error in background handler ( #7726 )
...
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-09-24 11:11:17 +08:00
Lizhi Zhou
e4f1f90202
[ https://nvbugs/5477404 ][chore] unwaive test_disaggregated_single_gpu.py::test_disaggregated_llama_context_capacity ( #7857 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 10:31:35 +08:00
Venky
6ff0fad75e
[TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend ( #7580 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-09-23 18:48:10 -07:00
Lizhi Zhou
7550251988
[TRTLLM-7182][test] add multi-nodes test for disagg-serving ( #7470 )
...
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
2025-09-24 08:31:56 +08:00
mpikulski
9970345919
[TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) ( #7294 )
...
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-09-23 16:05:05 -07:00
Zheng Duan
e3c1a9409f
[TRTLLM-6549][fix] add kv cache time output back ( #7798 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-23 14:12:42 -04:00
Yanchao Lu
6a36349964
[None][test] Waive another intermittent OOM test ( #7930 )
...
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
2025-09-23 22:34:09 +08:00
Zheyu Fu
34963ec39c
[None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off ( #7511 )
...
Signed-off-by: Zheyu Fu <zheyuf@NVIDIA.com>
2025-09-23 06:54:18 -07:00
ruodil
05bec3bf0f
[None][test] rename llm_perf_full to llm_perf_core and add missing cases ( #7899 )
...
Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
Co-authored-by: Ruodi Lu <ruodil@users.noreply.github.com>
2025-09-22 23:04:34 -07:00
Pengbo Wang
a4b4ed4535
[None][fix] Fix and add test for TRTLLM MoE backend ( #7755 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 11:26:25 +08:00
Pengbo Wang
08cc7a041f
[ https://nvbugs/5355128 ][fix] Add missing wgmma intrinsic for starcoder ( #7643 )
...
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
2025-09-23 10:38:58 +08:00
yunruis
126cd707e3
[None][opt] Add batch waiting when scheduling ( #7416 )
...
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-09-23 10:27:37 +08:00
Chang Liu
998857bcde
[TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) ( #7577 )
...
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-09-22 19:07:18 -07:00
Enwei Zhu
8330d5363a
[TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) ( #7893 )
...
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-23 09:10:09 +08:00
xxi
d471655242
[TRTLLM-7831][feat] Cherry-pick from #7423: support fp8 block wide ep ( #7712 )
2025-09-23 08:41:38 +08:00